Okta Outage (okta.com)
130 points by hunter2_ on Dec 15, 2021 | 82 comments


okta passport is such garbage. it's an absolute steaming pile of garbage sold to unwitting enterprise clients who can't identify bad technology.

why do i say this? At least as of Oct 2021 Okta didn't even have a complete compositional API to set up accounts -- thus requiring a 'josh-api', literally a guy named josh to manually provision new customer accounts by hand. the latency on those api requests was immense.

the company i was working at still depended on java 8, never budgeted time for refactor or maintenance, and had a plurality of other horrible dev practices they justified (calling themselves agile, devops, but not actually doing any of those things properly)

i was still in my probationary period when they told me they were going to roll out Okta to all clients early in 2022 and charge for SSO so they could line their pockets. i gave my notice the next day (for many reasons, including okta). josh also gave notice and left on the same day.

given log4j, etc. this has probably been an extremely bad week there.


I’m actively looking now for an SSO solution, SMB scale. What’s the one that doesn’t suck?


Keycloak[1] is self-hosted and widely used. Kratos[2] is also self-hosted but API-only; on the plus side, it won't have a problem with the "josh-api" that GP described above.

[1] https://github.com/keycloak/keycloak

[2] https://github.com/ory/kratos


Is JumpCloud any good? (not a recommendation)


@PostMapping(value="/createAccount")
public String createAccount(String user, String pass){
   var hashmap = readUserMapFromDisk(MYUSERS.PATH_TO_TXT_FILE);
   hashmap.putIfAbsent(user, pass);
   writeToDisk(hashmap, MYUSERS.PATH_TO_TXT_FILE);
   return "All done " + LocalDateTime.now();
}

EDIT: Just imagine the string args are in a more complex object that can be serialized to JSON


What is Okta Passport?


I know there are complexities involved, but auth is one of those things that needs to be very insulated from single region issues. How long was Okta not working for customers?


Someone should tell Amazon... IAM is single homed to us-east-1 AFAIK.


(I used to work at aws)

The control API (i.e. adding/removing roles, modifying policies, etc.) is available out of us-east-1. However, the bits of IAM that relate to distributing credentials to instances/tasks/lambdas and STS are all regionalized and isolated.


So...parent is correct.


Disclaimer: Former AWS engineer, never worked on IAM directly.

AWS is divided into multiple partitions. For the vast majority of users, there is one partition - the regular commercial - other partitions being China, GovCloud, etc.

Within each partition, there is a primary region that needs to be available for creation/mutation of credentials and policies. However, that data is replicated to other regions within the partition. That means the use of credentials that exist does NOT depend on the primary region being available. The replication is something that is closely monitored, and SLA breaches will result in pages.


How about credential revocation, is that dependent on the health of the primary region?


Not really. The parts that can take your app down are distributed.


no, because most applications don't have an online dependency for creating roles and modifying policies. what they do typically have an online dependency for is provisioning credentials from those roles, which is architected to be regionally independent.
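
To make the split described above a bit more concrete: STS exposes per-region endpoints alongside the global one (which, per AWS docs, is served from us-east-1), so credential calls can stay inside a single region. A minimal Python sketch, assuming the commercial partition only; `sts_endpoint` is a hypothetical helper for illustration, not part of any AWS SDK.

```python
from typing import Optional

# Global STS endpoint (commercial partition).
GLOBAL_STS = "https://sts.amazonaws.com"


def sts_endpoint(region: Optional[str] = None) -> str:
    """Return the STS endpoint URL for a commercial-partition region.

    Falls back to the global endpoint when no region is given.
    Other partitions (China, GovCloud) are deliberately not handled here.
    """
    if not region:
        return GLOBAL_STS
    return f"https://sts.{region}.amazonaws.com"


if __name__ == "__main__":
    print(sts_endpoint("us-west-2"))
    print(sts_endpoint())
```

An application that only ever talks to its own region's endpoint for credentials has no online dependency on the primary region, which is the point the comments above are making.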


If you only knew how bad modern software infrastructure really was


Yeah, for a similar vendor, it's interesting to read this page and look at the diagram:

https://auth0.com/availability-trust

And then read this tweet:

https://twitter.com/auth0/status/1471159935597793290

Edit: Ah, seems they picked us-west-1 and us-west-2 as the two regions..."In this case, we use two AWS regions: us-west-2 (our primary) and us-west-1 (our failover)."[1] So bit by a double-region failure.

[1] https://auth0.com/blog/auth0-architecture-running-in-multipl...


Now that Okta bought Auth0 what's the developer experience like I wonder? I imagine the infrastructure of the two products is still completely isolated. But is it still a separate product you can use for identity management or are new customers forced to use Okta?


Auth0 customer: they kept everything isolated (and promised it will stay like that for a while)


> for a while

What else is everyone using?

Any thoughts on how to future proof this?


Looks like future-proof already, for a while.


It's still running on a Perl script I wrote over a decade ago.


Looks like about 45 mins


Seconded


Cisco Duo SSO/MFA was also out this morning during the AWS outage. I guess usable redundancy for these services is a difficult problem.


So glad that half my company's internal stuff falls apart when some external service goes down... I couldn't even submit tickets to my company's helpdesk about the other networking issues I was having. What a joke.


But, isn't that what everyone on hackernews keeps advocating for?

The notion of self-hosting is that you have to hire expensive operations staff and maybe you have sub-par experience due to lack of investment.

There is often an argument about "lack of core competency" too.

Facebook famously runs their own infra but that didn't work so well for them.

--

FWIW I'm actually of the other notion; I truly believe that you should minimise external dependencies. But that's because I'm a sysadmin (now: SRE) and it's my job to worry about reliability of systems. Less complexity and less external dependency usually coincide with higher reliability.

A person could reasonably argue that it's in my interest to prefer companies run their own stuff, since it might be my job to maintain it, so it's self-serving. So I am not unbiased I suppose.


Auth services need to be engineered for at least five nines if not six. System design fail.


You can engineer for any number of nines and still have massive outages.


What I like to call "nine fives".


With that logic, you can do anything and it’s A-OK if nothing you do succeeds.


That's not the point. 9s and SLAs are contractual obligations, not laws of physics that cannot be broken.

I saw many systems engineered for a lot of 9s be down hard for much longer than the 9s promised, most often due to a perfect storm of issues.

9s are great, and communicate pretty well what the system was designed for, but they are in no way a hard guarantee that the system will be up.


Can confirm. This is how the world works smh.


I mean...you tried. What are you supposed to do if you fail, quit and never come back? That's terrible logic too.


OKTA SLAs and support terms specifically exclude AWS outages, so why would they?


Ahh, but how many of the nines need to go on which side of the decimal point?


They are all on the same side, obviously. But you have a point, nobody ever said there were only nines there.


Probably this isn't their only outage of the decade, but if so, 45 minutes out of ten years actually is five nines.
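
A quick sanity check of that arithmetic, as a minimal Python sketch. The function names are made up for illustration, and a year is assumed to be 365.25 days.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def downtime_budget_minutes(nines: int, years: float = 1.0) -> float:
    """Allowed downtime in minutes for an availability of N nines (5 -> 99.999%)."""
    unavailability = 10 ** -nines
    return unavailability * MINUTES_PER_YEAR * years


def availability(downtime_minutes: float, years: float) -> float:
    """Fraction of time the service was up, given total downtime over a period."""
    return 1 - downtime_minutes / (MINUTES_PER_YEAR * years)


if __name__ == "__main__":
    print(f"five-nines budget over 10 years: {downtime_budget_minutes(5, 10):.1f} min")
    print(f"six-nines budget over 10 years:  {downtime_budget_minutes(6, 10):.1f} min")
    print(f"45 min of downtime in 10 years:  {availability(45, 10):.6%}")
```

Under those assumptions the five-nines budget over a decade is a bit under 53 minutes, so a single 45-minute outage fits, but it would blow through a six-nines budget of roughly 5 minutes.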


TIL okta is in us-west-2


Okta uses nearly all the us regions, with older (and larger) customers in us-east-1.

I used to work there and know the internals well. These aws outages must be causing massive chaos there.



Given they were down I would say there was, in practice, no redundancy. Simply claiming something doesn't make it true.


I don't know the details but they don't fail over automatically. There has to be a reason to push the button. Perhaps the reason did not exist this time. But I know for sure that there is redundancy.


Redundancy is only relevant if it helps you in an outage. Otherwise it's just a pointless marketing term no matter how much effort you put into it.


So Okta's redundancy is about as reliable as AWS's status page?


Heh. Our company just switched to their TFA from MS as of this morning. Poor timing.


What was the business rationale for such a switch? I usually see people migrate toward MS auth, not away from it.


We're moving everything that is still tied to it away from MS. The last products we have to solve are Dynamics and Excel but that scope is so small compared to everything else that it might not matter to leave those as-is for now if we can get at least Dynamics as SaaS and ditch AD (which only remains for Dynamics).

MS doesn't do the things we need in a better way than other options, and it's almost always more expensive at product level and TCO level.


I’ve found the opposite — as long as you’re happy with staying within the MS ecosystems. So Azure, Office 365, Teams, etc…

You hear people calling Microsoft expensive when they’re on some random mix of Gmail, Notes, CM9, or whatever.

Then MS seems expensive because it’s all or nothing. Dipping your toe in the water turns into a dive to the bottom of the pool.


Many vendors are like that. IMO a vendor that can give you a good deal but only if you use them for everything is second-tier compared to a vendor who you can mix and match individual products from at a decent price.


To be honest not a lot of our users and administrators have actually been happy in the MS ecosystem. There are a few outliers, some licensing middlemen, a few MSPs and a couple of hardcore Excel number crunchers that use the Axapta or Dynamics connectors. But you can find those anywhere like with SAS and SPSS. A lot of users don't really care at all so that just makes it a cost and 'does it do the bare minimum'-deal for them.

A few people that really invest in and enjoy a specific application do not make it great, especially when it turns out they are just doing more than they should be doing; i.e. when you have an InDesign professional that would be typesetting materials for publication but the person that writes the copy is also trying to 'typeset' the source in Word. It's great if you then feel like Word gets you cool typeset documents as a power user, but if 9999 people in a 10k company don't do that and just let the publication team do that properly in InDesign according to the media standards, it's no reason to keep it as a default available application.

A lot of the usage comes from "well, it was already there so I went and did it in that". Not because it was actually the standard, best choice or in scope of the task that was supposed to be done.

Same goes for things like notes and documentation:

- Code-level docs go in the repo (MD, RST mostly)
- Org-level docs go in the wiki (Confluence)
- Publications are delivered as copy to the publication team which then uses the DTP/typesetting thing of choice

Yet someone who would ignore that creates extra work by doing it in a different application first, then copying it around and converting it. That means that the person/process needs to be fixed, and doesn't mean we need Word as an expensive WordPad/Pages replacement.

Now, this might not apply to things like mini-orgs inside a bigger org, or very small companies and individuals. But I wasn't writing about those anyway ;-) At that level you don't really have the size and scope to make good choices anyway, and you're best off just sticking with one big vendor, not because they are the best, but because you won't be handling multi-vendor management anyway.


> couple of hardcore Excel number crunchers

Well, you see... that's just it.

There is no direct substitute for Excel offered by any other vendor.

Similarly, PowerBI has no direct competition with even a tenth of the capabilities. Before then Analysis Services + Excel was absolutely the best, and nothing else could hold a candle to it.

Active Directory + Group Policy had no viable competition, and still doesn't.

For orgs that must be on-prem only, Microsoft Exchange only had Lotus Domino as a vaguely equivalent competitor. There are no open-source equivalents.

InTune + Windows + Azure AD + Hello 4 Business is hard to beat. You can assemble a mish-mash of vaguely compatible products, but it's a lot of work.

MS Teams is hard to beat for large enterprises because of the deep integrations.

Etc...


There are plenty of substitutes for everything, it's just that people don't like change.

We have just as many hardcore Databricks users that wouldn't want to move, or hardcore MATLAB users and Mathematica users. We don't have anyone using PowerBI anymore, those all moved to Databricks and Tableau.

Active Directory and GP are a burning trashfire and only can't be replaced if you're stuck without modern MDM on a Windows Desktop construction. The only Windows we have left is VDI based on Citrix and Ivanti, everything else is "do whatever you want" where users have a BYOD choice with no internal access (and in reality they don't need it anyway) or VPN access with a choice of Kolide-based compliance or MDM-based compliance, and either are used to gate connections, in combination with DLP and standard anti malwares.

InTune sucks, Azure AD is nice, but when attempting to integrate with everything that already exists it sucks again. We have everything that is fully-managed or half-managed on JAMF and everything else is just isolated or internet-only (which was a requirement starting 2 years ago anyway). MS Teams never got a foothold here, sucks in so many ways people just rather have physical meetings or use email. Slack on the other hand works well and has been in constant use for over 5 years now. On-prem exchange was deleted and migrated to Google's thing a few years ago, works fine, does everything we need for quite a low price and quite high user happiness. It also ended up being the directory replacement. We no longer need Kerberos, except for some legacy Windows-desktop applications, but those are stuck inside VDI anyway until we can move on.

We used to have large VMware server farms and physical oracle boxes too. The first category was moved to AWS or replaced with microservices on Kubernetes, the second one migrated to Postgres RDS in AWS, but that is an ongoing process (roughly 10TB in table size done so far, and yes that did take tweaks like adding actual indices where Oracle would automagically do that on Exadata machines for you). Even with the human investment the cost is lower, productivity higher and we are less constrained by contracts. The best part is the elasticity we gained which was always problematic, even with the fake cloud (vmware-on-aws for example) or pretend-to-be-hybrid cloud solutions (azure) that never bear the fruit they advertise, or do but aren't actually better in the end.

Perhaps a big difference between the projects at this company and others is the vertical integration where things that are distinguishing to the company are pretty much created from scratch by internal development and maintenance teams. Things that don't matter or are shit no matter how you do it (looking at VDI) are delegated to MSPs but they have to run on our infra so we know the state of the infra no matter what the MSP is trying to tell/sell us. At the end of the day, this works great for us, everyone is happy, and a profit is made. This is in Western Europe if that makes a difference.


According to our CTO, it was related to security. With Microsoft, the TFA options did not include a hardware token. So now we can authenticate with a phone call, a text message, or a security token. ( https://www.okta.com/identity-101/security-token/ )

The main advantage is that the hardware token can be used in areas where mobile phones are prohibited, and of course immunity from a SIM swap attack.


They do support hardware tokens. I use a Yubikey with it. However, support is spotty outside of Windows.


There may have been other reasons not mentioned, such as Microsoft's tiered services model. It sometimes seems as if they deliberately provide poor solutions at the lower tiers. The USG is pretty pissed off about that.

Also, Yubikey would not work, because like mobile phones, USB devices are restricted in some areas.


> Also, Yubikey would not work, because like mobile phones, USB devices are restricted in some areas.

What kind of token would work, then? Something that only generates a TOTP, like those fobs some banks used to give out?


Yes. Same method as Google/Microsoft Authenticator, but implemented in a separate hardware device (fob).
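
For anyone curious what "same method" means in practice: those fobs and the authenticator apps typically implement TOTP (RFC 6238), which is HOTP (RFC 4226) with the counter derived from the clock. A minimal stdlib-only Python sketch, assuming HMAC-SHA1 and a 30-second step; the shared secret below is the RFC test key, not anything a real fob would ship with.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over an 8-byte big-endian counter."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter set to the number of elapsed time steps."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits)


if __name__ == "__main__":
    rfc_test_key = b"12345678901234567890"  # test key from the RFCs
    print(totp(rfc_test_key, now=59, digits=8))  # RFC 6238 test vector: 94287082
    print(totp(rfc_test_key))                    # code for the current time
```

The fob and the server only need the shared secret and a roughly synchronized clock, which is why the device needs no USB or network connection.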


Ah, Microsoft has this in public preview at the moment so full support looks imminent.


Link please so I can track?


Microsoft's competitor to Okta is Azure AD. Why would authentication factor support rely on client OS?


I agree with you, it shouldn't, as evidenced by another of their properties which works fine: GitHub.

I don't know the actual answer to this question. My speculation is they don't want to have to support too many platforms, so they just flat out refuse to serve them.


I had to use Okta at a company once. I asked myself the very same question you posed, every single day.


I've used Okta at two companies now and I found it fairly pleasant. Most issues were around people getting locked out in my experience.


Is MS auth better? Does that mean Active Directory?

We're in the middle of a migration from in-house auth (which we need to get rid of) to Okta and I think the people involved are finding Okta pretty confusing. But it's a big product and auth stuff is complicated, so I'm not sure how much it's Okta's fault.


In the context of Okta it's probably AzureAD. But yes, it's related to Active Directory, you can easily sync the two. It's probably why many companies use it: it's easy to add on to your existing Windows infrastructure.

My client uses it, it works mostly well. It does have its annoying limitations, though, such as no group inheritance and limited support for hardware tokens outside of Windows (no support on Safari/iOS, Safari/macOS, Firefox/Linux).


Add to that, based on the numerous implementations I've seen of Azure AD, Microsoft doesn't have a sandbox/preview instance (or it's cost prohibitive), so testing it with your app before going live with your new app is ... challenging.

Okta does provide one at a reasonable cost, so it's easier to test with your new app deployments.


> Add to that, based on the numerous implementations I've seen of Azure AD, Microsoft doesn't have a sandbox/preview instance (or it's cost prohibitive), so testing it with your app before going live with your new app is ... challenging.

I haven't seen any sandbox feature either. The way we handle this is by creating a "test" app if it requires special rights, so we end up with SomeApp-test and SomeApp-prod. I don't know if there's any limit on the number of "apps" you can have.


We are big in AzureAD (and getting bigger) and this is how we do it, otherwise our non prod systems will need non-prod UAT users, and when the cloud AD is tied to on prem traditional AD, that’s really hard for non-technical users to test (but we do have a test AD/AzureAD for testing changes to the directory itself).


What do you mean hardware tokens outside of windows? For primary workstation authentication? Or AzureAD auth? By hardware tokens do you mean RSA or TOTP hardware fobs?


I mean for AzureAD auth, so for "applications", as in OAuth/OIDC. Although I'd like to be able to use the key for primary workstation auth, as I can do on Linux.

By hardware tokens I mean U2F, in my case a Yubikey.


AAD allows you to use FIDO2 for OIDC, with specific browsers being supported cross-platform: https://docs.microsoft.com/en-us/azure/active-directory/auth...

It's not 'u2f' exactly, but better (imho) for most people.


> with specific browsers being supported cross-platform

This is why I said support was "limited". Basically, only Chrome is supported cross-platform, which I don't use anywhere.


In general I'd say yes. AD syncs to Azure totally painlessly and MFA "just works" if you turn it on.


I think that depends on many factors, like where you work, what class of companies you work for, etc. For example, I have not even seen a Windows machine in 10 years at work. I can't think of anyone in my professional circle either who would suggest using any MS product.


I saw a Linux machine recently… I think.


I think this is why having the ability to self-host is worthwhile. This option gives you flexibility to bring this stuff in house if you want to build a team to operate it (or put it on a current team's todo list).

Should you? I don't know your situation and whether you can build an Okta-caliber level team internally. (My guess is that many smaller or non-tech focused orgs would have a hard time with that, but that's just a guess.) It's a hard question worth asking.

It's easy to think "we could have done better" when things are on fire, as opposed to all the times when the status chart is all green and you don't have to think about Okta (feel free to s/Okta/other service provider/) at all.

Disclosure: I work for FusionAuth, an auth provider that has both SaaS and self-hosted installation options.


When I worked for an event services company and a fairly large SR22 / budget insurance company, the better question was not cloud or no cloud but really hybrid cloud systems. The first one had a rather long downtime event, after which they invested in a remote failsafe. The insurance company was largely resilient: when the power went out the servers stayed up, but the people doing the work weren't able to come in. They bought a generator but they weren't able to turn it on.

Blaming Okta or any other group isn't the issue. Your customers don't care how you are down; they only care if you are down.

Also, I got a bill from Amazon when I forgot to shut down a SageMaker instance and that cost me $700. I self-host now, buying a business internet package with a static IP. I also upgraded the machine but it wasn't necessary, and in hindsight I shouldn't have done the upgrade but just fixed the case.


rolling your own auth is underrated.


You still have to be more available than they are. Seems like a poor problem to tackle.


What's your stack for this?


Oh. It wasn't just me.




