Did your site crash? Blame the architect, not the foundation.

On the morning of February 28th, a typo broke the Internet. Well, to be more precise, an erroneous command took down some critical infrastructure within Amazon's AWS S3 service in the US-EAST-1 region. If you really want to nerd out, you can read about it here.

Because of this tiny mistake, a cascade of failures followed, knocking hundreds of websites and other services offline for about four hours. To the thousands of people who blamed Amazon for the sites going offline, I say "phlybbt." (That is, I understand, how one types the characters that represent the sound of a "raspberry.") Of course, it was a stupid mistake. But here's the thing: stupid mistakes happen all the time. "Human Factors Engineering" is a very fancy term for "reducing the impact of humans' inevitable stupid mistakes." And if you design a system with single points of failure, you should be confident that your system will fail at some point.

To be honest, I architect systems with single points of failure all the time. We say that "we understand the risk, and accept the likelihood that, at some point, our application will fail and be unavailable for several hours." We do it to save money. Period.
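To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The 99.9% availability figure is a made-up assumption, not a number from Amazon or anyone else, and the math assumes the redundant copies fail independently of each other, which real systems rarely manage perfectly.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year that a component with this availability is down."""
    return (1.0 - availability) * HOURS_PER_YEAR


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of `copies` independent replicas when any one of them suffices."""
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical figure: a single dependency that is up 99.9% of the time.
single = 0.999
pair = redundant_availability(single, 2)

print(f"one copy:   {expected_downtime_hours(single):.2f} hours down per year")
print(f"two copies: {expected_downtime_hours(pair):.4f} hours down per year")
```

Under those assumptions, the single copy costs you most of a working day every year, and the second copy buys almost all of it back. Whether that is worth paying for is exactly the decision described above.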

I've also built 9-1-1 systems, and technology that sits behind the 9-1-1 network to keep it running. Downtime in 9-1-1 is a big problem because, when you drop a call, there's a chance that someone is going to die. That's pretty bad. I mean, it's almost as bad as being without Reddit for four hours. But engineering to reduce the impact of failures (because it's not that we think we can prevent them -- it's that we anticipate them) is part of any good architecture process. If there is a buried cable, Backhoe Bob will find it. If there's a file server that's lost the "redundant" part of RAID, the tech replacing the bad drive will spill his coffee into the machine. So plan for things to break, and be ready.
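For the curious, "plan for things to break" can be as simple as having somewhere else to look when the first place is down. This is only a minimal sketch: fetch_with_failover, primary_region, and secondary_region are hypothetical names invented for illustration, and the two sources are stand-ins rather than calls to any real storage service.

```python
from typing import Callable, Sequence


def fetch_with_failover(sources: Sequence[Callable[[], bytes]]) -> bytes:
    """Try each source in order and return the first result that succeeds."""
    errors = []
    for source in sources:
        try:
            return source()
        except Exception as exc:  # deliberately broad for this sketch
            errors.append(exc)
    raise RuntimeError(f"all {len(sources)} sources failed: {errors}")


def primary_region() -> bytes:
    # Hypothetical stand-in for the copy that just went offline.
    raise ConnectionError("primary region unavailable")


def secondary_region() -> bytes:
    # Hypothetical stand-in for a copy kept somewhere else.
    return b"served from the backup copy"


print(fetch_with_failover([primary_region, secondary_region]))
```

The catch, of course, is that the second copy has to exist before Backhoe Bob shows up, and somebody has to pay for it.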

The funniest thing was the incredible number of tweets from people smugly suggesting that the mistake was in using Amazon instead of Microsoft Azure. Because, well, Microsoft Azure has never had a failure of any kind, right? That's like saying "You should always fly Аэрофлот, because United Airlines had a plane crash."

What we should be saying is, "Everything breaks eventually. Let's plan for it, or accept that when things break, we may just have to spend a few hours going for a hike."