To Minimize Downtime, Prepare for Chaos

It’s 2017, which means there's a good chance that a non-trivial portion of your business depends upon the cloud. Adopting a cloud strategy for infrastructure management brings with it numerous benefits, including increased flexibility, scalability, and potentially reduced IT costs—but as we saw with this week’s AWS outage, even the most reliably consistent service providers can have a bad day. A disruption in service can potentially cause millions in direct revenue loss, as well as immeasurable damage to a company’s brand reputation. The good news is that there are precautions you can take to decrease the negative impact of such an event.

The Damage Report

Amazon Web Services (AWS) is the most widely used cloud infrastructure service today, boasting over 40% of global market share. That ubiquity is not cause for concern in itself; AWS has been among the most predictable cloud service providers, consistently exceeding its Service Level Agreement goal of 99.9% uptime. That said, no service is perfect, as customers of AWS’s Simple Storage Service (S3) learned firsthand.

Just before 11:30 a.m. on Tuesday, AWS’s S3 cloud storage service in its US-EAST-1 (Northern Virginia) Region went offline. The damage was swift and widespread. AWS released a statement after service was restored, noting that the outage was neither intentional nor caused by a corrupted system, but was rather the result of a simple mistyped command. That’s right, a completely unintentional typographical mistake^[1] when entering a single command caused the likes of Adobe, Slack, Expedia, and even the U.S. Securities and Exchange Commission to experience crippling performance issues. Some retailers were knocked offline entirely.

It’s too early to tally the actual monetary damages caused by the nearly five hours of downtime, though early estimates have put losses in the tens of millions, with hundreds of thousands of users affected.

How It Happened

There is no question that the source of the outage can be traced back to actions taken by AWS, but the fact that S3 went down does not in itself explain why the damage rippled across the Internet. While many companies who relied on the US-EAST-1 Region’s services took a hard hit during the outage, other companies were partly or wholly unaffected. Why is that? There are a few factors that came into play.

Hidden Dependencies

S3 is a straightforward tool that has become a fundamental component of most cloud-based systems. Because it is so widely used, numerous other, often complex, services are built on top of it, which compounds the effect when S3 goes down. How much functionality a site or service lost depended in part on how heavily it relied on S3, directly or indirectly.

Network performance monitoring company ThousandEyes outlined three ways that a business’s S3 dependencies could have faltered during the outage:

A company whose static webpage was directly and solely hosted on the affected S3 servers would have experienced a total loss of use. Lululemon Athletica Inc. was among the unfortunate companies in this category.
Partial loss of use would have occurred if some of the elements on the page depended on S3, or on other services that transitively depended on S3. An example of this was Slack, which was mostly functional, but left users unable to upload files.
Critical services of the application may have depended on S3 or on other affected AWS services, causing a partial or complete loss of use. One of the crafters at 8th Light detailed his use of an AWS Lambda function that plugged into a web application firewall to rate-limit malicious users, but which was rendered unusable during the outage, since the function required writing to S3. His advice: “Be aware of the dependencies of your dependencies.”

In a classic bit of irony, AWS was unable to update their status dashboard during the affair, as it also relied on S3 for graphic storage, which meant that downed services were showing up as healthy during the outage because the correct graphic could not be rendered to show otherwise. Talk about a hidden dependency!

The takeaway here is knowing where the risks lie, and purposely planning for them. It may be helpful to think of any remote dependency as a potential point of failure, but specifically in the AWS case, a basic analysis of your remote dependencies could reveal opportunities for reducing the need for such a dependency in the first place.
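As a rough illustration of what a basic dependency analysis might look like, the sketch below walks a dependency map to find everything that would be touched if one component failed. The service names and the DEPENDS_ON map are hypothetical and invented for this example; a real inventory would come from your own architecture documentation or tooling.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it calls directly.
DEPENDS_ON = {
    "web-app": ["rate-limiter", "asset-cdn"],
    "rate-limiter": ["lambda-fn"],
    "lambda-fn": ["s3"],
    "asset-cdn": ["s3"],
    "billing": ["postgres"],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can look up "who depends on X?".
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


if __name__ == "__main__":
    print(sorted(affected_by("s3", DEPENDS_ON)))
```

In this made-up map, "web-app" never calls S3 directly, yet it still appears in the result because its rate limiter does. That is the "dependencies of your dependencies" problem in miniature.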

Basket of Eggs

AWS operates in 16 geographic regions across the world. In the U.S., there are four regions (Northern Virginia, Ohio, Oregon, and Northern California), with the remaining regions distributed across Europe, Asia, South America, and Australia.

When you publish a resource to one of AWS’s services, you can choose the region to which it will be deployed, and each region operates independently of the others. An obvious consideration is proximity, since the closer you are physically to a resource, the quicker it is to access. In a perfect world, choosing on proximity alone would be fine. In our real world, where humans and computers operate with a limited capacity for understanding each other, creating redundancy by replicating resources across regions, or even between cloud service providers, could mean the difference between business continuity and profit loss.

Concentrating all of a business’s resources in one region can spell trouble in the rare event that an outage of this scale occurs. In Tuesday’s outage, only one region was affected, but for companies whose resources were concentrated in that region, that concentration was their undoing.
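To make the redundancy idea concrete, here is a minimal sketch of reading from a primary region and falling back to a replica when the primary fails. It assumes the data has already been replicated to each region by some other means, and the reader callables are hypothetical stand-ins for whatever client your application actually uses.

```python
class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request."""


def read_with_failover(key, region_readers):
    """Try each region in order and return (region_name, data) from the first success.

    `region_readers` is an ordered list of (region_name, reader) pairs, where
    each reader is a callable taking a key and returning the stored bytes.
    """
    errors = {}
    for region, reader in region_readers:
        try:
            return region, reader(key)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors[region] = exc
    raise AllRegionsFailed(errors)
```

The order of the list encodes preference, so the nearest region can still be tried first and the more distant replica is only used when needed.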

Prepare for Chaos

I mentioned earlier that some companies were able to weather the storm and emerge relatively unscathed. That’s because companies that deploy services with the expectation that something, some day, will go wrong are able to better prepare for such storms.

While AWS was dealing with its pains, Amazon’s retail section, built with a reliance on S3, did not experience a hit during the service interruption. Of course the purveyor of AWS would take all necessary and suggested precautions in order to maintain the health of their flagship property.

Then there is the gold standard of chaos planning and prevention, Netflix. They employ a collection of in-house cloud-testing tools dubbed the “Simian Army,” which were specifically developed to simulate havoc on the system in order to catch any potential weak spots that could cause service disruptions or performance issues. By planning for a range of issues, small and large, Netflix is able to anticipate how their system will respond to those issues in the real world, and consequently build safeguards to keep the app running when they occur.
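The same idea can be tried at a much smaller scale. The sketch below is not Netflix’s tooling; it is a hypothetical wrapper, written for this article, that makes a dependency fail at a configurable rate so you can observe how the calling code copes in a test environment.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that was deliberately injected for testing."""


def with_chaos(func, failure_rate, rng=None):
    """Wrap `func` so that a fraction of calls raise InjectedFault instead of running.

    `failure_rate` is a probability between 0.0 (never fail) and 1.0 (always fail).
    Pass a seeded random.Random as `rng` for repeatable experiments.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"simulated outage in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a storage call with a failure rate of 1.0 in a staging environment is a cheap way to see which pages break when that dependency disappears.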

Preparations for outages should not be made with a one-size-fits-all approach. Both Amazon and Netflix maintain massive amounts of data and boast hundreds of millions of users, so not only must they take the extra precautions, their solutions will likely look a lot different than those for simpler applications. Factors such as the volume of data stored, the effort and cost of protecting against different scenarios, and loss-risk calculations should all be considered when deciding how to protect against third-party failures.

On a smaller scale, the solution may not be complete prevention, but rather graceful degradation. It could mean that certain elements of the application integral to the app’s functionality are prioritized for fault tolerance. It could mean moving assets off of the remote resource and accessing certain assets from a local disk.
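One possible shape for graceful degradation is sketched below: try the remote store first, keep a copy on local disk whenever that succeeds, and serve the local copy if the remote call fails. The function name, the fetch_remote callable, and the cache directory are all assumptions made for illustration.

```python
from pathlib import Path


def load_asset(name, fetch_remote, cache_dir):
    """Return (data, source), where source is "remote", "cache", or "unavailable".

    `fetch_remote` is a callable that takes the asset name and returns bytes,
    raising an exception if the remote store cannot be reached.
    """
    cache_path = Path(cache_dir) / name
    try:
        data = fetch_remote(name)
    except Exception:  # broad on purpose: any remote failure means degrade
        if cache_path.exists():
            return cache_path.read_bytes(), "cache"
        # Nothing cached: let the caller hide or replace this element.
        return None, "unavailable"

    # Remote succeeded, so refresh the local copy for the next outage.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(data)
    return data, "remote"
```

Returning the source alongside the data lets the application decide how to react, for example by showing a stale-content notice or disabling uploads while the rest of the page keeps working.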

The S3 outage exposed deficiencies in architecture designs across the web. While AWS has outlined changes to prevent future damage on this scale, it is wise to take extra precautions to shield your business from loss when another calamity occurs. We learned this week that a typo can take down part of the Internet; better preparation can minimize the damage when it happens again.

^[1] We may be tempted to brush this incident off as a simple case of human error, but that would downplay the complexity of a situation in which no single action was the sole cause. For more perspective, check out The Field Guide to Understanding Human Error.