When I was starting my career, in a “traditional” organization, there were different departments.
In a basement with no natural light, affectionately called the dungeon, we had System Administrators performing magic on Linux. In another area of the dungeon we had developers, bashing out reams of code at a rate of 1000 lines a minute. It smelled like pizza and worse.
On another floor, a little further up with a view of the car park, we had our testers. The developers would code, then send it over to the testers.
All of this was known as “Waterfall,” and we came to the conclusion in 2001 that this didn’t work too well. That’s how the Agile Manifesto was born. (Did you know 8th Light hosts this?!)
Along came DevOps, based upon the principles of Agile to address shortcomings of separate “Operations,” “Development,” and QA teams. The benefits have been well researched and documented in books such as “Accelerate”.
Amazon Web Services (AWS) defines DevOps as: “the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity.” They go on to describe DevOps as:
“development and operations teams are no longer “siloed.” Sometimes, these two teams are merged into a single team where the engineers work across the entire application lifecycle, from development and test to deployment to operations, and develop a range of skills not limited to a single function.”
DevOps addresses the problems with silos and improves the frequency of code deployments, hence giving clients the features they want quicker.
As a practice, DevOps changed the way a lot of folks throughout the delivery cycle work on projects. DevOps also introduced the need for specialists who focus on monitoring and responding to the fundamental challenges that the practice exposed.
In this article I will talk about Site Reliability Engineering, what it means and where it comes from, and provide an overview of the metrics and tasks that Site Reliability Engineers focus on, and how they help teams work together and create a more reliable application.
Site Reliability Engineering
Site Reliability Engineering is a practice invented by Google and popularized in the book of the same name. Google took the principles of DevOps and created a role based upon them. It has been said that if you described it in Java code, DevOps would be an interface, and SRE the class that implements that interface.
As the name suggests, Site Reliability Engineer is a specific role that allows engineers to focus on the reliability of the service. It is not a new name for a “DevOps Engineer,” which in turn was not a new name for a “System Administrator.” So how do we know if we should focus on reliability?
No service is 100 percent reliable. Expectations of 100 percent reliability are unrealistic. Your favorite restaurant is not 100 percent reliable. Most of the time you’re happy with the service and the quality of your Chicken Jalfrezi. If 1 out of 20 times the service is slow, you’re probably going to forgive the transgression and go back. If however, they get a new chef and the quality goes down, you stop going.
Similarly, the mobile network infrastructure is not 100 percent reliable. Even if it were possible for your favorite e-commerce site to have 100 percent uptime, you wouldn't experience that on your smartphone, because you’re reliant on a mobile network that isn’t 100 percent reliable.
Because we tolerate a level of unreliability, we need to know what the threshold is. The users of your service are the ones who tolerate it, so any metrics must capture their point of view.
Building The Reliability Stack
In his book “Implementing Service Level Objectives,” Alex Hidalgo coined the term “Reliability Stack,” which describes three concepts that allow developers to monitor how reliable a service is. These are “Service Level Indicators,” “Service Level Objectives,” and “Error Budgets.” Error budgets are built upon SLOs, which are in turn built upon SLIs.
Service Level Indicators
Service Level Indicators (SLIs) are a particular metric from the users’ point of view, for example:
“The time it takes to receive my Chicken Jalfrezi at my favorite restaurant.”
“The time it takes to load a page (the latency).”
“The number of good versus bad login requests on the site.”
The metrics shown in examples 2 and 3 can be obtained from data sources such as Prometheus and other such tools. On their own they don’t provide the reliability figure we would like, so teams use SLIs to build their Service Level Objectives (SLOs).
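As an illustrative sketch, an SLI like the third example boils down to a ratio of good events to total events. The function name and counter values here are hypothetical, not from any particular tool:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return an SLI as the percentage of good events (e.g. successful logins)."""
    if total_events == 0:
        return 100.0  # no traffic means nothing failed
    return 100.0 * good_events / total_events

# e.g. 9,950 successful logins out of 10,000 attempts gives an SLI of 99.5
print(availability_sli(9_950, 10_000))
```

In practice these counts would come from a metrics backend such as Prometheus, but the arithmetic is the same.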
Service Level Objectives
A key tenet of Site Reliability Engineering is Service Level Objectives. SLOs offer data on key user journeys that have been agreed on with the business. Without SLOs, there’s no scientific way to know whether or not reliability has improved.
Example SLOs for the above SLIs could be:
“The time it takes to receive my Chicken Jalfrezi should be less than 15 minutes 99% of the time.”
“The time it takes to load a page should be less than 3 seconds 99.5% of the time over a 7 day period.”
“99% of login requests on the site should be good over a 28 day period.”
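An SLO check of this kind can be sketched in a few lines. This is a hypothetical helper, assuming you already have a window of per-request latency samples in seconds:

```python
def meets_latency_slo(latencies_s, threshold_s=3.0, target_pct=99.5):
    """Check whether the share of requests faster than `threshold_s`
    meets the SLO target percentage over the sampled window."""
    good = sum(1 for t in latencies_s if t < threshold_s)
    achieved_pct = 100.0 * good / len(latencies_s)
    return achieved_pct >= target_pct

# 999 fast page loads and 1 slow one: 99.9% good, which meets a 99.5% target
print(meets_latency_slo([0.8] * 999 + [5.0]))
```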
For the second example, if the page load time should be less than 3 seconds 99.5% of the time over a 7 day period, another way of saying this is that we tolerate page load times of more than 3 seconds 0.5% of the time. This equates to tolerating 50.4 minutes of unreliability in every 7 days.
7 days = 10080 minutes.
0.5% of 10080 minutes is 50.4 minutes.
This is what is known as an “error budget.”
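The arithmetic above generalizes to any SLO and window. A minimal sketch (the function name is mine, not a standard API):

```python
def error_budget_minutes(slo_pct: float, window_days: int = 7) -> float:
    """Minutes of tolerated unreliability implied by an SLO over a window."""
    window_minutes = window_days * 24 * 60  # 7 days = 10,080 minutes
    return (100.0 - slo_pct) / 100.0 * window_minutes

# 99.5% over 7 days: 0.5% of 10,080 minutes = 50.4 minutes of budget
print(error_budget_minutes(99.5))
```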
Error Budgets are calculated using SLOs, and they show the thresholds that SREs can measure against. Like all budgets, this one is designed to be spent! Teams can use this to account for unplanned unreliability. They’re also helpful for factoring in planned maintenance, such as releasing new features, upgrading components, or doing some chaos engineering. When teams use up all of this error budget and exceed it, then they are violating this SLO and know they should focus on the system’s reliability, rather than releasing new features.
Obviously these SLOs have to be agreed on with the business, as it effectively means that no, you cannot release new features. It’s important that all stakeholders agree to this culture of SLOs or the whole process is pointless.
How Reliable Should You Be?
This question is commonly known as the “number of 9s,” as the answer is a reliability percentage that adds more 9s to 99 percent.
Earlier in this post, I talked about percentages and error budgets. In that example I talked about unreliability in a 7 day period (10080 minutes). After doing some math, the following table shows the number of bad minutes in that period that the team can tolerate if they add more digits after the decimal point.
|Target reliability percentage|"Good" minutes over 7 day period|"Bad" minutes (error budget) over 7 day period|
|---|---|---|
|99%|9,979.2|100.8|
|99.5%|10,029.6|50.4|
|99.9%|10,069.92|10.08|
|99.99%|10,078.992|1.008|
|99.999%|10,079.8992|0.1008|
If you have an SLO with a figure of 99% (two 9s), then the table shows you can tolerate around 1 hour and 40 minutes (100.8 minutes) of unreliability. This allows a fair amount of flexibility with unplanned downtime and doing releases. You might say it’s a bit lenient.
Adding an extra 9 (three 9s, or 99.9%) allows you only around 10 bad minutes within a 7 day period. That is considerably more stringent, and might not give you the leeway to do your updates.
Taking the last one, 99.999 percent (five 9s), you’ve got about a tenth of a minute per week to be unreliable. This is a very high bar indeed, and probably unrealistic and impractical for most businesses. Each 9 you add shrinks the error budget tenfold, and the effort and expense required to achieve it grows accordingly.
Whichever figure you choose, it should be a conversation with your users, as it is they who really define what reliable is.
How Does an SRE Manage Reliability?
Site Reliability Engineers leverage a combination of tools to provide a holistic view of their system’s health and performance. These approaches combine monitoring and observability tools with custom metrics and objectives that ladder up in granularity, in a structure that parallels how developers architect their test suites.
Many organizations now use tools such as Prometheus to provide monitoring of their services. They then typically build dashboards on top of these metrics with Grafana so they can monitor for special events. There may be some alerting set up to page someone when a threshold is exceeded. These alerts are a bit like a “unit test” for production, in that when some condition is true, they raise an alert. These metrics are often quite low-level (such as CPU usage, or disk space), and are not geared around user journeys or something that “the business” might recognize. In some organizations, even the developers do not know what these metrics are for. They are not ideal for telling us if our service is reliable or not.
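A minimal sketch of this kind of threshold alerting, with hypothetical metric names and limits (real systems would express this as, say, Prometheus alerting rules rather than application code):

```python
# Hypothetical low-level metrics and thresholds, of the kind a
# Prometheus-style exporter might expose.
THRESHOLDS = {"cpu_usage_pct": 85.0, "disk_used_pct": 90.0}

def evaluate_alerts(metrics: dict) -> list:
    """Return the names of metrics that have crossed their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

# CPU is over its limit, disk is fine
print(evaluate_alerts({"cpu_usage_pct": 91.0, "disk_used_pct": 50.0}))
```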
On the other hand, SLOs are higher level, and are there to test a journey from the user’s perspective. In some ways they feel a bit like an “acceptance test.” An acceptance test proves that some feature of the site works to specification so the client "accepts” that this works.
As an analogy, this feels a bit like the “test pyramid” (taken from Martin Fowler’s blog):
Unit tests are at the bottom of the pyramid and show that we have more of these types of tests. They are quicker to execute and are smaller in scope.
At the top of the pyramid in this picture we have UI Tests (or acceptance tests) and we have a lot less of these because the scope of them is much larger and they execute much more slowly. Tests at this level tend to be focused on business features and are written in a way that is meaningful to ‘the business’ (in a good implementation of the test pyramid).
For the “monitoring pyramid,” SLOs sit at the top, signifying there are fewer of them. These focus on key user journeys, so while there are fewer of them, they are larger in scope, similar to UI or acceptance tests.
The pyramid next shows metrics (such as Prometheus) beneath SLOs on the pyramid to signify that there are more of them and they are smaller in scope. Like a good testing strategy with different types of tests in the testing pyramid, a good observability strategy should include different types of monitoring. Just as UI or acceptance tests are not a replacement for good unit tests and instead complement them, SLOs are not a replacement for metrics. They complement them.
A lot of distributed architectures also employ tracing (using tools such as Honeycomb or Lightstep) and logging. Quite often, there is a lot of this instrumentation, so it belongs at the bottom of the pyramid.
Service Level Agreements (SLAs)
It is worth mentioning at this point the similarities between Service Level Objectives and the Service Level Agreements (SLAs) that most companies are familiar with. SLAs are a promise to a client to guarantee a particular level of service. If you fail to meet that promise, there is usually a penalty of the monetary kind. Clearly you do not want to breach your SLAs.
Ideally, if you are going to breach anything, then you would rather it be an SLO first, as this allows you to resolve issues before you have to pay penalties. Therefore, your SLOs should be more stringent than your SLAs. But of course, not so stringent that you can’t perform releases or have unreliability, i.e. leave yourself enough error budget.
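This relationship can be captured in a trivial guard. The numbers here are illustrative, not from any real contract:

```python
def slo_is_stricter(slo_pct: float, sla_pct: float) -> bool:
    """An SLO should be tighter than the SLA it protects, so it trips first."""
    return slo_pct > sla_pct

# An internal 99.9% SLO guarding a contractual 99.5% SLA
print(slo_is_stricter(99.9, 99.5))
```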
Since Google invented SRE, it is fair to say that if you don’t use SLOs to drive reliability conversations with your users, you are not really doing SRE (as Google defines it).
Site Reliability Engineering goes beyond these core concepts, and is also about reducing toil (automating things) and improving the systems in their care to improve reliability. But you cannot achieve those goals without proper data and agreements with users, and that is what Service Level Objectives facilitate. Once you’ve established these foundational concepts, your team and your system will both feel empowered to make strategic investments and innovate with confidence.