How to Make your Customers Happy and Your Engineers Even Happier


Andy Smith

August 18, 2025


It’s common for monitoring and alerting solutions on production services to be quite chatty, sometimes producing hundreds of notifications per day. There’s a scene in the classic film The Matrix where one of the characters (Cypher) is looking at a screen with green unintelligible code. He’s asked:

“Do you always look at it encoded?”

He responds by telling Neo he doesn’t see code, just what it represents.

Like Cypher after enough time in The Matrix, you eventually learn to read the sea of notifications: you understand what they mean and you know which ones are important. Unfortunately, your new team members have to learn all of that from scratch, and they may have no idea what customer impact these alerts actually have.

If too many of these notifications are merely informational rather than customer impacting, what follows is “alert fatigue”: you begin to tune them out because of the sheer volume. This becomes dangerous when you miss something in the noise that IS important. You can bet that the notification you missed will lead to a 3AM support call from an irate customer, which is SLO not fun!

Scenario

I joined an eight-person team at a large enterprise client that had suffered a few production incidents. There were two in July, both high-stress problems, and the team worked impeccably to resolve them. However, users of the website noticed them first and called Customer Support, and only then was it “all hands on deck” to resolve them. This is known internally as a “Major Incident”*. The team fixed the issues and everyone was happy, but of course the damage was already done: reputation had been harmed. We were in reactive firefighting mode.

It doesn’t have to be like this. It wasn’t that there was no monitoring; there were SLO many Prometheus alerts firing, and that was the problem. Because of their frequency, a lot of these alerts were simply ignored. They didn’t tell the engineers specifically that a key feature of the site was not working. Just because a pod has crashed with “out of memory” doesn’t necessarily mean that the site is broken for your users: modern cloud infrastructure is built to be resilient, and services are restarted automatically.
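
To make that distinction concrete, here’s a rough sketch in Python of cause-based versus symptom-based signals. The names and thresholds are purely illustrative (nothing here is our actual alerting configuration): infrastructure noise like a pod restart belongs on a dashboard, and only user-facing failure should page a human.

```python
# Illustrative sketch: route cause-based signals (pod restarts, OOM kills) to a
# dashboard, and page only on symptom-based signals that measure what users see.
# Names and thresholds are made up for this example.

from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    user_facing: bool     # does it directly measure user experience?
    failure_ratio: float  # observed failure rate over the signal's window


def should_page(signal: Signal, threshold: float = 0.01) -> bool:
    """Page a human only when users are actually affected."""
    if not signal.user_facing:
        # e.g. a pod OOM-kill: Kubernetes restarts it, so note it on a dashboard
        return False
    return signal.failure_ratio > threshold


signals = [
    Signal("pod_oom_killed", user_facing=False, failure_ratio=1.0),
    Signal("purchase_requests_failing", user_facing=True, failure_ratio=0.04),
]

for s in signals:
    print(s.name, "->", "PAGE" if should_page(s) else "dashboard only")
```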
 

SLO, what did we do?

The change was to implement Service Level Objectives - if you’re new to SLOs, I’ve written an introduction to them here and here. We also needed buy-in from the engineering team: they had to believe that increased observability with SLOs and OpenTelemetry would make their lives easier.

The team was used to metrics and alerting from Prometheus, so they were skeptical that SLOs would be any better - or, even worse, would add more alert fatigue to an already exhausted team. Even though the existing Prometheus metrics and Grafana dashboards had failed to alert the team to critical problems, in this high-pressure environment people felt safer with the devil they knew.

We ran some educational sessions to bring people up to speed, and eventually we convinced the team to try out SLOs. We then defined what the important user journeys and critical paths looked like, with examples such as:

  • Can the user log in?
  • Can they make a purchase?
  • Can they get to their account details?

We created SLOs around these using the existing metrics we were already capturing. Initially no one knew what level of reliability we should aim for, so we did the simplest thing: we picked an arbitrary target we expected to hit, in our case 99%, and started gathering data.
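
For anyone new to the mechanics, here’s roughly what those journey SLOs boil down to: an availability SLI (good attempts divided by total attempts) per user journey, measured against the 99% target. The journey names and counts below are invented for illustration; in our case the real numbers came from the Prometheus metrics we already had.

```python
# Illustrative sketch: an availability SLI per user journey and how much of the
# error budget each journey has consumed against a 99% target over one window.
# The attempt counts below are invented for illustration.

SLO_TARGET = 0.99               # 99% of journey attempts should succeed
ERROR_BUDGET = 1 - SLO_TARGET   # 1% of attempts are allowed to fail

journeys = {
    # journey: (successful attempts, total attempts) over the measurement window
    "login":           (985_000, 990_000),
    "purchase":        (49_600, 50_000),
    "account_details": (119_900, 120_000),
}

for name, (good, total) in journeys.items():
    sli = good / total                      # measured availability
    budget_used = (1 - sli) / ERROR_BUDGET  # share of the error budget consumed
    print(f"{name:15s} SLI={sli:.3%}  error budget used={budget_used:.0%}")
```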

We quickly realised that target wasn’t achievable, so we changed it. Our first attempts weren’t quite right, but that didn’t matter; the goal was to just ‘do something’ and keep iterating and improving. We took a “lean” approach and didn’t use much ceremony: a quick conversation and we changed the target.
 

The nightmare before Christmas

With SLOs implemented came the first real test. Users of our APIs enabled a particular feature which, it turned out, had a serious bug. The failures started burning our error budget, and our trusty SLOs alerted us. The fact that we knew there was a problem before Customer Support did was a real improvement: the Time To Restore was much better.
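
The alert behind that story is what’s usually called a burn-rate alert: you compare how fast the error budget is being spent against the rate that would spend it exactly over the whole SLO window. Here’s a rough sketch of the widely used multi-window burn-rate pattern; the 14.4x threshold is a commonly quoted default for a 30-day window, and the error ratios are invented, so treat this as an illustration rather than our exact alerting rules.

```python
# Illustrative sketch of a multi-window burn-rate check: alert when the error
# budget is being spent much faster than the "sustainable" rate, and require a
# long and a short lookback window to agree so the alert clears quickly once
# the problem is fixed. Error ratios below are invented.

SLO_TARGET = 0.99
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail per 30-day window


def burn_rate(error_ratio: float) -> float:
    """How many times faster than sustainable we are spending the budget."""
    return error_ratio / ERROR_BUDGET


def fast_burn_alert(error_ratio_1h: float, error_ratio_5m: float) -> bool:
    """Page if both the 1h and 5m windows burn budget more than 14.4x too fast,
    i.e. roughly 2% of a 30-day budget gone in a single hour."""
    return burn_rate(error_ratio_1h) > 14.4 and burn_rate(error_ratio_5m) > 14.4


# A buggy feature pushing the error ratio to 20% trips the alert...
print(fast_burn_alert(error_ratio_1h=0.20, error_ratio_5m=0.25))   # True
# ...while a brief blip that has already recovered does not.
print(fast_burn_alert(error_ratio_1h=0.02, error_ratio_5m=0.001))  # False
```

In practice this arithmetic lives in recording and alerting rules in the monitoring system rather than in application code, but the idea is the same: alert on budget being spent, not on every individual error.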

A few weeks later another feature was deployed that was problematic, and our SLOs told us again. The team sprang into action and resolved the issue. This time, it didn’t reach Customer Support. We did better than last time. We broke up for Christmas and everyone had a good Christmas break!

New year, new beginnings

Because of our success in preventing a Major Incident, the team now saw the value in SLOs and started to trust them. Rather than me having to be harsh with the team and tell them we had issues, they were proactively checking the SLO graphs themselves. Conversations about the SLOs were happening almost daily without prompting.

In February our team rolled out a new feature that hurt the performance of a key API. Our latency-based SLOs alerted us, and we rolled back the change. No one complained and no one got a support call.
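
Latency SLOs work the same way as availability SLOs: the ‘good’ events are simply requests that complete within a threshold. Here’s a rough sketch with an invented 300 ms threshold, a 99.5% target and a toy list of request durations; a real implementation would compute this from latency histogram metrics rather than raw per-request timings.

```python
# Illustrative sketch of a latency SLI: the fraction of requests served within
# a threshold, compared against the SLO target. Threshold, target and the
# request durations below are invented for this example.

LATENCY_THRESHOLD_MS = 300
LATENCY_SLO_TARGET = 0.995  # 99.5% of requests should complete within 300 ms

request_durations_ms = [120, 95, 310, 180, 2400, 150, 90, 220, 130, 105]

fast_enough = sum(1 for d in request_durations_ms if d <= LATENCY_THRESHOLD_MS)
latency_sli = fast_enough / len(request_durations_ms)

print(f"latency SLI = {latency_sli:.1%}")
if latency_sli < LATENCY_SLO_TARGET:
    print("burning error budget faster than the SLO allows -> consider rolling back")
```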

An SRE team rolled out a change to the Kubernetes cluster that our services ran on, moving to Graviton instances to save money. Due to an obscure bug in a library we were using, connections were timing out on Graviton. It would have caused a Major Incident. However, our trusty SLOs alerted us and the change was rolled back in a very short timeframe. No one complained and no one got a support call.
 

SLO, did we improve?

Massively! Since the start of the year we’ve had zero Major Incidents. That’s not to say it will never happen again; like automated testing, SLOs are not a silver bullet, and bugs can and will slip through the net. But we have gone from being reactive to proactive, and that’s the big improvement.

Key lessons:

  • You have to talk to people (who’d have thought it). You have to sell an idea.
  • Changing culture is hard. People can be resistant to change. Find an ally and build from there.
  • It’s not a tool problem, it’s a people problem. Changing the culture was the most important thing to do.
  • You need buy-in from the team rather than it being enforced. If you try to enforce it, the team will push back and you won’t get the change you desire.
  • People are skeptical about SLOs. They are used to other ways of monitoring, and to the excess alerts that come with them.
  • SLOs are not a silver bullet; like automated testing, you can’t catch everything.
  • Your users’ experience on particular journeys through your site is what matters, not CPU / memory / disk usage. Monitor those user journeys instead.
  • Having silos of a development team that builds it, Ops that deploy it, QA that test it and Ops that monitor it won’t lead to success. The whole team needs to take ownership of the SLOs. The team will take ownership if they see the value, in our case by preventing major incidents. The mantra should be: You build it, you test it, you deploy it, you monitor it, you fix it!


SLO, don’t be like Cypher: don’t learn what all the code in The Matrix means. Start measuring and acting on what matters: your users’ experience! You’ll soon be making your customers happy and your engineers even happier.
 

Andy Smith

Lead Developer and Crafter

Andy Smith first cut his teeth on a Commodore 64 and had a game published. Certain fruity companies would call the game "revolutionary," others would call it a "shoot em-up." He has been tinkering with computers ever since.