Site Reliability Engineers: Becoming Crime Scene Detectives through Observability

Photo by Mediamodifier on Unsplash

Andy Smith

June 09, 2023

Updated May 1, 2025

Working in software development can sometimes feel like working in a crime scene.

There are no murders, but often there’s plenty of dead code and development practices that feel criminal. That’s why working as a Site Reliability Engineer (SRE) requires detective skills. We often have to investigate problems by starting with a theory and gradually eliminating suspects until we find the culprit.

Sometimes I feel like I’m in an episode of Columbo. If you’ve never seen the show, every episode starts with someone getting killed, and Lieutenant Columbo spends the rest of the episode pestering the killer until he gathers enough evidence to arrest that person—all while using his trusty tools of a notebook, pen, and cigar.

On one of my projects, I was faced with a defect of mysterious origins that required a detective’s eye to piece together. Like Lieutenant Columbo, the team had a suspect (hypothesis) and we attempted to find the killer (prove or disprove that hypothesis) using our trusty tools of service level objectives (SLOs), distributed tracing, and metrics.
 

Exhibit A: Service Level Objectives

The system we were working on was a microservices architecture, meaning the software was composed of small, independent services that communicate with each other over well-defined APIs. The perception of how it was performing was not great.

To determine the system’s true reliability, and get a more scientific reading of its performance than “not great,” the team applied SLOs to the case. The SLO was defined as follows:

“99% of all requests should return in 5 seconds or less”

This meant the error budget allowed for one percent of all requests to take longer than five seconds or not return at all.
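
To make the budget concrete, here’s a minimal sketch of the arithmetic (in Java, with entirely made-up request counts) showing how much of a 1% error budget a given number of slow requests consumes:

// Minimal error budget sketch. The request counts are illustrative,
// not figures from the real system.
public class ErrorBudgetSketch {

    public static void main(String[] args) {
        long totalRequests = 1_000_000;        // requests in the SLO window (hypothetical)
        long slowOrFailedRequests = 14_250;    // requests over 5 seconds or errored (hypothetical)

        double target = 0.99;                  // "99% of requests in 5 seconds or less"
        double errorBudget = 1.0 - target;     // 1% of requests may be slow or fail

        double badFraction = (double) slowOrFailedRequests / totalRequests;
        double budgetUsed = badFraction / errorBudget; // 1.0 means the budget is fully spent

        System.out.printf("Bad requests: %.2f%% of traffic%n", badFraction * 100);
        System.out.printf("Error budget used: %.1f%%%n", budgetUsed * 100);
    }
}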

The tool used to visualise SLOs (Nobl9) provided a graph that showed when the error budget was being consumed. The team reviewed these graphs regularly and could see that, over time, the latency was getting longer, breaching the SLO we’d agreed on with the development team.

The SLO graph below is an example of the latencies seen during the investigation. The pink line denotes five seconds, meaning anything above the line was eating into our error budget. 

Observability 1
Exhibit A1

The rule or guideline for error budgets is:


If there’s error budget left then
	Release new features
else
	Focus on reliability

In this case, the error budget guideline indicated the team should focus on reliability.

Exhibit B: Distributed Tracing for Observability

“Traces help you understand system interdependencies. Those interdependencies can obscure problems and make them particularly difficult to debug unless the relationship between them are clearly understood.” — Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda

The system we were working with comprised many microservices, making it quite difficult and laborious to track down the problem. Distributed tracing—one of the tools described in Observability Engineering—had already been implemented across the system, providing us with observability into the system’s interdependencies and clues pointing to the culprit. Traces were sent to the observability tool of choice, Lightstep (shown below).

The graph shows the response times of requests for the service that the SLO highlighted as being an issue. The advantage of this type of tool is that you can click on a data point within the graph and see a trace of all the calls behind the chosen request. 

Observability 2
Exhibit B1

In the investigation, we clicked on a point with a large latency—one of the tall spikes in the graph—which brought up a distributed trace (example screenshot included below). The trace provided observability into the call stack between every microservice and the time spent in each one.
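
The post doesn’t say which instrumentation library produced these spans; as one hedged example, this is roughly what wrapping an outbound call in a span looks like with the OpenTelemetry Java API, which Lightstep can ingest. The service and span names (checkout-service, GET /inventory) are placeholders, not the project’s real components.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracingSketch {

    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("checkout-service"); // placeholder name

    String callInventoryService() {
        // Each instrumented hop produces a span; stitched together across services,
        // the spans form the distributed trace we clicked through in Lightstep.
        Span span = tracer.spanBuilder("GET /inventory").startSpan();
        try (Scope scope = span.makeCurrent()) {
            return doHttpCall();                            // the real outbound request
        } catch (RuntimeException e) {
            span.setStatus(StatusCode.ERROR, e.getMessage());
            throw e;
        } finally {
            span.end();                                     // records the duration shown per span
        }
    }

    private String doHttpCall() {
        return "stubbed response";                          // stub so the sketch stands alone
    }
}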

Here’s what we noticed: 

  • For one particular span, the client side was taking around 60 seconds, while the other end was taking only a few milliseconds.
  • The front-end component was timing out at 15 seconds, while the downstream services had much longer timeouts, pointing to stability issues. (As Sam Newman says in Building Microservices: “Timeouts are … easy to overlook, but in a downstream system they are important to get right.”)
     
Observability 3
Exhibit B2

After reviewing the data the trace provided, the team could see that the software was trying to acquire an HTTP connection, but it was taking too much time. Opening a socket is a relatively expensive operation, so connection pools are often used to hold a number of already-established connections that can be leased out to clients wanting to make requests, which helps performance. Our theory was that there weren’t enough connections in the pool to serve the requests, but of course we needed evidence. We needed to see whether the pool was running out of free connections.
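
The post doesn’t name the HTTP client involved, but as an illustration of the pattern, here’s a hedged sketch using Apache HttpClient’s pooling connection manager. The pool sizes and timeout are placeholders, not the project’s real settings.

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledClientSketch {

    static CloseableHttpClient buildClient() {
        // A pool of already-established connections that callers lease and return,
        // avoiding the cost of opening a new socket (and TLS handshake) per request.
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(50);            // placeholder: total connections across all hosts
        pool.setDefaultMaxPerRoute(20);  // placeholder: connections per downstream host

        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(2_000) // ms to wait for a free pooled connection
                .build();

        // If every connection is leased, new requests queue until one is returned,
        // which is exactly the waiting we suspected was behind the latency spikes.
        return HttpClients.custom()
                .setConnectionManager(pool)
                .setDefaultRequestConfig(config)
                .build();
    }
}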

Exhibit C: Metrics

Next, Prometheus metrics such as “number of free connections,” “max number of connections,” and “number of pending connections” were added to the connection pool. Then we deployed our software to production to gather evidence to prove or disprove our theory.
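
The metric names above are the ones we wanted; the wiring below is a hedged sketch of how pool statistics can be exposed with the Prometheus Java simpleclient against an Apache HttpClient pool. The gauge names and sampling interval are assumptions for illustration.

import io.prometheus.client.Gauge;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.pool.PoolStats;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PoolMetricsSketch {

    // Illustrative metric names; match them to your own naming conventions.
    static final Gauge FREE = Gauge.build()
            .name("http_pool_free_connections").help("Idle connections in the pool").register();
    static final Gauge MAX = Gauge.build()
            .name("http_pool_max_connections").help("Maximum connections allowed").register();
    static final Gauge PENDING = Gauge.build()
            .name("http_pool_pending_connections").help("Requests waiting for a connection").register();

    static void exportPoolStats(PoolingHttpClientConnectionManager pool) {
        // Sample the pool on a fixed interval so Prometheus sees fresh values at scrape time.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            PoolStats stats = pool.getTotalStats();
            FREE.set(stats.getAvailable());   // connections sitting idle, ready to be leased
            MAX.set(stats.getMax());          // configured ceiling
            PENDING.set(stats.getPending());  // callers blocked waiting for a connection
        }, 0, 15, TimeUnit.SECONDS);
    }
}

A custom Prometheus Collector that reads the pool on each scrape would avoid the sampling interval entirely, but the gauge-plus-scheduler approach keeps the sketch simple.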

After a day of the software being live, we graphed the metrics from those 24 hours and compared them to our Lightstep graph.

Bingo! We found our smoking gun: we could see a clear correlation.

Observability 4
Exhibit C1

In Exhibit B1, we see latency spikes at exactly the same points in time as the spikes in pending connections in Exhibit C1. This showed that there were no free connections in the pool to serve the requests, leaving them waiting until a connection was freed.

This evidence proved our hypothesis was right (our suspect was the killer)! From there, all we had to do was implement a simple fix to increase the size of the connection pool.

We made our change, deployed the fix, and again monitored our graphs to ensure our changes worked.

Hurrah, the latency was gone! Columbo always catches the killer.

Preparing for Your Own Investigation

Like Lieutenant Columbo, we relied on our trusty tools to catch the killer (aka, fix our software latency problem).

As you strive for observability to find your own software criminals, note that SLOs are the first line of defense in knowing when reliability is getting worse. In our case, they surfaced the large latencies that prompted us to investigate further. Distributed tracing is also incredibly helpful: it allowed us to track down, in Lightstep, the exact call that was causing the latency, within minutes and with laser-sharp focus, giving us a clue as to what the issue might be.

Even in the days of increased web performance with HTTP/2, connection pools are still vitally important. You still need them! Make sure you choose a sensible number of connections when you set them up rather than going with the default settings—they’re more than likely wrong for your scenario. If needed, use load testing to work out what this figure should be.
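
A dedicated load-testing tool (Gatling, k6, JMeter, and the like) is the usual way to do this, but even a rough concurrency sketch like the one below can show where a given pool size starts queuing requests. The URL, user count, and request count are placeholders; point it at a test environment, never production.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LoadSketch {

    public static void main(String[] args) throws Exception {
        URI target = URI.create("https://test-env.example.com/api/health"); // placeholder URL
        int concurrentUsers = 50;    // placeholder concurrency
        int requestsPerUser = 100;   // placeholder volume

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(target).GET().build();
        LongAdder totalMillis = new LongAdder();

        ExecutorService users = Executors.newFixedThreadPool(concurrentUsers);
        for (int u = 0; u < concurrentUsers; u++) {
            users.submit(() -> {
                for (int i = 0; i < requestsPerUser; i++) {
                    long start = System.nanoTime();
                    try {
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                    } catch (Exception e) {
                        // Failures are ignored in this sketch; count them in real tests.
                    }
                    totalMillis.add((System.nanoTime() - start) / 1_000_000);
                }
            });
        }
        users.shutdown();
        users.awaitTermination(10, TimeUnit.MINUTES);

        long totalRequests = (long) concurrentUsers * requestsPerUser;
        System.out.printf("Average latency: %d ms over %d requests%n",
                totalMillis.sum() / totalRequests, totalRequests);
        // Watch the pending-connections metric while this runs: the load at which
        // it starts climbing is the load at which the pool is too small.
    }
}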

Timeouts are important; do not go with the defaults. If the front end times out after 15 seconds, does it make sense for downstream components to time out after 60 seconds? Do your research and talk to the owners of other components to determine what is right for you.
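
Sticking with the Apache HttpClient example from earlier, the values below are placeholders chosen purely to illustrate aligning downstream timeouts with a 15-second front end; they are not recommendations for your system.

import org.apache.http.client.config.RequestConfig;

public class TimeoutSketch {

    // Placeholders chosen so a downstream call gives up well before the 15-second
    // front-end timeout, rather than holding a connection after the caller has gone.
    static RequestConfig downstreamTimeouts() {
        return RequestConfig.custom()
                .setConnectTimeout(2_000)            // ms to establish the TCP connection
                .setConnectionRequestTimeout(1_000)  // ms to wait for a free pooled connection
                .setSocketTimeout(10_000)            // ms of inactivity while waiting for data
                .build();
    }
}

Whatever client you use, the point is the same: downstream timeouts should be chosen in relation to the callers above them, not left at library defaults.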

Just One More Thing

Fixing microservices architectures can be like a game of Whac-A-Mole. Fix one issue and another one pops up. But like the great Columbo, with enough persistent pestering, the Site Reliability Engineer will always catch the culprit.

If you need support adding observability into your software development process, contact us and we’ll help you get started.

Andy Smith

Lead Developer and Crafter

Andy Smith first cut his teeth on a Commodore 64 and had a game published. Certain fruity companies would call the game "revolutionary," others would call it a "shoot em-up." He has been tinkering with computers ever since.