"Programming today is all about doing science on the parts you have to work with."
Designing and maintaining good software is an endless struggle against complexity. Any application of sufficient size quickly grows into a dizzying combinatorial explosion of code paths and components.
And it's not getting any simpler.
Single-server web applications become distributed systems when deployed to platforms like Heroku and AWS. Modern browsers blur the line between client and server. Straightforward programs become complex coordination problems when they run on multiple CPU cores. Practices like test-driven development and guidelines like the SOLID principles can help us model problems and simplify solutions, but most software applications are complex systems, where individual components combine and interact in unexpected ways.
When the unexpected happens in a software system (and it will), it can be daunting to connect cause and effect. Fortunately, software developers can take clues from a much older discipline focused on the care, maintenance, and debugging of complex systems: medicine.
Differential diagnosis is a systematic method used by doctors to match sets of symptoms with their likely causes. A good differential diagnosis consists of four distinct steps:
- List all the observed symptoms.
- List possible causes for the observed symptoms.
- Rank the list of causes in order of urgency.
- Conduct tests to rule out causes in priority order.
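In software terms, the worksheet might look something like this (a purely illustrative Python sketch; the incident details and field names are invented, not a real tool):

```python
# A hypothetical "diagnosis worksheet" -- nothing more than structured notes.
diagnosis = {
    # 1. Observed symptoms: facts only, no speculation.
    "symptoms": [
        "HTTP 502s from /checkout since the 14:05 deploy",
        "p99 latency tripled on the payments service",
    ],
    # 2. Possible causes: anything plausible, written down before judging it.
    "possible_causes": [
        "payment provider outage",
        "database connection pool exhausted by the new code",
        "kernel bug in the latest base image",
    ],
    # 3. The same causes, ranked in order of urgency.
    "ranked_causes": [
        "database connection pool exhausted by the new code",
        "payment provider outage",
        "kernel bug in the latest base image",
    ],
    # 4. Tests conducted so far, and what each one ruled out.
    "tests": [
        ("compare pool utilization before and after the deploy", "nothing ruled out yet"),
    ],
}
```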
If you've ever watched House, you've seen a dramatized version of this process in action. Although it was designed for doctors, thinking like a diagnostician is a powerful way to find and eliminate software defects. Breaking the diagnostic process into discrete steps, each with a single purpose, ensures that each step gets the attention it deserves. Prioritizing causes and treating them in order keeps investigations focused and interventions pragmatic. And conducting tests to rule out hypotheses brings rigor to the arcane art of debugging.
To the whiteboard
When an error occurs, it's tempting to jump in and investigate the most likely cause right away. With a backtrace and a bit of background knowledge, it's easy to start speculating. But a good diagnosis starts with a list of symptoms, not causes. Start any diagnosis by writing down all the symptoms you can observe, whether exceptions, error codes, or just unusual behavior. It's OK to use a text editor or a whiteboard, but it's important to stop and take notes at each step of the diagnostic process. Separating observation from speculation helps ensure you don't exclude or overlook potential causes. And most of the time, listing more symptoms will narrow the possibility space, preventing you from wasting time testing bad hypotheses.
Once you have a list of symptoms, it's time to start considering their causes.
Horses, not zebras
"When you hear hoofbeats in Central Park," goes the medical school adage, "look for horses, not zebras." A bug in your application code is more likely than a bug in your web framework. A bug in the web framework is more likely than a bug in the operating system. Exotic diagnoses make great stories, but the truth is that most bugs are boring. Start by considering the simplest explanations before moving on to more complicated stories.
Then again, the zebra aphorism has a pithy counterpoint: "Patients can have as many diseases as they damn well please." It's worth writing down any theory that comes to mind, even if it isn't the most likely one. Just like writing down symptoms separates "what" from "how," brainstorming explanations is designed to separate "how" from "how likely." Capture anything that seems plausible and save the analysis for later.
First, do no harm
Once you have a list of explanations, it's time to prioritize. Differential diagnosis differs from other deductive methods in that doctors must constantly evaluate risks and make trade-offs that affect the lives of their patients. Lives may not always be at stake when there's a bug in production, but downtime has very real and painful costs. Just as life-threatening events require immediate intervention, severe bugs may call for crude fixes like rollbacks or reboots. Rank your hypotheses in order of priority, then consider the trade-offs and use your best judgment to decide whether to start testing hypotheses or intervene immediately.
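One way to picture that ranking (a hypothetical sketch, not a prescription) is as a sort over two axes: how much damage a cause would be doing if it were true, and how likely it is to be the culprit:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    cause: str
    severity: int    # 1 = users are actively hurting, 3 = cosmetic
    likelihood: int  # 1 = a horse, 3 = a zebra

hypotheses = [
    Hypothesis("bad config value in last night's deploy", severity=1, likelihood=1),
    Hypothesis("third-party API returning malformed JSON", severity=2, likelihood=2),
    Hypothesis("bit flip in the cache",                    severity=2, likelihood=3),
]

# Most urgent first: worst damage, then most plausible explanation.
for h in sorted(hypotheses, key=lambda h: (h.severity, h.likelihood)):
    print(f"{h.cause} (severity={h.severity}, likelihood={h.likelihood})")

# If the item at the top is severe enough, intervene first (roll back, reboot,
# flip the feature flag off) and save the careful testing for afterwards.
```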
Pull the charts
Just as patients in a hospital have charts containing medical history and other background information, your software system probably has a chart, too. Gather information from logs and error reporting systems to inform your analysis. If you're not already collecting system metrics or tracking errors, start now; consider it prudent preventive medicine.
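If the chart is a plain log file, a few lines of scripting can turn it into something a diagnosis can actually use. A rough sketch, assuming error lines contain an " ERROR " marker followed by an exception class (adjust the parsing to whatever your stack really emits):

```python
from collections import Counter

# Tally exceptions by class from an application log -- a crude "chart review".
counts = Counter()
with open("production.log") as log:
    for line in log:
        if " ERROR " in line:
            # Assume the exception class follows the ERROR marker,
            # e.g. "... ERROR ConnectionTimeoutError: could not obtain ..."
            exception = line.split(" ERROR ", 1)[1].split(":", 1)[0]
            counts[exception] += 1

for exception, count in counts.most_common(10):
    print(f"{count:6d}  {exception}")
```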
If your patient is not in grave danger, it's time to get hypothetico-deductive. Start with the highest-priority hypothesis you've identified and try to prove it wrong. Supporting evidence may indicate where to look for bugs, but failing tests drive the deductive process. This may seem counterintuitive at first, but testing and eliminating hypotheses is the fastest way to track a bug to its original cause. In many cases, a few simple tests can eliminate several hypotheses at once. In others, you'll have to order more tests.
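The mechanics of that loop are simple, even when the individual tests aren't. In this hypothetical sketch, each hypothesis carries a check that must hold if the hypothesis is true; a check that comes back false rules the hypothesis out (the probes are stubbed with made-up values):

```python
# Stubbed probes -- in reality these would query metrics, dashboards, or APIs.
def pool_utilization() -> float:
    return 0.42            # pretend: the connection pool looks healthy

def provider_status() -> str:
    return "operational"   # pretend: the payment provider is fine

def cache_age_seconds() -> int:
    return 7200            # pretend: the cache really is stale

# Each hypothesis pairs an explanation with a check that must hold
# if that explanation is true. A check that comes back False rules it out.
hypotheses = [
    ("connection pool exhausted",      lambda: pool_utilization() > 0.95),
    ("payment provider outage",        lambda: provider_status() != "operational"),
    ("stale cache serving old prices", lambda: cache_age_seconds() > 3600),
]

for cause, must_hold in hypotheses:   # already ranked by priority
    if not must_hold():
        print(f"ruled out: {cause}")
    else:
        print(f"still in play: {cause} -- order more tests")
```

Notice that a check coming back true proves nothing on its own; it just keeps that hypothesis on the list and earns it more expensive tests.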
Lab work
Unlike the medical world, where mad science is frowned upon, you may be able to clone your software application and perform as many gruesome experiments as you'd like. If you have enough information to trigger the bug you're trying to diagnose, try to replicate it in a controlled environment, like a staging server with a recent database backup. As you eliminate hypotheses, collect new data, and refine the theories that remain, the cause of your bug will come into focus.
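Once you can trigger the bug on demand, it's worth pinning the reproduction down as an automated test, so the eventual fix is verifiable and the bug can't quietly return. A toy example in pytest style (the buggy function is inlined and invented so the sketch runs standalone; in practice you'd import your real code):

```python
# test_checkout_regression.py -- run with pytest; the test fails until the bug is fixed.

def total_in_cents(subtotal_cents: int, discount_percent: int, refund_cents: int) -> int:
    """Stand-in for the code under suspicion."""
    discounted = subtotal_cents * (100 - discount_percent) // 100
    return discounted - refund_cents  # bug: a refund can push the total below zero

def test_full_discount_plus_refund_does_not_go_negative():
    # Reproduces the symptom seen in production: a 100%-off coupon combined
    # with a shipping refund produced a negative charge.
    assert total_in_cents(subtotal_cents=1999, discount_percent=100, refund_cents=500) >= 0
```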
Thinking clearly about complex systems requires care and attention. Using a structured diagnostic process to guide investigations saves time and prevents frustration. Best of all, it works. Next time you run into a bug, set aside the keyboard, step up to the whiteboard, and start debugging like a doctor.