Diagnosing Problems Quickly

Everything is on fire.

Okay, not really. One thing is on fire. Or is it?

You have no idea. The client knows. The client knows that everyone is screaming and nothing is happening and the dollar signs are evaporating. More are going to go the way of the dinosaur unless you, the developer on call, do something. And quickly!

But how? How do you respond quickly to something breaking? How do you get everything sorted?

Error Logs

Well, if you have an error reporting service, you're one step ahead. Services like Sentry and Rollbar can tell you when the error happened, in what browser, and give you a stack trace of said error.
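To make that concrete, here is a minimal sketch of the kind of report such a service puts together, using only Python's standard library. It is not how Sentry or Rollbar work internally; the `build_error_report` helper, its field names, and the `apply_discount` function are all hypothetical and exist only for illustration.

```python
import traceback
from datetime import datetime, timezone


def build_error_report(exc, user_agent=None):
    """Collect when, where, and what for a caught exception."""
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1] if frames else None
    return {
        "when": datetime.now(timezone.utc).isoformat(),
        "browser": user_agent,
        "error": f"{type(exc).__name__}: {exc}",
        # The innermost frame is the place where the error was actually raised.
        "where": f"{innermost.name} (line {innermost.lineno})" if innermost else None,
        "trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


def apply_discount(order):
    # Hypothetical business logic that assumes a "discount" key exists.
    return order["total"] * (1 - order["discount"])


try:
    apply_discount({"total": 100})
except KeyError as exc:
    report = build_error_report(exc, user_agent="Mozilla/5.0 (hypothetical)")
    print(report["error"], "in", report["where"])
```

Notice what the report does not contain: nothing in it says why the order had no discount in the first place.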

Of course, that just tells you where it happened, "it" being an actual thrown error. What blew up where. But not how. How did all the parts and cogs of your well-oiled machine screw up so that, right in that particular place in the code, it all came falling apart?

Follow the Data

Data is usually the key. Something's expecting one thing but got another. Tracing through the function calls back to the origin will tell you which piece of data is the problematic one. The stack trace is your friend. The exact URL or API endpoint hit that made it happen is another. Better still if the requested URL contains a record ID, or your error reporting service includes that crucial data.

If you have the ID, you can retrace all the steps and find out what transformation, or merely what value, is 'bad'. Go into the database, open up a Rails console or whatever shell you have to access the data as the application sees it, and see what's lurking underneath the layers. Maybe the answer is simply there. If it is, the fix is as easy as updating the record and retrying the activity.
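As a rough sketch of that retracing, assume a hypothetical URL shape like `/orders/<id>` and a record that your shell hands back as a plain dictionary. The function names, the field names, and the expected types below are all made up for illustration; swap in whatever your application actually uses.

```python
import re
from urllib.parse import urlparse


def extract_record_id(url):
    """Return the last all-numeric path segment of the URL as an int, or None."""
    path = urlparse(url).path
    numeric_segments = re.findall(r"/(\d+)(?=/|$)", path)
    return int(numeric_segments[-1]) if numeric_segments else None


def find_suspect_fields(record, expected_types):
    """List fields that are missing, None, or of an unexpected type."""
    problems = []
    for field, expected in expected_types.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif record[field] is None:
            problems.append(f"{field}: is None")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


failing_url = "https://example.com/orders/42/edit?tab=items"
record_id = extract_record_id(failing_url)

# Stand-in for whatever your console or shell returns for that ID.
record = {"id": 42, "total": "19.99", "customer_id": None}
expected = {"id": int, "total": float, "customer_id": int, "status": str}

for problem in find_suspect_fields(record, expected):
    print(f"record {record_id} -> {problem}")
```

Here the total stored as a string, the empty customer, and the missing status would each be worth a closer look before touching any code.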

Congrats, that's probably the easiest bug there is. Aside from a face-palm error like messing up a function call chain or something. Simple stuff. Didn't even break a sweat.

Tracing through the Stack

The real tricky one is a bug that's not a bug (i.e., a feature... er, no). It's incorrect functionality. The wrong business-thing is happening. The app thinks it's doing the right work, but at the end of the day, the end user is saying... this is wrong.

In which case, the steps are similar, but you don't get the stack trace. You know where it starts. Maybe you kind of know where it ends.

Where it starts is where you start. Provided you can replicate it locally, start picking apart the process from there. If it's the frontend acting up, start console logging the actions. What functions are being called where. Order of operations. What data is where and how it transforms, or doesn't.

Then you start logging in the controller or the API endpoint. Where frontend slams into the backend. Log what's there. Then log all the steps as the data moves through the system until it gets to its end. Does that sound straightforward? It is.
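One way to log every step without littering the code with print statements is a small decorator. This is only a sketch: `log_step` and the three functions it wraps are hypothetical stand-ins for your own controller and service code, and the string `unit_price` is a deliberately planted "wrong business-thing".

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("trace")


def log_step(func):
    """Log the arguments and the return value (or exception) of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("-> %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            log.exception("!! %s raised", func.__name__)
            raise
        log.debug("<- %s returned %r", func.__name__, result)
        return result
    return wrapper


@log_step
def parse_quantity(raw):
    return int(raw)


@log_step
def line_total(quantity, unit_price):
    return quantity * unit_price


@log_step
def handle_request(payload):
    # Pretend this is the controller: the first place the payload lands.
    quantity = parse_quantity(payload["quantity"])
    return line_total(quantity, payload["unit_price"])


# No exception is thrown, but the logged return value of line_total is
# clearly not a price: the string "4.50" was repeated three times.
handle_request({"quantity": "3", "unit_price": "4.50"})
```

Nothing blows up here, which is exactly the point. The only way to spot the bad value is to read what each step received and returned.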

Wherever in the system you find things going haywire is where your journey will truly get wonky. Especially if it's a very narrow use case or some weird quirk. But still, logging every step is the best way to figure out what's going on. If you can pry open the process locally, even better. "Locally" implies, of course, that the problem is reproducible locally.

Production-only Problems

If it's in production... check the network tab in Chrome. See what calls to the backend are being made, what's being sent, and what's being returned. The network tab is a great tool in general for troubleshooting, especially if you have no idea what is talking to what. It shows you the endpoints and the returned data (or error). It's invaluable in determining how the frontend and the backend aren't cooperating.
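If the session has a lot of traffic, the network tab's contents can be exported as a HAR file (a JSON format) and sifted with a script instead of by eye. The sketch below assumes the standard HAR layout of `log.entries[].request` and `log.entries[].response`; the sample data is made up and trimmed to the fields used, and treating status 0 as "no response received" is an assumption about how your browser records failed requests.

```python
import json

# A trimmed, made-up HAR export; real files carry many more fields per entry.
SAMPLE_HAR = json.loads("""
{"log": {"entries": [
  {"request": {"method": "GET", "url": "https://example.com/api/orders/42"},
   "response": {"status": 200}},
  {"request": {"method": "PATCH", "url": "https://example.com/api/orders/42"},
   "response": {"status": 422}},
  {"request": {"method": "GET", "url": "https://example.com/api/customers/7"},
   "response": {"status": 0}}
]}}
""")


def summarize_har(har):
    """Return (method, url, status) for every request in a parsed HAR export."""
    return [
        (e["request"]["method"], e["request"]["url"], e["response"]["status"])
        for e in har["log"]["entries"]
    ]


def failed_calls(har):
    """Keep calls with a 4xx/5xx status, or status 0 (assumed: no response)."""
    return [call for call in summarize_har(har) if call[2] == 0 or call[2] >= 400]


for method, url, status in failed_calls(SAMPLE_HAR):
    print(status, method, url)
```

With a real export you would load the file with `json.load` instead of the inline sample, then look at what the failing calls sent and got back.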

Know Thy Enemy

These are some of the steps for what to do once you have a vague idea of a problem. Before that, you need to ask your client or a user experiencing the problem specific questions. Where is this happening? Can you screenshot what is wrong? If you can even get them to screenshot the console log in their browser or the network tab, you've got a wealth of data to trace the problem back through. Ask which specific record is being modified. If all you have to go on is "this page isn't working", you're going to get nowhere fast. Don't even bother looking into it unless you have some other clue confirming something is happening. The absolute best case is when you can screenshare with the user experiencing it. Then you can do the troubleshooting while it's actually happening instead of hoping to find some way to replicate it.

Where You Go From Here

So, troubleshooting effectively can be reduced to this:

Know the problem. Acquire detail.
Find accessory evidence for the problem. Logs, error messages, etc. Find the original data being used.
If you can reproduce it locally, log every step of the way through the application.
If you can't replicate it locally, check the network tab in production while working with the affected record. Speak to the client/user to find out what is happening where, what they do to get the error, and understand the data being used to do that work. Maybe create a copy locally or on a staging server to test with.

If you use these steps and find the problem, congrats! Refine the process as you encounter more issues and get smarter about things that seem 'smelly'. As in, follow your gut a bit. If, after all your experience, you think X is the culprit, check it out. Your instincts are valuable in the troubleshooting process. Don't discount them. Use them to find that bug and crush it.

The act of crushing, as in, fixing your system...well, no list or blog post is going to give you 100% of the solution. That's for you to figure out.


Sarah Sunday
