Everybody OOMs

Colin Jones

March 17, 2017

We've all run our applications out of memory from time to time. In development and testing, an OutOfMemoryError, or OOM, is fine. We catch the error early, sigh at ourselves for making a silly mistake, and mutter jokes about Turing machines not having this problem as we patch things up. But in production, if a project is like many I've seen, we might just restart the app.

"Whoops!" we say, "it's out of memory again, better roll it." If we're sufficiently automation-addicted (I mean seriously, let's not be on call for this), we may even deploy some process whose sole responsibility is to kick the app over and restart it when memory usage gets too high. Some folks might see an analogy here to the Erlang "Let it crash" philosophy, but the context is crucial: we don't really want the whole application server to go down.

Last week we ran one of our projects out of memory. But with the help of some pretty sweet JVM tooling, we were able to find the root cause efficiently and get a fix rolled out. If you're not on the JVM (maybe especially then), I'd advise you to read on. This tooling is... well, it's a thing, and a thing that I want to be pervasive in every language/platform where I work.

Because I think these techniques are non-obvious for many developers, and because people love to hate on the JVM, I thought I'd write up some notes on how we tracked down the issue to solve it for real, while keeping customer impact minimal.

Here's how things went down:

Alert! Alert!

We got an alert from CloudWatch that the site was slow. Upon further investigation, we found that CPU was pegged on all of the project's web application servers.

We've occasionally had issues that looked like this before and seen them go away upon restarting the application server. But that kind of solution feels unsatisfying. When we don't fully understand why a problem happened, we can't be confident that it's gone for good. And that means we're at risk of future downtime.

So we wanted to know what was really going on. But we also wanted happy customers, so we added new application servers to the load balancer and removed the misbehaving ones. This let us continue investigating the problem servers while simultaneously getting things back up and running quickly.

We got information on the specific thread that was being naughty by using top -H to show per-thread CPU usage, converting the misbehaving thread's ID to a hex value, and matching that against the nid field in the output of jstack (which gives us stack traces of all JVM threads). This got us a thread name, VM Thread, but not much of use from the VM Thread stack trace itself. The team lead, Mike Danaher, had the idea that it might be garbage collection, since VM Thread runs VM-wide kinds of operations, including GC (depending on what GC algorithm you're using). It turns out GC is a CPU-intensive operation. You can either take my word for it or read Java Performance: The Definitive Guide, but I recommend doing both.
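In case it's useful, here's roughly what that dance looks like, with PID_HERE, THREAD_ID_HERE, and THREAD_ID_HEX standing in for the real values (none of these are from our incident):

top -H -p PID_HERE                                  # per-thread view; note the ID of the thread burning CPU
printf '%x\n' THREAD_ID_HERE                        # jstack reports native thread IDs in hex, as nid=0x...
jstack PID_HERE | grep -A 20 'nid=0xTHREAD_ID_HEX'  # find that thread's name and stack trace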

So, is it GC? Finding this answer feels a bit like jumping ahead, but there are only a small number of things that happen on the VM Thread. I suspect we could have used lower-impact tools like perf to collect stack traces directly, but as you'll see, it did turn out to be GC.

jstat -gcutil -t -h20 PID_HERE 1s gives a bunch of GC statistics, printing the latest info every second. Here -gcutil asks for a particular summary of garbage collection statistics, -t includes a timestamp column, -h20 prints a header row every 20 lines of output, and 1s sets the sampling interval to one second. Check out man jstat for much more detail on the available flags, as well as how to interpret the output. We saw an 8-second GC after only 9 seconds of runtime, which means that GC was taking the vast majority of the process's time.
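For reference, here's the command again with the columns we cared about most; PID_HERE is a placeholder, and the column meanings come from man jstat:

jstat -gcutil -t -h20 PID_HERE 1s
#   Timestamp  - seconds since the target JVM started
#   O          - old-generation occupancy, as a percentage of its current capacity
#   FGC / FGCT - full-GC count and cumulative full-GC time, in seconds
#   GCT        - total GC time, in seconds; if it climbs nearly as fast as Timestamp, GC is eating the CPU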

Incidentally, hindsight is 20/20: we could've avoided the potentially obscure JVM tooling by just looking at the GC logs that the app generates. Enabling GC logs is easy, although deciding exactly which flags to use seems to be a bit of a dark art. Again, check out Java Performance: The Definitive Guide for more detail. We already had these GC flags enabled, for what it's worth:

-Xloggc:log/gc.log -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=16M

At any rate, we forgot to look at those GC logs, but jstat did get us the info we wanted.
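If you do remember to look at the GC logs, something like this makes a quick first pass. It's a sketch assuming the log path from the flags above; the exact line format varies by JVM version and GC flags:

grep 'Full GC' log/gc.log | tail -20    # with -XX:+PrintGCDetails, the expensive pauses show up as Full GC entries
# each entry ends with something like [Times: user=... sys=... real=... secs]; "real" is the wall-clock pause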

Where does the garbage come from?

So we knew that GC was causing the slowness / CPU burn. Now what was actually using memory?

Running jcmd PID_HERE GC.class_histogram gave us a snapshot of the classes with the most objects on the heap. It looked something like this:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      19880123      606978696  [B
   2:      14512072      421377376  [Ljava.lang.Object;
   3:      14410179      345844296  clojure.lang.PersistentHashMap$BitmapIndexedNode
   4:       2479896      277740000  [[B
   5:        794477      114404688  [Lclojure.lang.PersistentHashMap$INode;
   6:        980584       50838528  [C
   7:       1557412       49837184  java.sql.Timestamp
   8:        801269       32050760  clojure.lang.PersistentHashMap
   9:       1630449       26087184  java.lang.Integer
  10:        809041       25889312  clojure.lang.LazySeq

Seeing [B (that's pronounced "byte array") is pretty normal on the JVM, but I'm more used to seeing it accompanied by java.lang.String entries. This is because strings are made up primarily of byte arrays. Also: I don't do operations work often enough to have this notation memorized. I googled "[B" "[Ljava.lang.Object" java to recall what they meant. Don't ever assume people know what they're doing, just because they know about cool tools. [ed: Proving my point even further, the underlying data behind a String is a char array, not a byte array, even though .getBytes is a common conversion.]

So we've got a byte array, an object array, a Clojure map, an array of byte arrays... To our eyes there was no obvious culprit, so we grabbed a full heap dump using jcmd PID_HERE GC.heap_dump. Or we tried to: the heap dump failed, because we were completely out of disk space. This EC2 instance had an 8GB main volume, and we had maxed it out.
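For the record, here's the shape of those commands. The output path is a made-up example (GC.heap_dump wants to be told where to write), and df is what would have warned us about the full disk:

df -h .                                          # check free space first; heap dumps can run to several GB
jcmd PID_HERE GC.heap_dump /tmp/app-heap.hprof   # writes an hprof file to the path you give it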

Luckily for us, we'd set the JVM flag -XX:+HeapDumpOnOutOfMemoryError. So it turned out there was a 3GB heap dump file (java_pid28763.hprof) sitting waiting for us in the application server's working directory. We were wary of the time it'd take to download this large file, so we figured we'd compress it with gzip. This was taking an annoyingly long time, though, so we just scp'ed it directly. It wasn't too bad, just a few minutes to pull it down locally for analysis.
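If you're ever in the same spot, scp can also compress in transit, which can get you much of the gzip benefit without a separate compression step. The host and path here are placeholders:

scp -C appserver:/path/to/java_pid28763.hprof .   # -C enables compression on the wire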

Heap dump analysis

I'm not a tools guru, but Eclipse MAT has worked well for me in the past. Unfortunately this time, it failed to load the heap dump: the memory analyzer itself (which also runs on the JVM) ran out of heap space! This makes sense in retrospect: to work on a large heap dump, you need lots of heap. So we found a seam to add VM flags to allow a bigger maximum heap for the MAT program: in /Applications/mat.app/Contents/Info.plist on my Mac, we just added <string>-vmargs</string><string>-Xmx5g</string>.

That made things quite a bit happier. It did take a while to preprocess the heap, and the program used some words that I thought were funny, but eventually we had a nice view of the heap dump we'd captured.

Inside MAT, the first thing we saw was a dashboard with a graph of the "Biggest Objects by Retained Size," along with a bunch of links to things like a "Histogram" (like we saw via jcmd earlier), a "Dominator Tree" (there's one of those awesome names again!), and "Leak Suspects." OMG so many things!

[Image: MAT's overview dashboard, with the "Biggest Objects by Retained Size" pie chart (dashboard.png)]

I'm not going to bore you with all the details of what MAT can do. The "Dominator Tree" sounded pretty awesome, and what was inside was even better: basically a table-based version of the pie chart above.

[Image: the Dominator Tree view (dominator-tree.png)]

The first row, at around 1GB of retained heap, was a java.lang.Thread object. It's typical for a Thread to be the top-level owner of whatever data it's currently processing, maybe some thread-local caching, and I'm sure plenty of other things. But the one that really jumped out to me as interesting was the second row: org.postgresql.jdbc4.Jdbc4ResultSet. That looks pretty database-y, and feels like it brings us right up close to a potential issue. Why would a database query in our web application be using over 700MB of heap space? Is the database storing CD images or something?! [Sorry, not sorry.]

The Inspector tool in MAT gives us a more detailed view of things we click on. And in this case, clicking on the Jdbc4ResultSet lets us drill down into that object's attributes!

[Image: the Inspector pane showing the Jdbc4ResultSet's fields (inspector-with-query.png)]

Pretty neat stuff, and even more excitingly, one of the fields is called originalQuery. Well that sounds promising—we can get ahold of the original query that's presumably causing this memory bloat?! Sure enough, right-click and "Go Into", follow another object reference down the chain (fragments, an array of Strings), and we've got our query:

SELECT "redacted_large_table_name_here".* FROM "redacted_large_table_name_here"

Incredible! We found an expensive query that would absolutely cause memory bloat. It would be cool if we automatically knew where that query came from, but that wasn't the case. So now we had to figure out what code was making that query.

That right-click menu has tons of stuff in it, and one of the things is "Path to GC Roots".

[Image: the right-click menu with "Path to GC Roots" (right-click-gc-roots.png)]

The idea is that we can trace the object references back to whatever is the main owner. In our case, it turns out that the GC root for this Jdbc4ResultSet is a java.lang.Thread (the same one that was the top memory owner).

[Image: the path from the Jdbc4ResultSet back to its GC root, a java.lang.Thread (gc-root-paths.png)]

After a little more digging around in the right-click menu, we realized that given a Thread, we can use "Java Basics" -> "Thread Details". So we could take this GC root Thread and look at the stack trace!

[Image: the Thread Details view with the thread's stack trace (thread-details.png)]

At this point, we've got the specific code path that led to the memory bloat. Which is totally awesome.

Let's not dig into the code that caused the bug. Suffice it to say, it was a rarely traveled code path, depending on rarely entered input, that resulted in a missing WHERE clause in our database query. We rolled out a fix quickly, and things have been running smoothly ever since.

Retro

As usual with these kinds of investigations, the actual bug that caused this issue isn't that important. The bigger lesson for our team was the process of using this memory analyzer to track down our issue.

Besides the terrific IDEs and refactoring tools, this is another reason you hear people saying the JVM has great tooling. I assume there are other language/runtime ecosystems (maybe .NET?) with tooling that approaches this level, but this kind of deep, runtime-specific insight is pretty intense. I certainly haven't seen it in other languages I've used. So... chalk one up for the JVM!

Tooling as robust as this can help you to dig in and get to the bottom of production issues when they happen, instead of resigning yourself to restarting the application. Regardless of the runtime you're using, get familiar with the tools that are available to you—they might just save your app!

Glossary

  • CloudWatch: an AWS service that allows alerting and monitoring for various statistics. It told us that latency was high for requests through our load balancer.
  • top: a typical command-line unix tool to show CPU usage. We used the -H flag to show thread specifics.
  • jstack: a command-line JVM tool to show stack traces for all running threads at a point in time.
  • jstat: a command-line JVM tool to show running totals for various statistics (depending on the flag you pass). We used -gcutil to see garbage collection stats.
  • jcmd: a command-line JVM tool that does all kinds of things. We used:
    • the GC.class_histogram command to show a snapshot of top memory consumers
    • the GC.heap_dump command to [attempt to] create a heap dump
  • -XX:+HeapDumpOnOutOfMemoryError: a JVM flag to generate a heap dump when... well, the flag's name tells you when.
  • -Xloggc:log/gc.log -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=16M: a bunch of JVM flags we used [but forgot to take advantage of] to generate garbage collection statistics.
  • MAT: Eclipse Memory Analyzer Tool. It allowed us to navigate a heap dump and see what objects were retaining memory, what objects were retaining them, etc.
  • -Xmx5g: a JVM flag to control heap space. We bumped the heap available to MAT up to 5GB so it could process the whole heap dump.