Ask DTrace: Why are my tests so slow?

Colin Jones

December 01, 2015

We were in pain. We'd just joined a new Rails project, and the test suite took over 10 minutes to run. I've seen worse, but that was before I knew anything about DTrace. I wanted to know why it was so slow, and what it'd take to make it fast.

Last time, we saw a few neat things DTrace can do, but now we're going to see a practical example of how it can actually help you out day-to-day. To help motivate you, I'll skip right to the punch line: we cut the test suite's running time by better than half with a 1-line change, and DTrace was crucial in tracking down the cause of the slowness.

Now let's roll back in time to the beginning: we have a slow test suite, and I've got no shortage of theories as to what's making it slow. I've seen seconds or minutes shaved off Rails test suites by moving from factories to fixtures, avoiding slow ActiveRecord lifecycle hooks, turning off observers, decoupling from Rails entirely, and plenty of other strategies. Every Rails developer who cares about fast test feedback has tried these things. But lately I've been understanding more and more that just as with other hard problems, performance problems are best solved by first understanding the problem, before jumping to possible solutions. In walked DTrace.

First on the agenda: what resources are being consumed most heavily during the test run? Is it disk? network? CPU? memory? Brendan Gregg's USE method page for OS X lists some great tools to scratch the surface of performance problems.
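For a first pass at the other resources (memory and network), a couple of stock OS X tools are enough. These particular commands weren't part of this investigation, but they fit the same quick-look spirit:

$ vm_stat 1      # memory: free/active/wired pages and pageins/pageouts, once per second
$ netstat -w 1   # network: input/output packets and bytes, once per second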

All of the following command-line examples were performed with the test suite running, after the initial code-loading and startup stuff (that'd be interesting to explore too, but outside the scope of this investigation).

Where is the system spending time?

We're definitely using some disk and CPU:

$ iostat 1
 disk0 disk2 cpu load average
 KB/t tps MB/s KB/t tps MB/s us sy id 1m 5m 15m
 5.62 311 1.70 0.00 0 0.00 20 14 66 1.93 1.99 2.15
 42.83 273 11.43 0.00 0 0.00 19 10 71 1.93 1.99 2.15
 30.04 112 3.27 0.00 0 0.00 21 9 70 1.93 1.99 2.15
 26.71 141 3.68 0.00 0 0.00 19 10 71 1.93 1.99 2.15

The first line shows that when the metrics were collected, the CPU was 20% user time (us), 14% system/kernel time (sy), and 66% idle (id).

It's hard to know how to interpret these numbers on my development machine, since there may be other things going on like system indexing, web browsing, automated backup, etc. But we have to start somewhere! The load average seems kind of high since I only have two processors on my laptop. Let's take a quick look at top, sorted by CPU usage, to make sure it's the test run that's causing the CPU load.

$ top -o cpu
Processes: 272 total, 3 running, 3 stuck, 266 sleeping, 1666 threads 09:41:13
Load Avg: 2.00, 2.01, 2.18 CPU usage: 26.63% user, 8.52% sys, 64.83% idle SharedLibs: 1452K resident, 0B data, 0B linkedit. MemRegions: 56624 total, 3132M resident, 61M private, 412M shared.
PhysMem: 7024M used (1480M wired), 1166M unused. VM: 681G vsize, 1026M framework vsize, 13730951(0) swapins, 14769117(0) swapouts. Networks: packets: 42172600/23G in, 26937732/10G out. Disks: 26856753/266G read, 13907921/306G written.

PID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PGRP PPID STATE BOOSTS %CPU_ME %CPU_OTHRS UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH CSW PAGEINS IDLEW POWE
76355 ruby 93.9 00:05.76 2/1 0 17 164M+ 0B 0B 76355 67443 running *0[1] 0.00000 0.00000 501 48477+ 605+ 146 62 1006130+ 517+ 2840+ 0 0 93.9
280 mds_stores 33.2 37:16.48 6 4 66- 90M+ 1516K 34M 280 1 sleeping *0[1] 0.00000 32.99901 0 28708816+ 50302 28020174+ 13965778+ 93394291+ 25841994+ 32170381+ 1988248+ 89348 33.2
64 mds 5.1 62:11.95 10 7 607+ 25M- 0B 33M 64 1 sleeping *0[1] 0.00000 3.78791 0 47422915+ 2888 100960224+ 51541514+ 346518839+ 80922925+ 86856281+ 3131 313515 5.1
350 Finder 1.4 04:30.63 9 3 388+ 26M- 0B 38M 350 1 sleeping *0[7504+] 2.79131 0.00000 501 3820018+ 2905 1165710+ 321061+ 10432182+ 2981232+ 3213172+ 37538 11310 1.4
342 iTerm 1.4 40:25.64 9 1 605 36M 0B 48M 342 1 sleeping *0[22716+] 0.00000 0.09288 501 2567934+ 85616 7940320+ 1976360+ 63129208+ 21811017+ 10492837+ 8965 330277 1.4

Interestingly, while the total system-wide user + system CPU percentages from top only add up to ~35%, the CPU load in top for just the Ruby process is almost 94%! That's less contradictory than it sounds: top reports a process's %CPU relative to a single core, while the system-wide figures are averaged across both cores. See Brendan Gregg's Broken Performance Tools slides for more on how top and other tools can be misleading. Ultimately, the conclusion in our case is that our tests run on a single processor at a time (of the two total), so we may indeed be CPU-bound here, even though we have one entire CPU idle (50% of the machine). And because system time makes up a decent chunk of the Ruby process's CPU usage, I'm interested in which system calls are taking up that time.
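Before digging in, one cheap sanity check (sketched here, not something we actually ran as part of this investigation): time the suite directly and compare CPU time to wall-clock time. If user + sys comes out close to real, the run really is CPU-bound on one core; if real is much larger, we're mostly waiting on something else.

$ /usr/bin/time bundle exec rake test    # substitute whatever command actually runs your suite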

Digging into the details

So let's use DTrace to see what system calls (syscall:::entry) are triggered by our Ruby process (/execname=="ruby"/), and sum the system call counts into a DTrace aggregation (@[probefunc] = count();) whose top five entries we can inspect every second before clearing it for the next second's results (tick-1s { trunc(@,5); printa(@); trunc(@);}). Check out the DTrace Guide if you're looking for a DTrace tutorial—I'll have more DTrace examples here but won't go into detail on the language. Here's what this looks like:

$ sudo dtrace -n 'syscall:::entry /execname=="ruby"/ { @[probefunc] = count(); } tick-1s { trunc(@,5); printa(@); trunc(@);}'
dtrace: description 'syscall:::entry ' matched 491 probes
CPU ID FUNCTION:NAME
 0 278139 :tick-1s
 fcntl_nocancel 1102
 fstat64 1102
 open_nocancel 1102
 stat64 1162
 open 117372

 0 278139 :tick-1s
 fcntl_nocancel 1146
 open_nocancel 1146
 fstat64 1150
 stat64 1301
 open 122042

 0 278139 :tick-1s
 fcntl_nocancel 1056
 fstat64 1056
 open_nocancel 1056
 stat64 1116
 open 112358

Wow. There are lots and lots of open calls, around 100x the count of anything else: over 100,000 calls to open every second! That seems like an awful lot. It might seem normal to open lots of files during startup, when language and library code is being loaded, but the only files I knew of that needed to be opened at this point were the test files themselves—and we sure as heck weren't running anything approaching 100k test files per second.

OK, so what was causing all the open calls? The ustack() action doesn't always help [1], but in this case it worked OK for our purposes: the top hits all look related to something called psych.bundle.

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby"/ { @[ustack()] = count(); } tick-1s { trunc(@,3); printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 2 probes
CPU ID FUNCTION:NAME
 0 278139 :tick-1s
 libsystem_kernel.dylib`__open+0xa
 ruby`rb_require_safe+0x35a
 ruby`rb_protect+0xd0
 ruby`load_encoding+0x118
 ruby`rb_enc_find_index+0xa9
 psych.bundle`parse+0x1c6
 ruby`vm_call_method+0x3bc
 ruby`vm_exec_core+0x2583
 ruby`vm_exec+0x72
 ruby`rb_yield+0x63
 ruby`rb_hash_fetch_m+0x75
 ruby`vm_call_method+0x3bc
 ruby`vm_exec_core+0x2583
 ruby`vm_exec+0x72
 ruby`rb_yield+0x63
 ruby`rb_ary_collect+0x79
 ruby`vm_call_method+0x3bc
 ruby`vm_exec_core+0x2583
 ruby`vm_exec+0x72
 ruby`rb_yield+0x63
 7952

 libsystem_kernel.dylib`__open+0xa
 ruby`rb_require_safe+0x35a
 ruby`rb_protect+0xd0
 ruby`load_encoding+0x118
 ruby`rb_enc_find_index+0xa9
 psych.bundle`parse+0x1b7
 ruby`vm_call_method+0x3bc
 ruby`vm_exec_core+0x2583
 ruby`vm_exec+0x72
 ruby`rb_yield+0x63
 ruby`rb_hash_fetch_m+0x75
 ruby`vm_call_method+0x3bc
 ruby`vm_exec_core+0x2583
 ruby`vm_exec+0x72
 ruby`rb_yield_values+0x15d
 ruby`invoke_block_from_c+0x1e7
 ruby`rb_yield+0x63
 ruby`rb_ary_each+0x49
 ruby`vm_call0+0x1ed
 ruby`rb_iterate+0x168
 12496

 libsystem_kernel.dylib`__open+0xa
 ruby`rb_require_safe+0x35a
 ruby`rb_protect+0xd0
 ruby`load_encoding+0x118
 ruby`rb_enc_find_index+0xa9
 psych.bundle`parse+0x1c6
 ruby`vm_call_method+0x3bc
 ruby`vm_exec_core+0x2583
 ruby`vm_exec+0x72
 ruby`rb_yield+0x63
 ruby`rb_hash_fetch_m+0x75
 ruby`vm_call_method+0x3bc
 ruby`vm_exec_core+0x2583
 ruby`vm_exec+0x72
 ruby`rb_yield_values+0x15d
 ruby`invoke_block_from_c+0x1e7
 ruby`rb_yield+0x63
 ruby`rb_ary_each+0x49
 ruby`vm_call0+0x1ed
 ruby`rb_iterate+0x168
 12609

OK, so Psych is a YAML library (which I think I knew at some point). Now I'd like to know which files are being read. Maybe this has something to do with configuration files or test fixtures?

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby"/ { @[copyinstr(arg0)] = count(); } tick-1s { trunc(@,10); printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 2 probes
CPU ID FUNCTION:NAME
 0 278139 :tick-1s
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_be.so.rb 64
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.bundle 64
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.rb 64
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.bundle 66
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.bundle 96
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.bundle 96
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.rb 96
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.bundle 96
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.rb 96
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.bundle 99

 0 278139 :tick-1s
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_be.bundle 70
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_be.so.rb 70
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.bundle 70
 /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.rb 70
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.bundle 102
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.bundle 102
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.bundle 105
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.rb 105
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.bundle 105
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.rb 105

Huh, that's pretty weird—I have no idea what those files are. There's nothing in the application code that refers to "utf16_le" or "utf16_be". In fact, on a closer look, it turns out these files don't actually exist—so these system calls must actually be failing. Since these all look pretty similar, what if we zoom back out and show the top 50 instead of only the top 10?

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby"/ { @[copyinstr(arg0)] = count(); } tick-1s { trunc(@,50); printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 2 probes
CPU ID FUNCTION:NAME
 3 278139 :tick-1s
 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/1.9.1/x86_64-darwin13.2.0/enc/utf16_be.so.rb 39
 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/1.9.1/x86_64-darwin13.2.0/enc/utf16_le.so.bundle 39
 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/enc/utf16_be.bundle 39
 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/enc/utf16_be.so.rb 39
 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/enc/utf16_le.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/api/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/api/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/api/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/app/assets/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/assets/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/assets/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/app/controllers/concerns/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/controllers/concerns/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/controllers/concerns/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/app/controllers/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/controllers/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/controllers/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/app/helpers/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/helpers/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/helpers/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/app/mailers/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/mailers/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/mailers/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/app/models/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/models/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/models/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/app/workers/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/app/workers/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/app/workers/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/lib/assets/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/lib/assets/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/lib/assets/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/lib/tasks/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/lib/tasks/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/lib/tasks/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/test/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/test/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/test/enc/utf16_be.so.rb 39
 /Users/colin/Projects/clients/redacted/vendor/enc/utf16_be.bundle 39
 /Users/colin/Projects/clients/redacted/vendor/enc/utf16_be.so.bundle 39
 /Users/colin/Projects/clients/redacted/vendor/enc/utf16_be.so.rb 39
 /usr/local/Cellar/rbenv-gem-rehash/1.0.0/enc/utf16_be.bundle 39
 /usr/local/Cellar/rbenv-gem-rehash/1.0.0/enc/utf16_be.so.rb 39
 /usr/local/Cellar/rbenv-gem-rehash/1.0.0/enc/utf16_le.so.bundle 39
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.bundle 114
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.bundle 114
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.rb 114
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.bundle 117
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.bundle 117
 /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.rb 117

Same deal as before: the file doesn't exist in any of these locations where the Ruby app is searching. Let's get DTrace to show us where the actual file eventually gets loaded from, by tracking the open calls with "utf16_" as part of the filename, where the return value is not -1 (the error value, from man 2 open). We can use a DTrace thread-local variable (self->location) to carry the filename across from the syscall entry to the return:

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby" && strstr(copyinstr(arg0), "utf16_") != NULL/ { self->location = arg0; } syscall::open:return /self->location != NULL && arg0 > -1/ { @[arg0, copyinstr(self->location)] = count(); } tick-1s { printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 3 probes
CPU ID FUNCTION:NAME
 1 278139 :tick-1s
 20 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/rack-mount-0.8.3/lib/rack/mount/vendor/regin/enc/utf16_be.so.bundle 2

 1 278139 :tick-1s

 1 278139 :tick-1s
 20 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/rack-mount-0.8.3/lib/rack/mount/vendor/regin/enc/utf16_be.so.bundle 7

 1 278139 :tick-1s

 1 278139 :tick-1s
 20 /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/rack-mount-0.8.3/lib/rack/mount/vendor/regin/enc/utf16_be.so.bundle 4

 1 278139 :tick-1s

 1 278139 :tick-1s

 1 278139 :tick-1s

OK, so this part is a little weird: that file location also doesn't exist, but for some reason open(2) isn't failing. I'm genuinely interested in knowing how this can be [2], but that location is the only one that shows up. Plus, since it definitely doesn't exist in reality, I'm starting to feel confident that the file isn't ever getting loaded from any of the locations that Ruby is trying to open. So I'm not going to dig into it any further for now.

Strange things are afoot

At this point, based on the list of paths we're attempting to open, a problem is becoming clear: something is searching the whole Ruby $LOAD_PATH looking for two libraries: enc/utf16_le and enc/utf16_be (with suffixes of .bundle, .so.bundle, and .so.rb). And our $LOAD_PATH, like those in many other Rails projects, is pretty big:

$ bundle exec ruby -e 'puts $LOAD_PATH.size'
236

So every time whatever-it-is looks for these shared library files, it might have to look in 236 * 6 = 1416 places before finding the file it's looking for. We'd only have to do this ~70 times per second to account for the 100k total open attempts per second that we've observed.
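Spelled out with the rough numbers above:

 236 load-path entries * 6 filename variants  = 1,416 open attempts per failed lookup
 100,000 opens per second / 1,416 per lookup  ≈ 70 failed lookups per second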

Where could this file actually live, and why isn't it on the $LOAD_PATH?

$ find ~/.rbenv/versions/1.9.3-p194 -name "utf16_be*"
$ find ~/.rbenv/versions/1.9.3-p194 -name "utf16_le*"

Nothing—these files just don't exist! Maybe there's a directory called enc where the files are supposed to live?

$ find ~/.rbenv/versions/1.9.3-p194 -name "enc" -type d
/Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc
$ ls /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc
big5.bundle euc_jp.bundle gb2312.bundle iso_8859_11.bundle iso_8859_16.bundle iso_8859_5.bundle iso_8859_9.bundle trans utf_32le.bundle
cp949.bundle euc_kr.bundle gbk.bundle iso_8859_13.bundle iso_8859_2.bundle iso_8859_6.bundle koi8_r.bundle utf_16be.bundle windows_1251.bundle
emacs_mule.bundle euc_tw.bundle iso_8859_1.bundle iso_8859_14.bundle iso_8859_3.bundle iso_8859_7.bundle koi8_u.bundle utf_16le.bundle
encdb.bundle gb18030.bundle iso_8859_10.bundle iso_8859_15.bundle iso_8859_4.bundle iso_8859_8.bundle shift_jis.bundle utf_32be.bundle

Whoa. The files are right there, at the end of the list! Why couldn't find find them before? Is find broken?! Take a closer look: utf_16be.bundle is the file that actually exists, and utf16_be.bundle is the file we're looking for. Could this be the issue? Let's try copying the files to get them in both locations:

$ cp /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc/utf{_16be,16_be}.bundle
$ cp /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc/utf{_16le,16_le}.bundle

Unfortunately, it doesn't seem to help much. We've got some successful file opens now, but we're still making a similar number of open calls per second, and the test suite isn't really any faster. I don't know what these files are supposed to do, so maybe the naming is somehow tied to each file's contents.
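If you want to double-check that split, a small variation on the earlier one-liners counts successful vs. failed opens. This is a sketch along the same lines, not output from the actual run:

$ sudo dtrace -n 'syscall::open:return /execname=="ruby" && arg0 == -1/ { @["failed"] = count(); } syscall::open:return /execname=="ruby" && arg0 != -1/ { @["succeeded"] = count(); } tick-1s { printa(@); trunc(@); }'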

That was kind of a silly plan anyway—we're not going to win any friends by asking developers to copy files around within their Ruby installation. What's next? I googled around a bit and also pulled down the psych source code to look for "utf16"-related things:

$ grep -r -i utf16_be . | wc -l
 0
$ git log -Sutf16_be -i --oneline | wc -l
 2
$ git log -Sutf16_be -i --oneline | head -n2
0845b07 * ext/psych/parser.c (transcode_string): fix encoding index names. Thanks markizko for reporting.
9eb1264 * ext/psych/parser.c: set parser encoding based on the YAML input rather than user configuration. * test/psych/test_encoding.rb: corresponding tests. * test/psych/test_parser.rb: ditto * test/psych/test_tainted.rb: ditto

Aha! No instances of utf16_be in the master branch, but git log -S reveals that it used to be there and was removed in mid-2012 by that "fix encoding index names" commit. What happens if we bump Psych to the latest version? I won't bore you with the details (and I haven't investigated deeply), but that breaks all the tests. Could we instead bump it to the earliest version that includes commit 0845b07, as a smaller step that's less likely to break compatibility? Bumping the Ruby patch-level version from 1.9.3-p194 to 1.9.3-p286 would have the same effect, because the Ruby distribution itself bundles Psych, and Ruby 1.9.3-p286 includes that Psych commit.
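For reference, the Psych-only route would be a Gemfile change along these lines, followed by a bundle update psych (the version constraint below is a placeholder: you'd want whichever psych release first includes commit 0845b07):

# Gemfile
gem 'psych', '>= x.y.z'   # placeholder version: the first release containing 0845b07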

Trying potential solutions

Since we have to pick one approach or the other, let's update our project's .ruby-version file to use 1.9.3-p286 (sketched below), restart the test suite under the updated Ruby version, and see if that does the trick.
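Assuming rbenv manages the project's Ruby (the .rbenv paths above suggest it does), the version switch itself is one line, plus a one-time install of the new patch level:

$ echo "1.9.3-p286" > .ruby-version
$ rbenv install 1.9.3-p286    # one-time, if this patch level isn't already installed

After a bundle install under the new Ruby, the suite runs again and the same system-call census looks very different: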

$ sudo dtrace -n 'syscall:::entry /execname=="ruby"/ { @[probefunc] = count(); } tick-1s { trunc(@, 10); printa(@); trunc(@);}'
dtrace: description 'syscall:::entry ' matched 491 probes
CPU ID FUNCTION:NAME
 0 278139 :tick-1s
 __pthread_sigmask 156
 lstat64 190
 sendto 310
 recvfrom 315
 write 502
 stat64 503
 madvise 841
 select 1176
 sigaltstack 2550
 sigprocmask 2550

 0 278139 :tick-1s
 __pthread_sigmask 71
 poll 98
 sendto 111
 recvfrom 113
 write 189
 lstat64 205
 select 387
 stat64 388
 sigaltstack 926
 sigprocmask 926

Now open doesn't even make it into the top 10 system calls, which is a heck of a lot better than seeing it dominate at 100k calls per second! Running the whole suite a couple of times under each version confirmed that this pays off in the real world:

          Baseline (1.9.3-p194)   Updated (1.9.3-p286)
 Trial 1  661s                    297s
 Trial 2  665s                    305s

So with a one-line configuration change, whether we decide to bump the Ruby patch-level version or to bump just Psych (as expected, both worked), we're the proud new owners of a five-minute test suite. Some teammates had even more dramatic results, with the speedup looking closer to 3x than "just" 2x. Waiting five minutes to know I haven't broken the tests is still longer than I'd like, but we've got a nice payoff for a few hours' investment.

Understanding the problem: it's the real thing

I want to circle back a bit and emphasize here that while the specific issue that caused the slowdown is interesting, much more interesting is that we went from a vague feeling of "these tests are too slow," to having some idea what resource(s) might be the bottleneck, to finding a large amount of unexpected work being done, to determining that work was unnecessary. And only at that last point, once we understood the problem, did it become appropriate to pick a solution. [3] I definitely could have started right off with my laundry list of slow Rails test suite antidotes; unfortunately, that was my initial instinct. But it would have taken an order of magnitude longer, I never would have found this particular issue, and judging by the numbers above, the end result almost certainly would have been worse than what we got here.

It's a lesson I'm still learning, but collecting actual data about the performance problem itself can give us a lot more insight (for cheaper) than trying out a bunch of solutions. And if you've got DTrace or something comparable, it sure makes gathering that data a lot easier and a lot more fun!

Notes

[1] In a typical DTrace presentation, ustack() works great even with languages that have managed runtimes, because the presenter is on a Solaris machine and not on OS X. And as I've come to learn, ustack helpers (the piece that translates VM stack frames into meaningful names) are the thing that exists on Solaris/illumos and not on OS X. You can see in our ustack() example above that many of the stack frames are things like ruby`vm_call_method+0x3bc—this tells us that some method was called, but not which one. It'd be nice to be able to get that information in ustack(), but alas, it doesn't seem to be possible on any OS X release.

[2] I haven't investigated this question more closely, but would love to hear from you if you know why this could be! I suspected at one point that there were flags being passed to open that avoided any errors, but a quick check with DTrace showed that there were no flags being set. On similar test runs under the same Ruby version (days later), I didn't see this behavior anymore, so there may be something intermittent involved.
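A check along those lines might look something like this (a sketch, not the exact command from the investigation): aggregate on open's second argument, the flags, for the same filtered set of calls, and see whether anything other than 0 (plain O_RDONLY) shows up.

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby" && strstr(copyinstr(arg0), "utf16_") != NULL/ { @[arg1] = count(); } tick-1s { printa(@); trunc(@); }'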

[3] Stuart Halloway gave a great talk on debugging at Clojure/conj this year (the first one I've missed attending in person, to my shame). I don't think it directly influenced this process, but I loved the talk and I'm probably unconsciously reframing some of the ideas Stuart outlines there—I highly recommend you check it out.