Ask DTrace: Why are my tests so slow?

We were in pain. We'd just joined a new Rails project, and the test suite took over 10 minutes to run. I've seen worse, but that was before I knew anything about DTrace. I wanted to know why it was so slow, and what it'd take to make it fast.

Last time, we saw a few neat things DTrace can do, but now we're going to see a practical example of how it can actually help you out day-to-day. To help motivate you, I'll skip right to the punch line: we cut the test suite's running time by better than half with a 1-line change, and DTrace was crucial in tracking down the cause of the slowness.

Now let's roll back in time to the beginning: we have a slow test suite, and I've got no shortage of theories as to what's making it slow. I've seen seconds or minutes shaved off Rails test suites by moving from factories to fixtures, avoiding slow ActiveRecord lifecycle hooks, turning off observers, decoupling from Rails entirely, and plenty of other strategies. Every Rails developer who cares about fast test feedback has tried these things. But lately I've been understanding more and more that just as with other hard problems, performance problems are best solved by first understanding the problem, before jumping to possible solutions. In walked DTrace.

First on the agenda: what resources are being consumed most heavily during the test run? Is it disk? network? CPU? memory? Brendan Gregg's USE method page for OS X lists some great tools to scratch the surface of performance problems.

All of the following command-line examples were performed with the test suite running, after the initial code-loading and startup stuff (that'd be interesting to explore too, but outside the scope of this investigation).

Where is the system spending time?

We're definitely using some disk and CPU:

$ iostat 1
          disk0           disk2       cpu     load average
    KB/t tps  MB/s     KB/t tps  MB/s  us sy id   1m   5m   15m
    5.62 311  1.70     0.00   0  0.00  20 14 66  1.93 1.99 2.15
   42.83 273 11.43     0.00   0  0.00  19 10 71  1.93 1.99 2.15
   30.04 112  3.27     0.00   0  0.00  21  9 70  1.93 1.99 2.15
   26.71 141  3.68     0.00   0  0.00  19 10 71  1.93 1.99 2.15

The first line shows that when the metrics were collected, the CPU was 20% user time (us), 14% system/kernel time (sy), and 66% idle (id).

It's hard to know how to interpret these numbers on my development machine, since there may be other things going on like system indexing, web browsing, automated backup, etc. But we have to start somewhere! The load average seems kind of high since I only have two processors on my laptop. Let's take a quick look at top, sorted by CPU usage, to make sure it's the test run that's causing the CPU load.

$ top -o cpu
Processes: 272 total, 3 running, 3 stuck, 266 sleeping, 1666 threads                                                                                                                                                                 09:41:13
Load Avg: 2.00, 2.01, 2.18  CPU usage: 26.63% user, 8.52% sys, 64.83% idle  SharedLibs: 1452K resident, 0B data, 0B linkedit. MemRegions: 56624 total, 3132M resident, 61M private, 412M shared.
PhysMem: 7024M used (1480M wired), 1166M unused. VM: 681G vsize, 1026M framework vsize, 13730951(0) swapins, 14769117(0) swapouts. Networks: packets: 42172600/23G in, 26937732/10G out. Disks: 26856753/266G read, 13907921/306G written.

PID    COMMAND      %CPU TIME     #TH   #WQ  #PORT MEM    PURG   CMPRS  PGRP  PPID  STATE    BOOSTS          %CPU_ME %CPU_OTHRS UID  FAULTS    COW      MSGSENT    MSGRECV    SYSBSD      SYSMACH    CSW         PAGEINS  IDLEW    POWE
76355  ruby         93.9 00:05.76 2/1   0    17    164M+  0B     0B     76355 67443 running  *0[1]           0.00000 0.00000    501  48477+    605+     146        62         1006130+    517+       2840+       0        0        93.9
280    mds_stores   33.2 37:16.48 6     4    66-   90M+   1516K  34M    280   1     sleeping *0[1]           0.00000 32.99901   0    28708816+ 50302    28020174+  13965778+  93394291+   25841994+  32170381+   1988248+ 89348    33.2
64     mds          5.1  62:11.95 10    7    607+  25M-   0B     33M    64    1     sleeping *0[1]           0.00000 3.78791    0    47422915+ 2888     100960224+ 51541514+  346518839+  80922925+  86856281+   3131     313515   5.1
350    Finder       1.4  04:30.63 9     3    388+  26M-   0B     38M    350   1     sleeping *0[7504+]       2.79131 0.00000    501  3820018+  2905     1165710+   321061+    10432182+   2981232+   3213172+    37538    11310    1.4
342    iTerm        1.4  40:25.64 9     1    605   36M    0B     48M    342   1     sleeping *0[22716+]      0.00000 0.09288    501  2567934+  85616    7940320+   1976360+   63129208+   21811017+  10492837+   8965     330277   1.4

Interestingly, while the total system-wide user + system CPU percentages from top add up to only ~35%, the CPU usage top shows for the Ruby process alone is almost 94%! See Brendan Gregg's Broken Performance Tools slides for more on how top and other tools can be misleading. Ultimately, the conclusion in our case is that our tests run on one processor at a time (of the two total), so we may indeed be CPU-bound here, even though an entire CPU (50% of the machine) sits idle. And since system time accounts for a decent share of the total CPU usage, I'm interested in which system calls are taking up the time.

Digging into the details

So let's use DTrace to see what system calls (syscall:::entry) are triggered by our Ruby process (/execname=="ruby"/), and sum the system call counts into a DTrace aggregation (@[probefunc] = count();) whose top five entries we can inspect every second before clearing it for the next second's results (tick-1s { trunc(@,5); printa(@); trunc(@);}). Check out the DTrace Guide if you're looking for a DTrace tutorial—I'll have more DTrace examples here but won't go into detail on the language. Here's what this looks like:

$ sudo dtrace -n 'syscall:::entry /execname=="ruby"/ { @[probefunc] = count(); } tick-1s { trunc(@,5); printa(@); trunc(@);}'
dtrace: description 'syscall:::entry ' matched 491 probes
CPU     ID                    FUNCTION:NAME
  0 278139                         :tick-1s
  fcntl_nocancel                                                 1102
  fstat64                                                        1102
  open_nocancel                                                  1102
  stat64                                                         1162
  open                                                         117372

  0 278139                         :tick-1s
  fcntl_nocancel                                                 1146
  open_nocancel                                                  1146
  fstat64                                                        1150
  stat64                                                         1301
  open                                                         122042

  0 278139                         :tick-1s
  fcntl_nocancel                                                 1056
  fstat64                                                        1056
  open_nocancel                                                  1056
  stat64                                                         1116
  open                                                         112358

Wow. There are lots and lots of open calls, around 100x the calls to anything else! There are over 100,000 calls to open every second! It seems like an awful lot. It might seem normal to open lots of files during startup, when language and library code is being loaded, but the only files I knew of that needed to be opened were the test files themselves—and we sure as heck weren't running anything approaching 100k test files per second.

OK, so what was causing all the open calls? The ustack() action doesn't always help [1], but in this case it worked OK for our purposes: the top hits all look related to something called psych.bundle.

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby"/ { @[ustack()] = count(); } tick-1s { trunc(@,3); printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
 0 278139                         :tick-1s
              libsystem_kernel.dylib`__open+0xa
              ruby`rb_require_safe+0x35a
              ruby`rb_protect+0xd0
              ruby`load_encoding+0x118
              ruby`rb_enc_find_index+0xa9
              psych.bundle`parse+0x1c6
              ruby`vm_call_method+0x3bc
              ruby`vm_exec_core+0x2583
              ruby`vm_exec+0x72
              ruby`rb_yield+0x63
              ruby`rb_hash_fetch_m+0x75
              ruby`vm_call_method+0x3bc
              ruby`vm_exec_core+0x2583
              ruby`vm_exec+0x72
              ruby`rb_yield+0x63
              ruby`rb_ary_collect+0x79
              ruby`vm_call_method+0x3bc
              ruby`vm_exec_core+0x2583
              ruby`vm_exec+0x72
              ruby`rb_yield+0x63
             7952

              libsystem_kernel.dylib`__open+0xa
              ruby`rb_require_safe+0x35a
              ruby`rb_protect+0xd0
              ruby`load_encoding+0x118
              ruby`rb_enc_find_index+0xa9
              psych.bundle`parse+0x1b7
              ruby`vm_call_method+0x3bc
              ruby`vm_exec_core+0x2583
              ruby`vm_exec+0x72
              ruby`rb_yield+0x63
              ruby`rb_hash_fetch_m+0x75
              ruby`vm_call_method+0x3bc
              ruby`vm_exec_core+0x2583
              ruby`vm_exec+0x72
              ruby`rb_yield_values+0x15d
              ruby`invoke_block_from_c+0x1e7
              ruby`rb_yield+0x63
              ruby`rb_ary_each+0x49
              ruby`vm_call0+0x1ed
              ruby`rb_iterate+0x168
            12496

              libsystem_kernel.dylib`__open+0xa
              ruby`rb_require_safe+0x35a
              ruby`rb_protect+0xd0
              ruby`load_encoding+0x118
              ruby`rb_enc_find_index+0xa9
              psych.bundle`parse+0x1c6
              ruby`vm_call_method+0x3bc
              ruby`vm_exec_core+0x2583
              ruby`vm_exec+0x72
              ruby`rb_yield+0x63
              ruby`rb_hash_fetch_m+0x75
              ruby`vm_call_method+0x3bc
              ruby`vm_exec_core+0x2583
              ruby`vm_exec+0x72
              ruby`rb_yield_values+0x15d
              ruby`invoke_block_from_c+0x1e7
              ruby`rb_yield+0x63
              ruby`rb_ary_each+0x49
              ruby`vm_call0+0x1ed
              ruby`rb_iterate+0x168
            12609

OK, so Psych is a YAML library (which I think I knew at some point). Now I'd like to know which files are being read. Maybe this is something to do with some configuration files or test fixtures?

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby"/ { @[copyinstr(arg0)] = count(); } tick-1s { trunc(@,10); printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0 278139                         :tick-1s
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_be.so.rb               64
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.bundle               64
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.rb               64
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.bundle               66
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.bundle               96
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.bundle               96
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.rb               96
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.bundle               96
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.rb               96
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.bundle               99

  0 278139                         :tick-1s
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_be.bundle               70
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_be.so.rb               70
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.bundle               70
  /Users/colin/Projects/clients/redacted/config/lib/enc/utf16_le.so.rb               70
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.bundle              102
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.bundle              102
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.bundle              105
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.rb              105
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.bundle              105
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.rb              105

Huh, that's pretty weird—I have no idea what those files are. There's nothing in the application code that refers to "utf16_le" or "utf16_be". In fact, on a closer look, it turns out these files don't actually exist—so these system calls must actually be failing. Since these all look pretty similar, what if we zoom back out and show the top 50 instead of only the top 10?

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby"/ { @[copyinstr(arg0)] = count(); } tick-1s { trunc(@,50); printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3 278139                         :tick-1s
  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/1.9.1/x86_64-darwin13.2.0/enc/utf16_be.so.rb               39
  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/1.9.1/x86_64-darwin13.2.0/enc/utf16_le.so.bundle               39
  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/enc/utf16_be.bundle               39
  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/enc/utf16_be.so.rb               39
  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/vendor_ruby/enc/utf16_le.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/api/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/api/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/api/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/app/assets/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/assets/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/assets/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/app/controllers/concerns/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/controllers/concerns/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/controllers/concerns/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/app/controllers/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/controllers/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/controllers/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/app/helpers/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/helpers/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/helpers/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/app/mailers/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/mailers/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/mailers/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/app/models/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/models/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/models/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/app/workers/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/app/workers/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/app/workers/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/lib/assets/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/lib/assets/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/lib/assets/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/lib/tasks/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/lib/tasks/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/lib/tasks/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/test/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/test/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/test/enc/utf16_be.so.rb               39
  /Users/colin/Projects/clients/redacted/vendor/enc/utf16_be.bundle               39
  /Users/colin/Projects/clients/redacted/vendor/enc/utf16_be.so.bundle               39
  /Users/colin/Projects/clients/redacted/vendor/enc/utf16_be.so.rb               39
  /usr/local/Cellar/rbenv-gem-rehash/1.0.0/enc/utf16_be.bundle               39
  /usr/local/Cellar/rbenv-gem-rehash/1.0.0/enc/utf16_be.so.rb               39
  /usr/local/Cellar/rbenv-gem-rehash/1.0.0/enc/utf16_le.so.bundle               39
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.bundle              114
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.bundle              114
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_le.so.rb              114
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.bundle              117
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.bundle              117
  /Users/colin/Projects/clients/redacted/lib/enc/utf16_be.so.rb              117

Same deal as before: the file doesn't exist in any of these locations where the Ruby app is searching. Let's get DTrace to see where the actual file eventually gets loaded from, by tracking the open calls with "utf16_" as part of the filename, where the return value is not -1 (the error value, from man 2 open). We can use a DTrace thread-local variable (self->location) to carry the filename across from the syscall entry to the return:

$ sudo dtrace -n 'syscall::open:entry /execname=="ruby" && strstr(copyinstr(arg0), "utf16_") != NULL/ { self->location = arg0; } syscall::open:return /self->location != NULL && arg0 > -1/ { @[arg0, copyinstr(self->location)] = count(); } tick-1s { printa(@); trunc(@);}'
dtrace: description 'syscall::open:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  1 278139                         :tick-1s
               20  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/rack-mount-0.8.3/lib/rack/mount/vendor/regin/enc/utf16_be.so.bundle                2

  1 278139                         :tick-1s

  1 278139                         :tick-1s
               20  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/rack-mount-0.8.3/lib/rack/mount/vendor/regin/enc/utf16_be.so.bundle                7

  1 278139                         :tick-1s

  1 278139                         :tick-1s
               20  /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/gems/1.9.1/gems/rack-mount-0.8.3/lib/rack/mount/vendor/regin/enc/utf16_be.so.bundle                4

  1 278139                         :tick-1s

  1 278139                         :tick-1s

  1 278139                         :tick-1s

OK, so this part is a little weird: that file location also doesn't exist, but for some reason open(2) isn't failing. I'm genuinely interested in knowing how this can be [2], but that location is the only one that shows up. Plus, since it definitely doesn't exist in reality, I'm starting to feel confident that the file isn't ever getting loaded from any of the locations that Ruby is trying to open. So I'm not going to dig into it any further for now.

Strange things are afoot

At this point, based on the list of paths we're attempting to open, a problem is becoming clear: something is searching the whole Ruby $LOAD_PATH looking for two libraries: enc/utf16_le and enc/utf16_be (with suffixes of .bundle, .so.bundle, and .so.rb). And our $LOAD_PATH, like those in many other Rails projects, is pretty big:

$ bundle exec ruby -e 'puts $LOAD_PATH.size'
236

So every time whatever-it-is looks for these shared library files, it might have to look in as many as 236 * 6 = 1416 places before giving up. At that rate, this search would only need to happen ~70 times per second to account for the 100k total open attempts per second that we've observed.
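To make that arithmetic concrete, here's a quick back-of-the-envelope sketch in Ruby. The numbers are the ones measured above on this particular project, not anything universal:

```ruby
# Model of the failed library search, using the numbers observed above.
# Each lookup of a missing library can probe every $LOAD_PATH entry with
# every candidate filename before giving up.
load_path_entries = 236              # from `puts $LOAD_PATH.size`
candidate_filenames = 2 * 3          # 2 libraries x 3 suffixes
opens_per_search = load_path_entries * candidate_filenames
puts opens_per_search                # => 1416

observed_opens_per_second = 100_000  # from the DTrace aggregation
puts observed_opens_per_second / opens_per_search  # => 70 searches/second
```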

Where could this file actually live, and why isn't it on the $LOAD_PATH?

$ find ~/.rbenv/versions/1.9.3-p194 -name "utf16_be*"
$ find ~/.rbenv/versions/1.9.3-p194 -name "utf16_le*"

Nothing—these files just don't exist! Maybe there's a directory called enc where the files are supposed to live?

$ find ~/.rbenv/versions/1.9.3-p194 -name "enc" -type d
/Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc
$ ls /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc
big5.bundle         euc_jp.bundle       gb2312.bundle       iso_8859_11.bundle  iso_8859_16.bundle  iso_8859_5.bundle   iso_8859_9.bundle   trans               utf_32le.bundle
cp949.bundle        euc_kr.bundle       gbk.bundle          iso_8859_13.bundle  iso_8859_2.bundle   iso_8859_6.bundle   koi8_r.bundle       utf_16be.bundle     windows_1251.bundle
emacs_mule.bundle   euc_tw.bundle       iso_8859_1.bundle   iso_8859_14.bundle  iso_8859_3.bundle   iso_8859_7.bundle   koi8_u.bundle       utf_16le.bundle
encdb.bundle        gb18030.bundle      iso_8859_10.bundle  iso_8859_15.bundle  iso_8859_4.bundle   iso_8859_8.bundle   shift_jis.bundle    utf_32be.bundle

Whoa. The files are right there, at the end of the list! Why couldn't find find them before? Is find broken?! Take a closer look: utf_16be.bundle is the file that actually exists, and utf16_be.bundle is the file we're looking for. Could this be the issue? Let's try copying the files to get them in both locations:

$ cp /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc/utf{_16be,16_be}.bundle
$ cp /Users/colin/.rbenv/versions/1.9.3-p194/lib/ruby/1.9.1/x86_64-darwin13.2.0/enc/utf{_16le,16_le}.bundle

Unfortunately, it doesn't seem to help much. We've got some successful file opens now, but we're still making a similar number of open calls per second, and the test suite isn't really any faster. I don't know what these files are supposed to do, so maybe the naming is somehow tied to each file's contents.

That was kind of a silly plan anyway—we're not going to win any friends by asking developers to copy files around within their Ruby installation. What's next? I googled around a bit and also pulled down the psych source code to look for "utf16"-related things:

$ grep -r -i utf16_be . | wc -l
       0
$ git log -Sutf16_be -i --oneline | wc -l
       2
$ git log -Sutf16_be -i --oneline | head -n2
0845b07 * ext/psych/parser.c (transcode_string): fix encoding index names.   Thanks markizko for reporting.
9eb1264 * ext/psych/parser.c: set parser encoding based on the YAML input   rather than user configuration. * test/psych/test_encoding.rb: corresponding tests. * test/psych/test_parser.rb: ditto * test/psych/test_tainted.rb: ditto

Aha! There are no instances of utf16_be on the master branch, but git log -S reveals that it used to be there and was removed in mid-2012 by the "fix encoding index names" commit. What happens if we bump the version of Psych to the latest? I won't bore you with the details (and I haven't really investigated deeply), but that breaks all the tests. Could we instead bump it to the earliest version containing commit 0845b07, as a smaller step that's less likely to break compatibility? Alternatively, bumping the Ruby patch-level version from 1.9.3-p194 to 1.9.3-p286 would have the same effect, because the Ruby distribution bundles Psych, and Ruby 1.9.3-p286 includes that Psych commit.
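The mechanism lines up with the ustack() output we saw earlier: rb_enc_find_index resolves known encoding names from a built-in table, and only falls back to require-ing enc/&lt;name&gt; off the $LOAD_PATH (via load_encoding) when the name is unknown. A misspelled name misses the table every single time. Here's a hedged illustration on a modern Ruby, where the unknown name simply raises rather than triggering the load-path search:

```ruby
# The correctly spelled name resolves instantly against Ruby's built-in
# encoding table; Psych's old misspelling is not a known name at all.
p Encoding.find("UTF-16BE")    # => #<Encoding:UTF-16BE>

begin
  Encoding.find("utf16_be")    # the pre-0845b07 spelling
rescue ArgumentError => e
  puts e.message               # "unknown encoding name - utf16_be"
end
```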

Trying potential solutions

Since we have to pick one approach or the other, let's update our project's .ruby-version file to use 1.9.3-p286, restart the test suite under the updated Ruby version, and see if that does the trick:

$ sudo dtrace -n 'syscall:::entry /execname=="ruby"/ { @[probefunc] = count(); } tick-1s { trunc(@, 10); printa(@); trunc(@);}'
dtrace: description 'syscall:::entry ' matched 491 probes
CPU     ID                    FUNCTION:NAME
  0 278139                         :tick-1s
  __pthread_sigmask                                               156
  lstat64                                                         190
  sendto                                                          310
  recvfrom                                                        315
  write                                                           502
  stat64                                                          503
  madvise                                                         841
  select                                                         1176
  sigaltstack                                                    2550
  sigprocmask                                                    2550

  0 278139                         :tick-1s
  __pthread_sigmask                                                71
  poll                                                             98
  sendto                                                          111
  recvfrom                                                        113
  write                                                           189
  lstat64                                                         205
  select                                                          387
  stat64                                                          388
  sigaltstack                                                     926
  sigprocmask                                                     926

Now open doesn't even make it into the top 10 system calls, which is a heck of a lot better than seeing it dominate at 100k calls per second! Running the whole suite a couple of times under each version confirmed that this pays off in the real world:

         Baseline (1.9.3-p194)   Updated (1.9.3-p286)
Trial 1  661s                    297s
Trial 2  665s                    305s

So with a one-line configuration change, whether we decide to bump the Ruby patch-level version or to bump just Psych (as expected, both worked), we're the proud new owners of a five-minute test suite. Some teammates had even more dramatic results, with the speedup looking closer to 3x than "just" 2x. Waiting five minutes to know I haven't broken the tests is still longer than I'd like, but we've got a nice payoff for a few hours' investment.
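For the record, the entire fix really is one line. Here's a sketch of applying it from Ruby—in practice you'd just edit the file by hand or run `rbenv local 1.9.3-p286`:

```ruby
# The one-line change: pin the project to a Ruby whose bundled Psych
# includes commit 0845b07 ("fix encoding index names").
File.write(".ruby-version", "1.9.3-p286\n")
puts File.read(".ruby-version")  # => 1.9.3-p286
```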

Understanding the problem: it's the real thing

I want to circle back a bit and emphasize that while the specific issue causing the slowdown is interesting, what's much more interesting is the process: we went from a vague feeling of "these tests are too slow," to some idea of which resources might be the bottleneck, to finding a large amount of unexpected work being done, to determining that the work was unnecessary. And only at that last point, once we understood the problem, did it become appropriate to pick a solution. [3] I certainly could have started right off with my laundry list of slow-Rails-test-suite antidotes, and unfortunately that was my initial instinct. But it would have taken an order of magnitude longer, I never would have found this particular issue, and judging by the numbers, the end result would almost certainly have been worse than what we got here.

It's a lesson I'm still learning, but collecting actual data about the performance problem itself can give us a lot more insight (for cheaper) than trying out a bunch of solutions. And if you've got DTrace or something comparable, it sure makes gathering that data a lot easier and a lot more fun!

Notes

[1] In a typical DTrace presentation, ustack() works great even with languages that have managed runtimes, but that's because the presenter is on a Solaris machine and not on OS X. As I've come to learn, ustack helpers are the piece that exists on Solaris/illumos but not on OS X. You can see in our ustack() example above that many of the stack frames are things like ruby`vm_call_method+0x3bc: this tells us that we called some method, but not what method was called. It'd be nice to be able to get that information in ustack(), but alas, it doesn't seem to be possible on any OS X release.

[2] I haven't investigated this question more closely, but would love to hear from you if you know why this could be! I suspected at one point that there were flags being passed to open that avoided any errors, but a quick check with DTrace showed that there were no flags being set. On similar test runs under the same Ruby version (days later), I didn't see this behavior anymore, so there may be something intermittent involved.

[3] Stuart Halloway gave a great talk on debugging at Clojure/conj this year (the first one I've missed attending in person, to my shame). I don't think it directly influenced this process, but I loved the talk and I'm probably unconsciously reframing some of the ideas Stuart outlines there—I highly recommend you check it out.

Colin Jones, Chief Technology Officer

Colin Jones is particularly interested in web security, distributed systems, performance, and functional programming.
