Flaky crusts: Good for pies; bad for test suites

Flaky crusts: Good for pies; bad for test suites

Colin Jones

October 22, 2014

Unit tests that fail intermittently, without exposing any production bugs, are the worst. Like the boy who cried wolf, they reduce our confidence that they're telling us the truth. Without that confidence, test suite failures become everyday occurrences, and when a real bug rears its ugly head, we may miss it.

In this post, we're going to look at, and learn to avoid, one of the most annoying and time-wasting causes of flaky tests: test pollution. With some varieties of flaky tests, we can look at the failing build, look at the test, and spot the error right away. Unfortunately, that's not often the case with pollution. We're talking here about operations performed in one test that affect the outcome of other tests. In order to avoid pollution, every test should be self-contained and unaffected by whatever happens in other tests.

Of course, most testing libraries make every effort to protect us from these kinds of mistakes. JUnit gives us a fresh instance of the test class for every test, preventing us from changing fields in one test and having those changes propagate to the next one. RSpec's mock object library resets all the mocks and stubs before running each test.

But we application programmers still can and do shoot ourselves in the foot, by finding some shared mutable state to muck up.

Class variables (aka global variables)

In object-oriented languages, this shared mutable state is often held in class variables:

class System
		class << self
				attr_accessor :current_user

class Authentication
		def login_user(user)
				if user.valid_credentials?
						System.current_user = user
						System.current_user = nil

And as we all know, class variables are globals, and global variables have some issues. What harm can they cause here? Take the following tests, for example, each reasonable on its own:

# spec/system_spec.rb
require 'rspec'
require 'system'

describe System do
		describe '.current_user' do
				it 'is nil by default' do
						expect(System.current_user).to eq(nil)
# spec/authentication_spec.rb
require 'rspec'
require 'authentication'
require 'system'
require 'user'

describe Authentication do
		before do
				@auth = Authentication.new
				@valid_user = User.new(:login => "guest",
				:password => "guest")
		describe '#login_user' do
				it 'logs a user in' do
						expect(System.current_user).to eq(@valid_user)

Run these two tests individually, as many times as you want, and they'll always pass. Run them in your test suite, and they'll always fail! Run them in a random order, and you'll fail or pass sometimes, depending on the ordering between them. These tests read and clobber, respectively, the shared mutable state in System.current_user. So unless we take explicit action when tearing down test #1, or when setting up test #2, we're dependent on test ordering to determine whether our test suite succeeds or fails.

Now that we've found the bug, the fix is clear: reset any shared mutable state. We could do this either as part of the test setup or teardown—I don't particularly care which, as long as it's consistent.

Incidentally, things like thread-local variables can have exactly the same issues in environments where threads come from a thread pool (which is to say, most environments). A newly-checked-out thread, without some tweaking, will still have the state from its previous use. So don't be fooled into thinking you're safer using thread-local variables than class variables.

External data

External datastores or services are another obvious locus of test pollution. There are a few fine strategies out there for tearing down database state after tests: running the whole test in a transaction, truncating the tables after the test run, etc. If your continuous integration build allows multiple concurrent test runs to happen against the exact same database, you're in a similar situation. This kind of test pollution is across test suites, which is even nastier because any given test suite may always pass when run alone, but then fail in CI. If this sort of setup is necessary, only the transaction solution will work, and even then only if the database isolation levels allow one suite's activity to be invisible to the other. Of course, if those isolation levels don't match what's in production, there may be conflicting goals here.

In some test suites, the main database may be safe to change during tests due to some automatic teardown, but other datastores may be problematic. Perhaps there are interactions with a cache, a message queue, another local internal application, a third-party API, or even the filesystem. If we're following Michael Feathers's definition of a unit test, we won't call these tests "unit tests." Nevertheless, some tests are bound to interact with these kinds of external stores, and those tests are at risk of pollution if we're not careful with state.

State hangs out in weird places

If you're using Clojure, all this probably sounds obvious. We've heard over and over that shared mutable state is a problem, and we know that we need to be using it only in disciplined ways. We avoid a lot of these problems by default in Clojure, but there are some non-obvious places where it can rear its ugly head. Check this out:

(defn find-organization [id]
		(or (db/find *db-connection* "organizations" id)
(def get-organization (memoize find-organization))

This memoized function was written with the (probably naive) assumption that the organizations in the database are static, never changing while the production app is running. This lets us skip database lookups when we'd already made them. More concretely, get-organization has a memoization cache that is never cleared. Our assumption may actually (in rare cases) have been valid for our production code, but a good set of tests is likely to want to cover multiple possible states of the database.

So we aren't using the typical state atom or ref that you'd usually look for in Clojure when you're hunting test pollution. But nevertheless, a test could clear out or with-redefs away all the database state all it wanted, and it could still get results from a previous test run if the test result is dependent on the return value from get-organization.

How do you find the pollution?

So you've got this huge test suite that fails some of the time and you suspect it might be due to test pollution. How do you track down the culprit(s)? A good first approach is to take a close look at the failing test(s), and pore over every interaction with shared mutable state there. Luckily, this is usually enough to track down what kind of pollution is going on.

But depending on how pervasive the interactions with shared mutable state gets, you first need to isolate a test ordering that causes the failure. Then do a binary search (backing up from the beginning) to see which test causes the pollution. One client I worked on had a handy script to do this for their test suite. I can imagine pathological cases that depend on 3 or more tests interacting with each other in a strange way that breaks the binary search approach, but in practice it's not something I've seen.

Test pollution is one of those rare but annoying things that loses developer-days all the time, both for the people trying to track the problem down and for the people waiting for a green build to be able to merge their code to the master branch. If we take a bit more care when interacting with shared state, both in our test suites and production code, we'll save ourselves quite a bit of pain and frustration.