Check Your Work

Unit test frameworks automated the tedious work of verifying bits and pieces of code by hand, replacing a haphazard process of trial, error, and judiciously placed print statements with clear, reproducible results.

Learning to use automated tests well has completely transformed the way I code. But sometimes as I watch the little green dots race across my terminal, that nagging question that plagues every programmer recurs: Can we do better?

Generative testing

Why not automate the tedious work involved in writing individual test cases? That's the idea behind generative testing frameworks like Haskell's QuickCheck, which generate random input and test against the expected properties of output instead of its expected value. As a complement to traditional unit tests, generative tests are a powerful tool for catching corner cases unit tests may miss and increasing confidence in a test suite.

For an example of generative testing in action, let's take a look at the Prime Factors Kata. Here's a familiar solution in Clojure—recursive trial division:

(defn trial-division [n divisor factors]
		(if (< n 2)
				factors
				(if (= 0 (mod n divisor))
						(recur (/ n divisor) divisor (conj factors divisor))
						(recur n (inc divisor) factors))))
(defn prime-factors [n]
		(trial-division n 2 []))

We can derive the same solution using both unit and generative tests. The following examples use Speclj for traditional unit tests and Simple-check for generative ones. (All the code from these examples is available as a Github gist if you'd like to try it yourself).

In both cases, the first step towards a solution is the simplest base case:


(describe "Prime factorization"
		(it "returns [] for 1"
				(should= [] (prime-factors 1))))

But from here out, the assertions diverge. A unit test isolates the next-simplest input and compares it with a single expected output:

(it "returns [2] for 2"
(should= [2] (prime-factors 2))))

Instead of including a single assertion, a Simple-check spec is a "higher-order" test: it generates random inputs to the prime-factors function and checks that every output satisfies some property that should always hold true. In practice, the test looks like this:


(def primes (load-primes "primes.txt"))
(check-that "A prime number's only prime factor is itself" 1000
		(prop/for-all [prime (gen/elements primes)]
				(= [prime] (prime-factors prime))))

A typical Simple-check test has two parts: a generator that creates random input according to some rule, and a property that describes some characteristic of the expected output. Here, load-primes reads in a list of the first 1000 prime numbers. The generator gen/elements chooses one of them.

The property described in the body of prop/for-all asserts that the prime factorization of any randomly selected prime should be a vector containing the same prime. We're testing the same concept behind the unit test, but notice that "1000" next to the spec description: this test generates the equivalent of 1000 assertions instead of just one.

Generators can return basic datatypes, collections, and combinations thereof. They are composable, so it's easy to generate complex inputs that satisfy specific requirements with a few concise lines of code.

Here are the next few tests and their generative equivalents. In each case, the generative test tries to capture a general rule rather than a specific expectation:

Testing [2 2] is a step towards simple factorization.

(it "returns [2 2] for 4"
				(should= [2 2] (prime-factors 4)))

Why not check two times any prime?

(check-that "Factoring 2 * a prime returns 2 and the prime" 1000
												(prop/for-all [prime (gen/elements primes)]
																										(= [2 prime] (prime-factors (* 2 prime)))))

Testing nine is the first case that returns two primes greater than two.

(it "returns [3 3] for 9"
				(should= [3 3] (prime-factors 9)))

But we want it to work similarly for all primes:

(check-that "Factoring the product of 2 primes returns both primes" 1000
												(prop/for-all [prime-one (gen/elements primes)
																											prime-two (gen/elements primes)]
																										(= (sort [prime-one prime-two])
																													(prime-factors (* prime-one prime-two)))))

The three Simple-check specs above contain the equivalent of 3000 assertions, and we've only written 11 lines of code!

Fail better

Simple-check does more than just generate tests. It also helps generate more useful failures. The following specs both contain a subtle typo. In the first, I forgot a 2 among the jumbled multiplication:

(it "returns [2 2 2 3 3 5 13 23 131] for 14100840"
				(should= [2 2 2 3 3 5 13 23 131]
													(prime-factors (* 3 5 2 23 2 3 13 131))))

Here, I forgot that prime-factors returns a sorted vector when comparing against prime-seq:

(check-that "Factoring the product of a sequence of primes returns
	the primes in sorted order" 1000
(prop/for-all [prime-seq (gen/vector (gen/elements primes))]
(= prime-seq
(prime-factors (apply * prime-seq))))))

The failed unit test returns something like the following. The missing 2 is hard to spot.

Expected: [2 2 2 3 3 5 13 23 131]
got: [2 2 3 3 5 13 23 131] (using =)

Simple-check goes a step further. In addition to returning the failed case, it tries to 'shrink' the input and find the minimal case that fails the spec by constructing and searching a tree of possible inputs. For example, the failing spec above returns a map like this:

{:result false,
	:failing-size 3,
	:num-tests 4,
	:fail [[3697 1949 5479]],
	:shrunk
	{:total-nodes-visited 81,
		:depth 15,
		:result false,
		:smallest [[3 2]]}}

Here, :fail shows the failed test case and :smallest shows the result of the shrinking step. In this example, four test cases passed before the output [3697 1949 5479] violated the property we declared: (prime-factors (* 3697 1949 5479)) did not return the vector [3697 1949 5479]. After shrinking this input, the 'smallest' failing test case is the vector [3 2], which highlights the error in the spec: prime-seq should have been in sorted order.

Generating thousands of test cases is overkill for a bite-size problem like this one, but it can be useful in the real world. Generative tests excel at catching complex edge cases in concurrent processes and distributed systems—see John Hughes, author of the original Haskell QuickCheck, describing his heroic debugging of concurrent writes to Riak clusters and race conditions in Erlang's Mnesia database.

Plus, generative tests encourage a different way to reason about code. As Hughes puts it in the original QuickCheck paper: "one of the major advantages of using QuickCheck is that it encourages us to formulate formal specifications, thus improving our understanding of our programs."

Of course, the benefits of generative tests don't come without costs. Runtime scales roughly linearly, so that impressive set of properties may take 1000 times as long to finish as one-off assertions that drive the same code. And they aren't always the best solution for test-driven development, especially when exploring a new problem space.

In preparing this post, I started with the Coin Changer Kata, assuming that its tricky edge cases would be a good match for generative tests. But after writing a few Simple-check tests, I was stuck. I knew the specific case I was trying to describe—changing 80 cents with denominations of 25, 20, and 1 should return four 20 cent coins instead of three quarters and five pennies—but I was left scratching my head when I tried to generalize it.

Describing invariants for all inputs is hard, especially when one tries to describe a function that's not even finished. But a single counterexample in a unit test is enough to drive a TDD solution forward.

Trust, but verify

So can we do better? I think so, but by complementing unit tests instead of replacing them. "Show your work" and "check your work" are different concepts, and automated tests combine both in different ways.

Before learning TDD, I thought testing was mostly about proving correctness. I've since learned that tests do much more. Unit tests provide a model of the way we think our programs work, a method that drives better design, and ideally, a record of the steps along the way to a solution. They skew towards "show your work"—providing a fast, lightweight record of reasoning that happens to blow up when things break.

Generative tests provide a robust way to "check your work" once the substructure of unit tests are in place. But although they're a powerful tool, even exhaustive testing is not a proof of correctness. No matter how extensive, tests will never be enough to know a system works properly. (And just try pitching a formal proof of your Rails controllers to the rest of your team). But that's okay—it's the process of subjecting code to tests that matters more than proving the results.

By unraveling "check your work" from "show your work" when thinking about the strengths and trade-offs of different test methodologies, we can test more code more reliably. We may already test all the things we can think of. But we should aspire to test all the things we haven't thought of, too.

Generative testing

Fail better

Trust, but verify

Generative testing resources

Empowering Game Developers with Faster Builds and Better Tools

Essential Strategies for Upgrading Microservices Without Downtime

Getting Started with TDD: A Practical Guide to Beginning a Lasting Practice