Don't Make Me Remember Things

Rob Looby

January 28, 2016

As a developer, you have a seemingly endless number of things to remember: everything from the intricacies of the language you’re using to the domain knowledge of your current project. Just about everything can be readily looked up, but if you look up every little thing you’ll never get anything done (and you’ll likely annoy your pair). A good working knowledge of the common functions in your language’s standard library is usually expected. But is everything in even the “standard” library easy to remember?

Have the creators and maintainers of that library made choices that make it easy for you, the developer using it, to avoid bugs? I want to take a look at the choices that were made in designing a simple, common standard library function: finding the index of a substring in a string. What is a developer expected to remember in order to properly use this function? And how easy is it for them to forget and introduce a bug?

Let’s start with Ruby. In Ruby, the String#index method returns the 0-based index of the substring if it is found, or nil if it is not found. [1] Returning nil for the negative case is common in Ruby and makes sense in some contexts (for example, it is “falsy” in boolean checks). In many contexts, though, it is just a special value that means “not found.” The function could just as easily return a scream-cat-emoji with no loss of meaning.

Having this special value puts the responsibility on the developer to remember that the method could return nil, and that this case likely has to be handled differently than the case where the substring is found. A search for the index that isn’t paired with a nil check is likely a bug waiting to happen, but nothing other than your own memory will keep you from writing one. Sure, you should probably have a test for that case, but that just means you have to remember to write that test case, so we’re back where we started.

Some would argue that the problem with the behavior above is caused by Ruby’s dynamic type system: nil is a different type than an integer and thus has to be handled differently. In this mindset it makes no sense to return these different things from a single function, and static typing would have helped the developer catch such a bug. So let’s take a look at the same function in a popular statically typed language, Java.

In Java, the String#indexOf function returns the 0-based index of the substring if it is found (as in Ruby), but returns -1 if the substring is not found. Here the type of the return value is the same either way, but -1 is still just a special value meaning “not found.” The only thing special about -1 is that it is less than 0 (so is -2 but I guess -1 is easier to remember). We’ve traded an index.nil? check for an index == -1 check, but the type system is doing little to help us avoid such bugs.
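To make the Java case concrete, here is a minimal sketch of how a forgotten -1 check can slip through without any complaint from the compiler or the runtime. The class and method names (SentinelDemo, afterColonUnchecked, afterColonChecked) are made up for illustration; only String#indexOf and String#substring are real JDK methods.

```java
// Illustrative sketch only: SentinelDemo and its methods are hypothetical names.
final class SentinelDemo {

    // Forgets the "not found" case. When there is no colon, indexOf returns -1,
    // so substring(-1 + 1) is substring(0) and the whole input comes back
    // instead of an error.
    static String afterColonUnchecked(String s) {
        int i = s.indexOf(":");
        return s.substring(i + 1);
    }

    // Remembers the sentinel and handles it explicitly.
    static String afterColonChecked(String s) {
        int i = s.indexOf(":");
        if (i == -1) {
            return ""; // assumption: "no value" is represented as an empty string
        }
        return s.substring(i + 1);
    }
}
```

Both methods have the same signature and both compile cleanly. The only difference is whether the author remembered what -1 means.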

Really what I’m talking about here is being able to look at my code and easily know what it is doing and that it is doing it correctly, without bugs. Some would call this “reasoning about” their code, and assert that functional languages give them greater power to perform this type of analysis. Let’s take a look at Clojure, which is a functional language with a dynamic type system. The traditional way to find the index of a substring in Clojure is (.indexOf "string" "substring"), which just calls the same Java String#indexOf method discussed above (which should make Java developers more comfortable). However, as of Clojure 1.8 the clojure.string namespace has an index-of function that behaves the same as the String#index method in Ruby, returning nil when the substring is not found (which is sure to trip up Java and Clojure developers alike). Neither one of these helps me reason about my code or does anything to help me avoid writing bugs.

Trying again, let’s take a look at PureScript, which is a statically typed functional language similar to Haskell. [2] PureScript does have an indexOf function, and it has a return type of Maybe Int. This encodes in the type signature that the substring may not be found. The code will not even compile if the return value is used in a way that does not clearly handle both cases: Just <some int> when the substring is found and Nothing when it is not. The developer is of course free to take those values and use them or ignore them, but they cannot forget to handle the Nothing case in some way.

Nothing prevents other statically typed languages from implementing this function the same way. Making this function return an integer rather than some optional type is a decision made by the creators of that standard library. In doing so, they’ve added one more thing to the ever-growing list of things someone using their language must remember, and one more place to introduce a bug if they forget.
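As a rough illustration of that point, here is what an optional-returning version might look like in Java, using the JDK’s java.util.OptionalInt. SafeIndex, its indexOf wrapper, and describe are hypothetical names for this sketch, not part of any standard library.

```java
import java.util.OptionalInt;

// Illustrative sketch only: SafeIndex is a hypothetical wrapper, not a JDK class.
final class SafeIndex {

    // Wraps String#indexOf so that "not found" shows up in the return type
    // instead of hiding in a magic -1.
    static OptionalInt indexOf(String haystack, String needle) {
        int i = haystack.indexOf(needle);
        return i >= 0 ? OptionalInt.of(i) : OptionalInt.empty();
    }

    // A caller has to unwrap the result before it can be used as an int,
    // which is a natural prompt to deal with the empty case.
    static String describe(String haystack, String needle) {
        OptionalInt idx = indexOf(haystack, needle);
        return idx.isPresent() ? "found at " + idx.getAsInt() : "not found";
    }
}
```

This is weaker than the PureScript version: Java will still let a caller invoke getAsInt() without checking, and that fails at runtime with a NoSuchElementException rather than at compile time. Even so, the return type no longer lets the result pass as an ordinary index, so there is one less thing to remember.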

Try to keep this in mind when designing a library. How much is someone using your library expected to remember? If they forget, how easy is it for them to catch their mistake?

Footnotes

[1] That is, of course, if nobody has hacked open the String class and redefined it. This may sound like a joke to some of you, but ask around to any Ruby developers you know who have been on legacy projects. When someone gets the thousand-yard stare instead of laughing, you’ll know what I’m talking about.

[2] I originally wanted to use Haskell here, but Googling for “Haskell string indexOf” doesn’t turn up much. The most promising link is a post on Quora asking “How do I find the index of a substring in Haskell?” The first answer just asks the original poster if perhaps they are looking for the Knuth-Morris-Pratt algorithm, which finds the answer in O(n+m) rather than the naive O(n^2). The second answer suggests that the original poster probably doesn’t need this function at all and should start thinking in terms of higher level functions. It turns out strings in Haskell are really just lists of characters. Why they stopped there when characters are just numbers and numbers are just 1s and 0s is left unanswered. The point relevant to this post is no such function exists in the standard library and the user is left to write their own. I’d suggest taking a look at the Knuth-Morris-Pratt algorithm, which finds the answer in O(n+m) rather than the naive O(n^2).