Riak, MapReduce, and large numbers

Myles Megyesi

January 18, 2013

The Riak database provides a great MapReduce framework that lets you execute complex queries while leveraging the parallel processing power of distributed systems. However, I recently ran into some very strange behavior with large numbers while writing a Javascript MapReduce query.

The problem

To expose this strange behavior, let's first create some data that we can play with.

(Note: for the sake of clarity, all of the calls to Riak shown below use the Riak HTTP API directly, but they could just as easily be made from one of the Riak client libraries.)

$ curl -X PUT -H "Content-Type: application/json" "http://localhost:8098/riak/test_bucket/test-key1?returnbody=true" -d '{"foo": -9223372036854775808}'

{"foo": -9223372036854775808}

$ curl -X PUT -H "Content-Type: application/json" "http://localhost:8098/riak/test_bucket/test-key2?returnbody=true" -d '{"foo": 128}'

{"foo": 128}

$ curl -X PUT -H "Content-Type: application/json" "http://localhost:8098/riak/test_bucket/test-key3?returnbody=true" -d '{"foo": 9223372036854775808}'

{"foo": 9223372036854775808}

I have created three records: one with a very large negative number (-2^63), one with a small number, and one with a very large positive number (2^63).

Now that we have stored some test data, let's perform a MapReduce query on our test bucket. This query won't do anything except load up the data and return it.

$ curl -X POST "http://localhost:8098/mapred" -H "Content-Type: application/json" -d '{"inputs":"test_bucket", "query":[{"map":{"language":"javascript","name":"Riak.mapValuesJson"}}]}'

[{"foo":-9223372036854776000},{"foo":9223372036854776000},{"foo":128}]

Hmmmm...that doesn't look like our data. Just to make sure that the data is being stored correctly, let's retrieve our records from Riak without MapReduce.

$ curl -X GET "http://localhost:8098/riak/test_bucket/test-key1"

{"foo": -9223372036854775808}

$ curl -X GET "http://localhost:8098/riak/test_bucket/test-key2"

{"foo": 128}

$ curl -X GET "http://localhost:8098/riak/test_bucket/test-key3"

{"foo": 9223372036854775808}

Yep, they look just fine when retrieved by key. So, there must be a problem with the MapReduce query itself. After some more digging, I discovered that the Javascript language has some limitations on the Number type, making it a very unfriendly language for working with large numbers. Let's take a look.

$ nodejs
> var i = 9223372036854775808;
undefined
> i
9223372036854776000

Unfortunately, Javascript stores every number as a 64-bit IEEE 754 floating point value, which can only represent integers exactly up to 2^53. That is far smaller than 2^63, the number we are using, so once an integer that big passes through the Javascript runtime, the digits you stored are not necessarily the digits you get back: the value is mapped to the nearest representable float and serialized as the shortest decimal string that identifies it, which for 2^63 is 9223372036854776000. So, when we run our Javascript MapReduce query, our JSON record gets parsed by the Javascript runtime and our large number comes back rounded.
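
The 2^53 boundary is easy to demonstrate in the same nodejs session: adding one to 2^53 is simply lost, because the result cannot be distinguished from 2^53 itself.

$ nodejs
> Math.pow(2, 53)
9007199254740992
> Math.pow(2, 53) + 1
9007199254740992
> Math.pow(2, 53) === Math.pow(2, 53) + 1
true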

The solution(s)

So far, I've found two ways to avoid this nasty Javascript limitation.

Store your numbers as strings

Convert all your numbers to strings before they are saved and then back to numbers once they are loaded into the application. Obviously, this will add some overhead to saving and loading your records. However, if you need to use the stringified numbers in the MapReduce query to do some calculation, you will have to convert them back to numbers (at your own risk) or use a Javascript library that handles big integers.
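
As a rough sketch (not code from this project), suppose the records were saved with the number quoted, e.g. {"foo": "9223372036854775808"}. A custom Javascript map function can then pass the value through untouched, since strings survive the Javascript runtime with their digits intact:

function(value, keyData, arg) {
  // value.values[0].data holds the raw JSON we stored in Riak.
  var doc = JSON.parse(value.values[0].data);
  // doc.foo is still a string here, so no precision is lost; convert it
  // back to a number (or a big-integer type) in the application layer.
  return [doc.foo];
}

An anonymous function like this can be sent inline in the query by using the "source" field of the map phase in place of the "name" field used above.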

Use Erlang for MapReduce

Riak allows you to write your MapReduce queries in either Javascript or Erlang. If you know that you are going to be working with big numbers, it might be best to use Erlang, whose integers are arbitrary precision and do not suffer from this limitation.
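
For example, Riak ships with the riak_kv_mapreduce module, whose built-in map_object_value function returns the stored values without ever loading them into the Javascript runtime. It hands back the raw JSON strings rather than parsed objects, so the digits should come through untouched:

$ curl -X POST "http://localhost:8098/mapred" -H "Content-Type: application/json" -d '{"inputs":"test_bucket", "query":[{"map":{"language":"erlang","module":"riak_kv_mapreduce","function":"map_object_value"}}]}'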

Hopefully, one of these solutions will allow you to store big numbers in Riak and avoid having your data unexpectedly rounded.

Myles Megyesi

Principal Crafter

Myles Megyesi loves design patterns, functional programming, and popcorn. He is an experienced software crafter who enjoys writing software and nurturing its constant growth into something tangible. Throughout his career at 8th Light, he has fulfilled several long-term engagements, focusing on database performance and coordinating distributed services.