When is a character not a Character?

Colin Jones

April 11, 2013

Characters seem like an easy concept to grasp at first glance. Java has a primitive type char and an object wrapper Character, and these are pretty close to what we mean when we talk about characters. However, these data types aren't fully inclusive of the set of characters we might actually encounter in programs. There are good historical / compatibility reasons for this inconsistency, detailed in the Character javadocs, but I'd like to share a debugging story to highlight the difference between the Java types and what we usually mean when we say "character".

Recently I received a bug report for REPLy, where the character represented by "\ud800\udf30" [1], when pasted as input, would immediately exit the program (a nasty bug, to be sure). After a bit of yak shaving, Li-Hsuan and I traced the problem to a class in JLine responsible for converting bytes from an input stream into characters. Java has a class, java.io.InputStreamReader, that's responsible for doing exactly this, but it wasn't being used, possibly for historical reasons. The class had been taken from the Apache Harmony codebase, and it correctly handled many non-ASCII UTF-8 characters. All of the characters that worked, however, were a single char in length. The broken ones were non-BMP characters: this was the crux of the problem.
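To make the byte-to-char step concrete, here is a minimal sketch that decodes the UTF-8 bytes of the character from the bug report with the standard java.io.InputStreamReader. The class name Utf8Decode and its decode helper are made up for illustration and are not the JLine code; the byte values are the UTF-8 encoding of U+10330, the code point behind "\ud800\udf30".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not the JLine class).
class Utf8Decode {
    // Decodes UTF-8 bytes into Java chars, one read() call per char.
    static char[] decode(byte[] utf8) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            // read() returns one 16-bit char at a time, or -1 at end of stream.
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString().toCharArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The four UTF-8 bytes that encode U+10330.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x90, (byte) 0x8C, (byte) 0xB0};
        char[] chars = decode(utf8);
        System.out.println(chars.length);                    // 2
        System.out.println(Integer.toHexString(chars[0]));   // d800
        System.out.println(Integer.toHexString(chars[1]));   // df30
    }
}
```

One character in, two chars out: any decoder that hands back exactly one char per character has no way to represent this input.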

BMP here isn't the image file format, as I learned, but rather the Basic Multilingual Plane. BMP characters have code point values from 0x0000 to 0xffff, so each one fits in 16 bits (2 bytes), and there's a good bit of coverage there. Characters from the Chinese, Thai, and even Mongolian scripts are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. All the same, characters like the one in question won't be handled correctly by code that assumes every character fits into two bytes.
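As a rough illustration of that boundary, the sketch below asks the JDK which code points fall inside the BMP. The class name PlaneCheck and the plane helper are hypothetical; Character.isBmpCodePoint and Character.isValidCodePoint are standard (Java 7 and later). The sample code points are my own picks: one Chinese, one Thai, and one Mongolian character, plus the character from the bug report.

```java
// Hypothetical helper, for illustration only.
class PlaneCheck {
    // Returns the Unicode plane number of a code point; plane 0 is the BMP.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int chinese = 0x4E2D;    // a common Chinese character
        int thai = 0x0E01;       // a Thai letter
        int mongolian = 0x1820;  // a Mongolian letter
        int reported = 0x10330;  // the character from the bug report

        System.out.println(Character.isBmpCodePoint(chinese));    // true
        System.out.println(Character.isBmpCodePoint(thai));       // true
        System.out.println(Character.isBmpCodePoint(mongolian));  // true
        System.out.println(Character.isBmpCodePoint(reported));   // false
        System.out.println(plane(reported));                      // 1
    }
}
```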

Back to Java: a char is defined to hold 16 bits of data, and the Character type is just a wrapper around that. And as we've just learned, that's precisely enough space to hold the full range of BMP characters. But when it comes to non-BMP characters, which have values greater than 0xffff, we're out of luck.
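A small sketch shows the mismatch, using the string from the bug report. The class name CharVersusCodePoint and its two helpers are made up for illustration; String.length, String.codePointCount, and the Character surrogate checks are standard JDK methods.

```java
// Hypothetical demo class, for illustration only.
class CharVersusCodePoint {
    // The string from the bug report: one character, written as two chars.
    static final String NON_BMP = "\ud800\udf30";

    // Number of 16-bit char units in the string.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (semantic characters) in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charUnits(NON_BMP));   // 2
        System.out.println(codePoints(NON_BMP));  // 1
        // Each half is a surrogate, not a character in its own right.
        System.out.println(Character.isHighSurrogate(NON_BMP.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(NON_BMP.charAt(1)));   // true
    }
}
```

Java represents a non-BMP character as a pair of chars (a surrogate pair), so length() counts char units rather than characters.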

The right approach is to bridge the language-level chars to semantic characters by using an API like Character.codePointAt(char[], int) to get one real character starting at a given index in the input array. The output is an int, which is what we need to have enough room for these non-BMP characters. Impressively, the class with the bug was already using an array of chars; it just turned out to only be using one element (of a 1-char array) instead of allowing for multi-char characters. This ended up causing an error case where the alternate InputStreamReader would return -1 to indicate the end of the stream. The actual fix was easy, but as usual, understanding the context surrounding the issue was less so.
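Here is a minimal sketch of that approach, assuming the chars are already sitting in an array. It is not the actual JLine fix; the class name CodePointReader and the toCodePoints helper are hypothetical. Character.codePointAt(char[], int) and Character.charCount(int) are the standard JDK calls.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only (not the actual fix).
class CodePointReader {
    // Walks the buffer one semantic character at a time, consuming
    // one char for BMP characters and two for non-BMP characters.
    static int[] toCodePoints(char[] buf) {
        int[] out = new int[buf.length];
        int count = 0;
        for (int i = 0; i < buf.length; ) {
            int codePoint = Character.codePointAt(buf, i);
            out[count++] = codePoint;
            // charCount is 1 for BMP code points and 2 otherwise.
            i += Character.charCount(codePoint);
        }
        return Arrays.copyOf(out, count);
    }

    public static void main(String[] args) {
        char[] input = "a\ud800\udf30b".toCharArray();
        for (int codePoint : toCodePoints(input)) {
            System.out.println(Integer.toHexString(codePoint));
        }
        // prints 61, then 10330, then 62
    }
}
```

Four chars go in and three code points come out; advancing by Character.charCount is what keeps the index from landing in the middle of a surrogate pair.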

As I write this, there are 417,000 Google results for non-bmp character bug. This kind of confusion is definitely not limited to Java or even the JVM. If you're interested in learning more about how Unicode works, the Wikipedia article on Unicode character planes is as good a place to start as any. There's much more complexity here than I've described in this short post, but if this prevents at least one future mistaken assumption about the size of a character, I'm happy.

[1] I'll settle for linking the description here since something in my toolchain, likely either my browser or my font set, doesn't support it within this blog post.