Characters seem like an easy concept to grasp at first glance. Java has a primitive type char and an object wrapper Character, and these are pretty close to what we mean when we talk about characters. However, these data types don't cover the full set of characters we might actually encounter in programs. There are good historical and compatibility reasons for this inconsistency, detailed in the Character javadocs, but I'd like to share a debugging story to highlight the difference between the Java types and what we usually mean when we say "character".
Recently I received a bug report for REPLy, where the character represented by "\ud800\udf30" [1], when pasted as input, would immediately exit the program (a nasty bug, to be sure). After a bit of yak shaving, Li-Hsuan and I traced the problem down to a class in Jline responsible for converting bytes from an input stream into characters. Java has a class java.io.InputStreamReader that's responsible for doing exactly this, but it wasn't being used, possibly for historical reasons. The class had been taken from the Apache Harmony codebase, and it correctly handled many non-ASCII UTF-8 characters. All of the characters that worked, however, were a single char in length. The broken ones were non-BMP characters: this was the crux of the problem.
BMP here isn't the image file format, as I learned, but rather the Basic Multilingual Plane. BMP characters fit in 2 bytes (values 0x0000-0xffff), so there's a good bit of coverage there. Characters from the Chinese, Thai, and even Mongolian scripts are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one in question won't be correctly handled by code that assumes a character fits into two bytes.
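To make this concrete, here's the standard UTF-16 surrogate arithmetic (nothing REPLy-specific): the two escapes \ud800 and \udf30 from the bug report combine into a single code point well above 0xffff.

```java
public class SurrogateDecode {
    public static void main(String[] args) {
        char high = '\ud800';  // high (leading) surrogate
        char low  = '\udf30';  // low (trailing) surrogate

        // Equivalent to: 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        int cp = Character.toCodePoint(high, low);
        System.out.printf("U+%04X%n", cp);  // prints U+10330, outside the BMP
    }
}
```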
Back to Java: a char is defined to hold 16 bits of data, and the Character type is just a wrapper around that. As we've just learned, that's precisely enough space to hold the full range of BMP characters. But when it comes to non-BMP characters, which have values greater than 0xffff, we're out of luck: a single char can't hold them, so UTF-16 encodes each one as a surrogate pair of two chars.
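A quick sketch of the mismatch, using the character from the bug report: it's one "real" character, but two Java chars.

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        // One semantic character, stored as a surrogate pair of two chars.
        String s = "\ud800\udf30";

        System.out.println(s.length());                      // 2 (chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 10330
    }
}
```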
The right approach is to bridge the language-level chars to semantic characters by using an API like Character.codePointAt(char[], int) to get one real character starting at a given index in the input array. The output is an int, which has enough room for these non-BMP characters. Impressively, the class with the bug was already using an array of chars; it just turned out to be using only one element (of a 1-char array) instead of allowing for multi-char characters. This ended up causing an error case where the alternate InputStreamReader would return -1 to indicate the end of the stream. The actual fix was easy, but as usual, understanding the context surrounding the issue was less so.
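As a rough sketch of that approach (not the actual Jline fix), here's how one might walk a char[] one semantic character at a time, pairing Character.codePointAt with Character.charCount to step past surrogate pairs:

```java
public class CodePointWalk {
    public static void main(String[] args) {
        // 'a' followed by the non-BMP character U+10330 (the pair from the bug).
        char[] buf = {'a', '\ud800', '\udf30'};

        int i = 0;
        while (i < buf.length) {
            int cp = Character.codePointAt(buf, i); // one real character, as an int
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);           // advance by 1 or 2 chars
        }
        // Three chars in the array, but the loop runs only twice:
        // U+0061, then U+10330.
    }
}
```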
As I write this, there are 417,000 Google results for "non-bmp character bug". This kind of confusion is definitely not limited to Java or even the JVM. If you're interested in learning more about how Unicode works, the Wikipedia article on Unicode character planes is as good a place to start as any. There's much more complexity here than I've described in this short post, but if this prevents at least one future mistaken assumption about the size of a character, I'm happy.
[1] I'll settle for linking the description here since something in my toolchain, likely either my browser or my font set, doesn't support it within this blog post.