Case insensitive comparison: A lot more complicated than you might think!
Inspired by a discussion on the relative merits of Scala vs. Java, which brought up sorting a list of strings based on the result of a call to toLowerCase(), I offer you a trip around the world to convince you that string comparison is a lot more complex than that.
If you are a Java user and don’t want to read it all, the take-away lesson is simple: you should be using java.text.Collator, which is better than Collections.sort(list, String.CASE_INSENSITIVE_ORDER), which in turn is better than just calling toLowerCase() on everything.
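For the impatient, here is a minimal sketch of those three options applied to sorting. The class and method names are made up for illustration. On plain ASCII input all three agree; the rest of this post is about the input where they stop agreeing.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class SortThreeWays {
    // Weakest option: compare lowercased copies of each string.
    static List<String> byLowerCase(List<String> input) {
        List<String> out = new ArrayList<>(input);
        out.sort(Comparator.comparing((String s) -> s.toLowerCase(Locale.ROOT)));
        return out;
    }

    // Better: the comparator built into String (it cannot take a locale).
    static List<String> byCaseInsensitiveOrder(List<String> input) {
        List<String> out = new ArrayList<>(input);
        out.sort(String.CASE_INSENSITIVE_ORDER);
        return out;
    }

    // Best of the three: a locale-aware Collator, which is itself a Comparator.
    static List<String> byCollator(List<String> input, Locale locale) {
        List<String> out = new ArrayList<>(input);
        out.sort(Collator.getInstance(locale));
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("banana", "apple", "Cherry");
        System.out.println(byLowerCase(words));
        System.out.println(byCaseInsensitiveOrder(words));
        System.out.println(byCollator(words, Locale.ENGLISH));
    }
}
```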
To explain why string comparison is so much more nuanced, we’ll visit three interesting locales, each throwing a unique wrench in the works.
Example 1: The Eszett character
In German, a certain ligature for two consecutive s characters has
been used so much for so long that it’s considered its own
character: the Eszett or scharfes S, ß. It shows up all over the
place in German, for example in the word for street, which is straße.
The Eszett is unusual in that it has no capital form.
Words don’t start with double s, so there has never been a reason to
capitalize it. Instead, if you ask a German to write street in all
caps, they’d write STRASSE.
Uh-oh. That’s not a reversible operation! The lowercase form of
STRASSE is strasse, and even knowing that the locale is German,
you can’t just guess that double-s ALWAYS becomes the ß character. You’d
need a dictionary. Arguably you’re better off doing that conversion
than not doing it if you know it’s German (it’s the more likely
spelling), but it won’t be a sure thing. Nevertheless, checking whether
straße and STRASSE are equal under case insensitive comparison
should return true, not false. That is achievable, but
using .toLowerCase() as a proxy for case insensitive comparison is
never going to get us there.
Java actually gets this right, sort of. Trying to uppercase straße will
in fact produce STRASSE even when just using .toUpperCase(); you don’t
have to specify Locale.GERMANY. You can experiment by using \u00DF, which
is the (lowercase) Eszett. There is such a thing as a capital Eszett in Unicode,
which nobody in Germany uses and is clearly a silly idea: You don’t hack a living,
breathing language to get around i18n problems. If that were feasible we might as well tell the
world to stop whining and only use English when using machines.
Here’s the problem: String.CASE_INSENSITIVE_ORDER does not in fact return
0 (indicating equality) when comparing straße and STRASSE, even
though every single language that uses the ß (German. That’s it!) says
those two are case-insensitive equal. So it isn’t perfect either, but that does not excuse the
brokenness of the ‘toLowerCase() is an alternative for case insensitive comparison’ meme.
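If you want to see this for yourself, here is a small sketch. The class and helper names are mine, and I pass Locale.ROOT explicitly so the result does not depend on your machine’s default locale.

```java
import java.util.Locale;

public class EszettDemo {
    // "strasse" spelled with the Eszett, U+00DF, written as an escape
    // so the source file encoding cannot get in the way.
    static final String LOWER = "stra\u00DFe";
    static final String UPPER = "STRASSE";

    // The naive approach: lowercase both sides, then compare.
    static boolean equalViaLowerCase(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    // The built-in comparator; 0 would mean "equal ignoring case".
    static boolean equalViaComparator(String a, String b) {
        return String.CASE_INSENSITIVE_ORDER.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // Uppercasing expands the Eszett into two letters.
        System.out.println(LOWER.toUpperCase(Locale.ROOT));     // STRASSE
        // Neither comparison considers the two spellings equal.
        System.out.println(equalViaLowerCase(LOWER, UPPER));    // false
        System.out.println(equalViaComparator(LOWER, UPPER));   // false
    }
}
```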
NB: My thanks go to dpark, who spotted a problem in an earlier version of this post where I erroneously reported that Java did in fact use the invented uppercase variant.
Example 2: The Turkish dotless/dotted i, and the somewhat famous “PHP breaks on Turkish computers” problem
The ß vs. SS issue indicates that toLowerCase() isn’t a valid
replacement for case insensitive comparison, but it gets worse. You
can’t, in fact, say anything whatsoever about casing, or case
insensitive equality, without knowing the locale you’re operating in.
In Turkish there’s not one ‘i’ but two: the dotted i and the dotless
ı. Where this gets crazy is the capital forms of those. The capital
form of ‘i’ is a dotted capital İ, and the capital form of dotless
‘ı’ is our normal, familiar capital I, which doesn’t have a dot on it.
This even messes with typesetting (fi is a common ligature where the tip of
the curvy top of the f is joined with the i’s dot. That’s not
appropriate in Turkish, where that dot is important). This means “i”
and “I” are not equal in case-insensitive comparison in the Turkish
locale. A plain toLowerCase() comparison gets this wrong, of course,
because it uses whatever the platform default locale happens to be.
toLowerCase(new Locale("tr")) would actually get this right.
String.CASE_INSENSITIVE_ORDER gets this wrong, because you can’t give
it a locale. For natural language comparison (and why are you upper/
lowercasing strings in the first place, if not because you’re doing
natural language comparison?), locale-less toLowerCase() (tLC),
locale-less toUpperCase() (tUC), and String.CASE_INSENSITIVE_ORDER
are all inherently broken. Java gets
this a little right and offers Locale-based variants of tLC() and
tUC(), as I hinted at earlier.
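A quick sketch of the Locale-based variants in action. The class and helper names are mine, and I build the locale with Locale.forLanguageTag("tr"), which does the same job as the constructor.

```java
import java.util.Locale;

public class TurkishIDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Compares two strings after lowercasing both under the given locale.
    static boolean equalIgnoringCase(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        // Under Turkish rules, capital I lowercases to dotless i (U+0131)...
        System.out.println("I".toLowerCase(TURKISH).equals("\u0131"));   // true
        // ...and small dotted i uppercases to dotted capital I (U+0130).
        System.out.println("i".toUpperCase(TURKISH).equals("\u0130"));   // true

        // So "i" and "I" are different letters in Turkish, but not elsewhere.
        System.out.println(equalIgnoringCase("i", "I", TURKISH));        // false
        System.out.println(equalIgnoringCase("i", "I", Locale.ROOT));    // true
    }
}
```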
That PHP we always like to bash on? One of the things PHP did wrong for the longest time (or perhaps still does, I don’t keep up with it) was that it would completely lose its marbles if the system’s locale was set to Turkish. It would just fail to work. PHP identifiers are defined as case insensitive, and PHP implemented this by toLowerCase()ing everything, using the platform default locale. This turns “FILE” into “fıle”, which is not equal to “file”, and thus most PHP code written in English breaks when run on a machine configured with a Turkish locale. This is a famous example of the universally lamented “it works for me” attitude. So, yes, making this mistake has dire consequences.
Example 3: ASCII hacks
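The same failure mode is easy to reproduce in Java. What follows is a hypothetical sketch, not PHP’s actual implementation: a made-up table of known identifiers, looked up after lowercasing under a locale that is passed in so you can see both outcomes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class IdentifierLookup {
    // Hypothetical table of identifiers, stored in lowercase.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList("file", "include"));

    // Case-insensitive lookup implemented by lowercasing under 'locale'.
    static boolean isKnown(String identifier, Locale locale) {
        return KNOWN.contains(identifier.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // With Turkish rules "FILE" becomes "f\u0131le", which is not in the table.
        System.out.println(isKnown("FILE", turkish));       // false
        // Identifiers are not natural language; a fixed locale keeps them stable.
        System.out.println(isKnown("FILE", Locale.ROOT));   // true
    }
}
```

For machine-facing text like identifiers, pinning the locale (Locale.ROOT here) is one reasonable way out, since the result no longer depends on how the machine is configured.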
To make matters worse still, let’s think about why one would want to do a case insensitive comparison in the first place. Presumably because there’s some user input of some sort that needs to be compared, and you don’t want to bother the user with case sensitivity. However, if that’s the aim, “case insensitive” is entirely the wrong idea. There are a bajillion systems around that can’t deal with Unicode. For example, if many ‘name’ forms don’t even accept a dash in your last name, how many do you think accept a “ü” in it? And yet plenty of German folks are called “Müller”. The ‘fix’ these poor saps have used for years and years is to write the canonical ASCII alternative for their funky character: ß becomes ss, ü becomes ue, etcetera. I’m betting that if you intend for “JOE” and “joe” to be equal, then you should consider “müller” and “mueller” equal too.
What’s really needed is a comparator for human-entered strings. Ideally
such a tool will first canonicalize each string, for example turning
dotless i of any capitalization into a dotted lowercase i,
ß into ss, ü into ue, etcetera, and only then compare the
strings. I’m not even sure this can be done properly without knowing a
locale, but you could do a lot better than
String.CASE_INSENSITIVE_ORDER that way, and far better still than
comparing the .toLowerCase() versions of any two given strings.
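To make the idea concrete, here is a deliberately tiny, hand-rolled sketch of such a canonicalizer. Everything in it (the class, the method names, the handful of mappings) is invented for illustration, and it only knows about the characters mentioned in this post.

```java
public class HumanInputComparison {
    // Folds a string into a lowercase key, expanding a few special characters.
    static String canonicalize(String input) {
        StringBuilder key = new StringBuilder(input.length() + 4);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u00DF': // Eszett
                case '\u1E9E': // the capital Eszett nobody uses
                    key.append("ss");
                    break;
                case '\u00E4': case '\u00C4': // a-umlaut, either case
                    key.append("ae");
                    break;
                case '\u00F6': case '\u00D6': // o-umlaut, either case
                    key.append("oe");
                    break;
                case '\u00FC': case '\u00DC': // u-umlaut, either case
                    key.append("ue");
                    break;
                case '\u0131': case '\u0130': // dotless i, dotted capital I
                    key.append('i');
                    break;
                default:
                    key.append(Character.toLowerCase(c));
            }
        }
        return key.toString();
    }

    // Two inputs count as the same if their canonical keys match.
    static boolean sameHumanInput(String a, String b) {
        return canonicalize(a).equals(canonicalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameHumanInput("M\u00FCller", "MUELLER"));   // true
        System.out.println(sameHumanInput("stra\u00DFe", "STRASSE"));   // true
    }
}
```

Note that this bakes German conventions into the mapping (a Turkish ü should not become ue), which is exactly the locale problem I mentioned.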
Java does in fact have something for this: java.text.Collator. This
indeed does more or less what I just described, and it is in fact what
one ought to be using. If I were you, I’d make a PMD plugin right now
that checks for usage of locale-less toLowerCase() and toUpperCase(), as well as
String.CASE_INSENSITIVE_ORDER, and flags them all as warnings.
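A minimal sketch of Collator used for equality rather than sorting. The helper name is mine. The strength setting is the important knob: at Collator.PRIMARY, case differences are ignored. How a given JDK treats pairs like straße/STRASSE or Müller/Mueller depends on its collation rules for the locale, so the sketch prints those results instead of promising them.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    // True if the locale's collation rules see no primary-level difference.
    static boolean looselyEqual(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // ignore case and accent differences
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(looselyEqual("JOE", "joe", Locale.GERMAN));   // true

        // Locale-dependent results: run these to see what your JDK decides.
        System.out.println(looselyEqual("stra\u00DFe", "STRASSE", Locale.GERMAN));
        System.out.println(looselyEqual("M\u00FCller", "Mueller", Locale.GERMAN));
    }
}
```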
Bonus Example: Lower case. Upper case. And Title case??
Astute folks may have observed that Character.toTitleCase() exists in the Java standard library.
What’s that, you ask? Well, in some languages, there are single
characters that look like 2 characters. For example the ‘dz’. That’s
one character in some languages. This is very similar to the Germans,
who have enshrined the much used ß ligature into a unique character.
However, unlike the ß, which has pretty much lost all resemblance to
the original characters, and which can never appear at the beginning of
a word, these characters are mostly just a fused version of the
originals, and CAN appear at the beginning of a word. In Unicode
speak these are called digraphs. These have 3 and not 2 capitalization
forms: all lowercase, all uppercase, and the first part of the digraph
uppercased with the second part lowercased. That last one is what you’d use when
you want a word with just the first letter capitalized, and you use
the all-caps version only if you need an all-caps rendering of the
word (i.e. rarely). This obscure little factoid actually was useful
for me when writing lombok: The method that turns “foo” into “getFoo”
will use toTitleCase() and not toUpperCase().
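Here is a small sketch of the difference, using the ‘dž’ digraph (U+01C6), whose titlecase form is U+01C5 and whose uppercase form is U+01C4. The toGetterName helper is a made-up illustration of the idea, not lombok’s actual code.

```java
public class TitleCaseDemo {
    // Hypothetical helper: "foo" -> "getFoo", titlecasing the first character.
    static String toGetterName(String fieldName) {
        if (fieldName.isEmpty()) {
            return "get";
        }
        return "get" + Character.toTitleCase(fieldName.charAt(0)) + fieldName.substring(1);
    }

    public static void main(String[] args) {
        char digraph = '\u01C6'; // small dz-with-caron digraph

        // Titlecase capitalizes only the first half of the digraph...
        System.out.println(Character.toTitleCase(digraph) == '\u01C5');  // true
        // ...while uppercase capitalizes both halves.
        System.out.println(Character.toUpperCase(digraph) == '\u01C4');  // true

        // For ordinary letters the two are the same thing.
        System.out.println(toGetterName("foo"));                         // getFoo
    }
}
```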
And there ends our trip round the world. I hope you enjoyed it. Though now you know: That feeling of despair when someone mentions i18n? You’re entirely correct in feeling it. It’s a pain in the tush!

