Donal Fellows

"Should one of the most popular encodings, UTF-16, be considered harmful?"

Quite possibly, but the alternatives shouldn't necessarily be viewed as much better.

The fundamental issue is that there are several distinct concepts in play: glyphs, characters, code points and byte sequences. The mapping between each of these is non-trivial, even with the aid of a normalization library. (For example, some characters in European languages that are written with a Latin-based script are not written with a single Unicode code point. And that's at the simpler end of the complexity!) This means that getting everything correct is amazingly difficult; bizarre bugs are to be expected (and instead of just moaning about them here, tell the maintainers of the software concerned).
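To make the glyph-versus-code-point distinction concrete, here is a minimal sketch in Java (chosen only because its String type happens to store UTF-16 code units): the glyph "é" is written once as the single precomposed code point U+00E9 and once as "e" plus a combining acute accent, and the two only compare equal after normalization.

```java
import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        // "é" as a single precomposed code point: U+00E9
        String composed = "\u00E9";
        // The same glyph as a base letter plus combining accent: U+0065 U+0301
        String decomposed = "e\u0301";

        // Same glyph on screen, different code point sequences underneath.
        System.out.println(composed.equals(decomposed)); // false
        System.out.println(composed.length());           // 1
        System.out.println(decomposed.length());         // 2

        // Normalizing both to NFC makes them directly comparable.
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b));                  // true
    }
}
```

And that is just one combining sequence in one script; correct comparison, searching and collation across the whole of Unicode is far harder still.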

The only way in which UTF-16 can be considered to be harmful as opposed to, say, UTF-8 is that it has a different way of encoding code points outside the BMP (as a pair of surrogates). If code wants to access or iterate by code point, it needs to be aware of that difference. OTOH, it does mean that a substantial body of existing code that assumes "characters" can always fit into a two-byte quantity (a fairly common, if wrong, assumption) can at least continue to work without rebuilding it all. In other words, at least you get to see those characters that aren't being handled right!
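As an illustration of the surrogate issue, here is another minimal Java sketch (again, only because java.lang.String is UTF-16 under the hood): a single code point outside the BMP, U+1D11E MUSICAL SYMBOL G CLEF, occupies two 16-bit code units, so naive char-by-char iteration sees the two surrogates rather than one character.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E lies outside the BMP, so UTF-16 encodes it as a
        // surrogate pair of two 16-bit code units.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                          // 2 code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point

        // Iterating by char sees the surrogates individually...
        for (int i = 0; i < clef.length(); i++) {
            System.out.printf("char[%d] = U+%04X%n", i, (int) clef.charAt(i));
        }

        // ...so surrogate-aware iteration is needed to see whole code points.
        clef.codePoints().forEach(cp -> System.out.printf("code point U+%X%n", cp));
    }
}
```

Note the upside mentioned above: the char-by-char loop still runs; it just shows you the surrogates, which is how you find out that a piece of code isn't handling such characters right.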

I'd turn your question on its head and say that the whole damn shebang of Unicode should be considered harmful and everyone ought to use an 8-bit encoding, except I've seen (over the past 20 years) where that leads: horrible confusion over the various ISO 8859 encodings, plus the whole set used for Cyrillic, and the EBCDIC suite, and… well, Unicode for all its faults beats that. If only it weren't such a nasty compromise between different countries' misunderstandings.
