12 events
Aug 13, 2015 at 16:25 history unlocked Thomas Owens
Aug 13, 2015 at 16:05 history locked CommunityBot
Jul 13, 2013 at 9:35 comment added user877329 @rmeador I would like to kill a widespread myth: UTF-8 is actually NOT compatible with any 8-bit based encoding. This is a fact that everyone in Europe should know. UTF-8 is backward compatible with US-ASCII, nothing more.
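The claim in the comment above can be checked with a minimal Python sketch. The helper names `same_bytes_as_utf8` and `decodes_as_utf8` are made up for illustration, and Latin-1 merely stands in for "some 8-bit encoding"; other 8-bit code pages should behave similarly for non-ASCII characters.

```python
def same_bytes_as_utf8(text, encoding):
    """True if `text` has identical bytes in UTF-8 and in `encoding`."""
    return text.encode("utf-8") == text.encode(encoding)


def decodes_as_utf8(raw):
    """True if the byte string `raw` is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII text is byte-for-byte identical in UTF-8.
print(same_bytes_as_utf8("plain ASCII", "ascii"))   # True

# One non-ASCII character is enough to break the equivalence with Latin-1.
print(same_bytes_as_utf8("café", "latin-1"))        # False
print("café".encode("latin-1"))                     # b'caf\xe9'
print("café".encode("utf-8"))                       # b'caf\xc3\xa9'

# Latin-1 bytes are not even valid UTF-8 here.
print(decodes_as_utf8("café".encode("latin-1")))    # False
```

Reading the UTF-8 bytes as Latin-1 does not fail, but it silently yields mojibake ("cafÃ©"), which is arguably the worse failure mode.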
Aug 18, 2011 at 21:32 history made wiki Post Made Community Wiki
Aug 16, 2011 at 8:29 comment added Malcolm @tchrist Do you have a source for your statistics? Though if good programmers are scarce, I think this is good, because we become more valuable. :) As for the Java APIs, char-based parts may eventually get deprecated, but this is not a guarantee that they won't be used. And they definitely won't be removed for compatibility reasons.
Aug 15, 2011 at 19:40 comment added tchrist @Malcolm: You write “One must be a very ignorant developer to not know that UTF-16 is not fixed length.” Well, welcome to the real world! At least 19/20 of them know about it at best extremely nebulously, and cannot even process a string by code points to save their lives. That is the reality. I know, because I’ve tested them on it. Until Java deprecates the whole char botchup you will always have crappy code full of BMP idiocies. All APIs need to be by int-32 code point, and the un-Unicode-Character char-16 versions deprecated into oblivion. Really they do.
Aug 13, 2011 at 10:30 comment added Malcolm @tchrist One must be a very ignorant developer to not know that UTF-16 is not fixed length. If you start with Wikipedia, you will read the following at the very top: "It produces a variable-length result of either one or two 16-bit code units per code point". Unicode FAQ says the same: unicode.org/faq//utf_bom.html#utf16-1. I don't know how UTF-16 can deceive anybody if it is written everywhere that it is variable length. As for the method, it was never designed for UTF-16 and shouldn't be considered Unicode, as simple as that.
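The "one or two 16-bit code units per code point" rule quoted above can be illustrated with a short Python sketch. The function names `utf16_code_units` and `surrogate_pair` are hypothetical helpers, not part of any library; the surrogate arithmetic follows the standard UTF-16 definition.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to encode `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


# Inside the BMP: one code unit per code point.
print(utf16_code_units("A"))           # 1
print(utf16_code_units("\u20ac"))      # 1 (EURO SIGN)

# Outside the BMP: one code point, two code units.
emoji = "\U0001F600"
print(len(emoji))                      # 1 code point
print(utf16_code_units(emoji))         # 2 code units

high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))             # 0xd83d 0xde00
```

Code that indexes or counts by code unit is therefore only correct for text that stays inside the BMP, which is the situation the surrounding comments argue about.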
Aug 11, 2011 at 14:42 comment added tchrist No, UTF-16 is not simpler. It is harder. It misleads and deceives you into thinking it is fixed width. All such code is broken and all the more so because you don’t notice until it’s too late. CASE IN POINT: I just found yet another stupid UTF-16 bug in the Java core libraries yesterday, this time in String.equalsIgnoreCase, which was left in UCS-2 braindeath buggery, and so fails on 16/17 valid Unicode code points. How long has that code been around? No excuse for it to be buggy. UTF-16 leads to sheer stupidity and an accident waiting to happen. Run screaming from UTF-16.
Apr 2, 2010 at 15:33 comment added Malcolm Theoretically, yes. In practice there are such things as, say, UTF-16BE, which means UTF-16 in big endian without a BOM. This is not something I made up, this is an actual encoding allowed in ID3v2.4 tags (ID3v2 tags suck, but are, unfortunately, widely used). And in such cases you have to define endianness externally, because the text itself doesn't contain a BOM. UTF-8 is always written one way and it doesn't have such a problem.
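The byte-order point can be sketched in Python as well. This only uses the standard `codecs` module and the built-in `utf-16`, `utf-16-be` and `utf-16-le` codecs; the sample text is an arbitrary choice for illustration.

```python
import codecs

text = "A"

# Without a BOM, the same text has two different byte layouts.
big = text.encode("utf-16-be")       # b'\x00A'
little = text.encode("utf-16-le")    # b'A\x00'
print(big, little)

# Guessing the wrong order does not fail; it yields a different character.
wrong = big.decode("utf-16-le")
print(hex(ord(wrong)))               # 0x4100

# With a BOM, the generic "utf-16" codec detects the order and strips the BOM.
print((codecs.BOM_UTF16_BE + big).decode("utf-16"))      # A
print((codecs.BOM_UTF16_LE + little).decode("utf-16"))   # A

# UTF-8 has a single byte layout, so there is nothing to guess.
print("\u00e9".encode("utf-8"))      # b'\xc3\xa9'
```

For BOM-less formats such as the UTF-16BE case mentioned above, the decoder has to be told the byte order out of band, here by choosing `utf-16-be` explicitly.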
Apr 2, 2010 at 14:17 comment added Joey @Malcolm: UTF-16 also has no problems with byte order as it requires a BOM which specifies the order :-)
Jun 26, 2009 at 16:57 comment added Malcolm UTF-16 is simpler for anything inside BMP, that's why it is used so widely. But I'm a fan of UTF-8 too, it also has no problems with byte order, which works to its advantage.
Jun 26, 2009 at 16:49 history answered rmeador CC BY-SA 2.5