Timeline for Should UTF-16 be considered harmful?

Current License: CC BY-SA 2.5

19 events

when	what		by	license	comment
Aug 13, 2015 at 16:25	history	unlocked	Thomas Owens♦
Aug 13, 2015 at 16:05	history	locked	CommunityBot
Jun 18, 2014 at 21:05	comment	added	musiphil		@tchrist: The fact that Perl programs automatically handle Unicode (to some degree) has little to do with the fact that Perl uses UTF-8 internally; it would still hold if Perl used UTF-16 or UTF-32 internally, as there's nothing in UTF-16 or UTF-32 that hinders correct Unicode handling.
Dec 4, 2012 at 15:31	comment	added	Roman Starkov		What about `foreach (var c in myString)` in something like .NET? How often do you remember that this is UTF-16 you're dealing with? Let me guess: almost never.
Aug 18, 2011 at 21:32	history	made wiki			Post Made Community Wiki
Aug 16, 2011 at 21:13	comment	added	Voo		@tchrist Nice, but that's not the default string implementation in Perl, is it? (At least not for my version!) Nobody argues that additional libraries can't solve the problem correctly, but I don't see why they'd have a harder time if the code points were encoded with UTF-16, 8 or 32.
Aug 15, 2011 at 19:29	comment	added	tchrist		@Voo: I have no idea what you are talking about. Oh wait, maybe I do. The Unicode::GCString class has strings made of grapheme clusters instead of code points. `perl -CS -MUnicode::GCString -le 'print Unicode::GCString::->new("ab\x{63}\x{327}cde")->substr(0, 3)'` will dutifully print out abç (which is `abc\x{327}`). Isn’t that all you want? Piece of cake.
Aug 15, 2011 at 15:48	comment	added	Voo		@tchrist Great example! Because Perl handles combined characters just as badly as any other well-known programming language out there (its re module can handle them more sensibly, but it's probably not the only one in that regard). `abU+0063U+0327cde` - a substring containing the first 3 characters really shouldn't return abc, you know. So it doesn't handle "all of Unicode perfectly", and the parts it does handle have to do with how it implemented its string library, not with what encoding is used.
Aug 11, 2011 at 14:50	comment	added	tchrist		@iconiK: Don’t be silly. **UTF-16 is absolutely not the de facto standard for processing text.** Show me a programming language more suited to text processing than Perl, which has always (well, for more than a decade) used abstract characters with an underlying UTF-8 representation internally. Because of this, every Perl program automatically handles all Unicode without the user having to constantly monkey around with idiotic surrogates. The length of a string is its count in code points, not code units. Anything else is sheer stupidity putting the backwards into backwards compatibility.
Mar 17, 2010 at 14:31	comment	added	Mircea Chirea		Artyom, SO doesn't NEED to use UTF-16, since UTF-8 is the de facto standard for storage and communication of text, while UTF-16 is the de facto standard for processing of text. I don't know of any web page using UTF-16, and it wouldn't be really bold to do so, especially since a really popular language has no Unicode support: PHP (and UTF-16 isn't really easy to deal with; UTF-8 is the standard encoding in most Linux installs, where PHP is commonly run).
Jul 12, 2009 at 6:50	comment	added	Artyom		I would pick UTF-8 or UTF-32, which are, respectively, a variable-length encoding in almost all cases (including the BMP) or a fixed-length encoding always.
Jun 29, 2009 at 16:46	comment	added	patjbs		Not to try and flog a dead horse here, but if you shouldn't pick UTF-16 as the reasonable standard, what should you pick? I'm interested in your perspective on what an acceptable alternative would be. For instance, a lot of my work involves working with ancient languages (Greek, Aramaic, Hebrew, Syriac, etc.), and I work a lot with these oddball Unicode characters, so I'm constantly having to transition documents between UTF-8, 16 and 32.
Jun 27, 2009 at 8:07	comment	added	Artyom		Actually, the problem is not with the standard. It is 100% OK. In fact, there are good implementations that work with UTF-16: ICU, Java Swing, etc. But the problem is that there are too many basic bugs in the processing of surrogate pairs when working with UTF-16, so you should probably never pick UTF-16 for the internal encoding of new applications... Because there are a lot of real-life examples where the nature of UTF-16 causes big trouble: even Stack Overflow can't deal with them.
Jun 26, 2009 at 16:42	comment	added	patjbs		Also, it should be noted that when Joel wrote that article, the UTF-8 standard WAS 6 bytes, not 4. RFC 3629 changed the standard to 4 bytes several months AFTER he wrote the article. Like most anything on the internet, it pays to read from more than one source, and to be aware of the age of your sources. The link wasn't intended to be the "end all be all", but rather a starting point.
Jun 26, 2009 at 16:40	comment	added	Malcolm		I agree with the last edit. The simplest example: we still use C and C++ though both languages use pointers and thus are not safe.
Jun 26, 2009 at 16:21	comment	added	patjbs		My point is that those character points are designed and implemented for specific tasks. The "bugs" you describe are no different than the "bugs" one would encounter if you attempted to give input outside the scope of any application.
Jun 26, 2009 at 16:12	comment	added	Artyom		BTW: When I started writing this article I almost wanted to write "Should the Joel on Software article on Unicode be considered harmful?" because there are many mistakes. For example: UTF-8 encoding takes up to 4 bytes and not 6. Also, it does not distinguish between UCS-2 and UTF-16, which are really different -- and actually cause the problems I talk about.
Jun 26, 2009 at 16:12	comment	added	RichieHindle		-1: How about addressing some of Artyom's objections, rather than just patronising him?
Jun 26, 2009 at 16:09	history	answered	patjbs	CC BY-SA 2.5
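
Several comments above turn on the difference between code points, UTF-16 code units, UTF-8 bytes, and combining sequences. The following is a minimal sketch in Python (not a language used in the thread) that makes those counts concrete. The helper names `utf16_code_units` and `utf8_bytes` are made up for illustration, and NFC normalization is used here only because it happens to compose the `c` + U+0327 pair from the example; it is not a general substitute for grapheme-cluster segmentation such as Perl's Unicode::GCString.

```python
import unicodedata


def utf16_code_units(s: str) -> int:
    """Count the 16-bit code units needed to encode s as UTF-16 (no BOM)."""
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s: str) -> int:
    """Count the bytes needed to encode s as UTF-8."""
    return len(s.encode("utf-8"))


# A character outside the BMP: one code point, but a surrogate pair in UTF-16.
g_clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
print(len(g_clef))               # 1 code point
print(utf16_code_units(g_clef))  # 2 code units (surrogate pair)
print(utf8_bytes(g_clef))        # 4 bytes

# The highest code point, U+10FFFF, still fits in 4 UTF-8 bytes (RFC 3629).
print(utf8_bytes("\U0010FFFF"))  # 4

# The combining-character example from the comments: "c" followed by
# U+0327 COMBINING CEDILLA is two code points but one visible character.
text = "ab\u0063\u0327cde"
print(len(text))                 # 7 code points
print(text[:3])                  # "abc": the cedilla is cut off

# NFC composes c + U+0327 into U+00E7, so slicing keeps the cedilla here.
composed = unicodedata.normalize("NFC", text)
print(composed[:3] == "ab\u00e7")  # True
```

Slicing by code point avoids splitting surrogate pairs, but, as the `text[:3]` line shows, it can still split a base letter from its combining mark regardless of which encoding is used internally.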
