12 events
when | what | by | license | comment
Aug 13, 2015 at 16:25 history unlocked Thomas Owens
Aug 13, 2015 at 16:05 history locked CommunityBot
May 13, 2012 at 14:16 comment added Andy Dent @tchrist, I will grant you that strings are virtually always processed sequentially if you include reverse iteration as "sequential" and stretch that a little further to include comparing the trailing end of a string to a known string. Two very common scenarios are truncating whitespace from the end of strings and checking the file extension at the end of a path.
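A minimal sketch of the trailing-end case Andy Dent describes, assuming a valid UTF-8 byte string; the helper name prev_char_start is illustrative, not from any library. Because every UTF-8 continuation byte matches the bit pattern 10xxxxxx, backing up to the previous character boundary only means skipping such bytes, and suffix checks work directly on the encoded bytes:

```python
def prev_char_start(buf: bytes, i: int) -> int:
    """Return the index where the character that ends just before index i begins.

    UTF-8 continuation bytes all match 0b10xxxxxx, so finding the previous
    boundary only requires skipping them until a lead byte is reached.
    """
    i -= 1
    while i > 0 and (buf[i] & 0xC0) == 0x80:  # 0b10xxxxxx: continuation byte
        i -= 1
    return i

data = "café.txt".encode("utf-8")
print(data.endswith(b".txt"))                   # True: suffix check needs no indexing
print(data[prev_char_start(data, len(data)):])  # b't': the last character
```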
Aug 18, 2011 at 21:32 history made wiki Post Made Community Wiki
Aug 11, 2011 at 20:38 comment added tchrist @Ian Boyd, the need to access a string’s individual characters in a random-access pattern is incredibly overstated. It is about as common as wanting to compute the diagonal of a matrix of characters, which is super rare. Strings are virtually always processed sequentially, and since advancing from UTF-8 character N to character N+1 is O(1), there is no issue. There is surpassingly little need for random access into strings. Whether you think it is worth the storage space to go to UTF-32 instead of UTF-8 is your own opinion, but for me it is altogether a non-issue.
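A minimal sketch of the O(1) step tchrist refers to, assuming valid UTF-8 with i on a character boundary; the function name is illustrative. The lead byte alone determines how many bytes the current character occupies, so no scan is needed:

```python
def next_char_start(buf: bytes, i: int) -> int:
    """Given index i at the start of a UTF-8 character, return the start of the next one."""
    b = buf[i]
    if b < 0x80:     # 0xxxxxxx: 1-byte sequence (ASCII)
        return i + 1
    elif b < 0xE0:   # 110xxxxx: 2-byte sequence
        return i + 2
    elif b < 0xF0:   # 1110xxxx: 3-byte sequence
        return i + 3
    else:            # 11110xxx: 4-byte sequence
        return i + 4

data = "aé€😀".encode("utf-8")
i = 0
while i < len(data):
    j = next_char_start(data, i)
    print(data[i:j].decode("utf-8"))  # prints a, é, €, 😀 one per line
    i = j
```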
Aug 11, 2011 at 15:35 comment added Ian Boyd "I do not understand advocating general use of Utf-8. It is variable length encoded (can not be accessed by index)"
Aug 11, 2011 at 14:58 comment added Kerrek SB @tchrist: You're right, UCS-2 isn't an encoding, it's a subset. In that sense, all encodings for Unicode must by definition be able to represent all Unicode code points. Fair point.
Aug 11, 2011 at 14:33 comment added tchrist @Kerrek: Incorrect: UCS-2 is not a valid Unicode encoding. All UTF-* encodings by definition can represent any Unicode code point that is legal for interchange. UCS-2 can represent far fewer than that, plus a few (the surrogates) that are not legal for interchange. Repeat: UCS-2 is not a valid Unicode encoding, any more so than ASCII is.
Jun 9, 2011 at 11:34 comment added Kerrek SB @Malcolm: Not all encodings cover all code points. UCS-2 is, if you will, the fixed-size subset of UTF-16; it only covers the BMP.
Jan 23, 2011 at 14:16 comment added Malcolm Exactly, all the encodings cover all the code points; and as for running out of available code points, I don't see how that could happen in the foreseeable future. Most supplementary planes are still unused, and even the used ones aren't full yet. So given the total sizes of the known writing systems that remain, it is very possible that most planes will never be used, unless code points start being assigned to something other than writing systems. By the way, UTF-8 can theoretically use sequences of up to 6 bytes, so it could represent even more code points than UTF-32 (which is restricted to U+10FFFF), but what's the point?
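Back-of-envelope arithmetic behind these capacity claims, using standard Unicode facts rather than anything stated in the post: 17 planes of 65,536 code points give 1,114,112 in total, a 4-byte UTF-8 sequence carries 21 payload bits, and the original (pre-RFC 3629) 6-byte form carried 31:

```python
def payload_bits(n: int) -> int:
    # A 1-byte sequence carries 7 bits; an n-byte sequence (n >= 2) carries
    # (7 - n) bits in the lead byte plus 6 bits per continuation byte.
    return 7 if n == 1 else (7 - n) + 6 * (n - 1)

print(17 * 0x10000)     # 1114112 code points: U+0000..U+10FFFF
print(payload_bits(4))  # 21 bits, enough for U+10FFFF
print(payload_bits(6))  # 31 bits, up to U+7FFFFFFF in the original 6-byte scheme
```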
Jan 21, 2011 at 15:06 comment added Artyom Note: UTF-16 does cover all of Unicode, since the Unicode Consortium decided that U+10FFFF is the top of the Unicode range, defined UTF-8 with a maximum length of 4 bytes, and explicitly excluded the range 0xD800-0xDFFF from the valid code points; that range is used to form surrogate pairs. So any valid Unicode text can be represented with any one of these encodings. As for future growth: it does not seem that roughly 1 million code points could ever be too few, even in the far future.
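A sketch of the surrogate-pair construction Artyom mentions, following the standard UTF-16 algorithm; the helper name is illustrative. A supplementary code point is shifted down by 0x10000 and its 20 remaining bits are split across the two reserved 10-bit surrogate ranges:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (v >> 10)         # high surrogate takes the top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # low surrogate takes the bottom 10 bits
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00'] for U+1F600
```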
Jan 21, 2011 at 12:06 history answered Pavel Machyniak CC BY-SA 2.5