12 events
when | what | by | license | comment
Aug 13, 2015 at 16:25 history unlocked Thomas Owens
Aug 13, 2015 at 16:05 history locked CommunityBot
May 13, 2012 at 14:16 comment added Andy Dent @tchrist, I will grant you that strings are virtually always processed sequentially if you include reverse iteration as "sequential" and stretch that a little further to include comparing the trailing end of a string to a known string. Two very common scenarios are truncating whitespace from the end of strings and checking the file extension at the end of a path.
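A minimal sketch of the trailing-end case Andy Dent describes, assuming a valid UTF-8 byte string; the helper name prev_char_start is illustrative, not from any library. Because every UTF-8 continuation byte matches the bit pattern 10xxxxxx, backing up to the previous character boundary only means skipping such bytes, and suffix checks work directly on the encoded bytes:

```python
def prev_char_start(buf: bytes, i: int) -> int:
    """Return the index where the character that ends just before index i begins.

    UTF-8 continuation bytes all match 0b10xxxxxx, so finding the previous
    boundary only requires skipping them until a lead byte is reached.
    """
    i -= 1
    while i > 0 and (buf[i] & 0xC0) == 0x80:  # 0b10xxxxxx: continuation byte
        i -= 1
    return i

data = "café.txt".encode("utf-8")
print(data.endswith(b".txt"))                   # True: suffix check needs no indexing
print(data[prev_char_start(data, len(data)):])  # b't': the last character
```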
Aug 18, 2011 at 21:32 history made wiki Post Made Community Wiki
Aug 11, 2011 at 20:38 comment added tchrist @Ian Boyd, the need to access a string’s individual characters in a random-access pattern is incredibly overstated. It is about as common as wanting to compute the diagonal of a matrix of characters, which is super rare. Strings are virtually always processed sequentially, and since advancing from UTF-8 character N to character N+1 is O(1), there is no issue. There is surpassingly little need for random access into strings. Whether you think it is worth the storage space to go to UTF-32 instead of UTF-8 is your own opinion, but for me it is altogether a non-issue.
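A minimal sketch of the O(1) step tchrist refers to, assuming valid UTF-8 with i on a character boundary; the function name is illustrative. The lead byte alone determines how many bytes the current character occupies, so no scan is needed:

```python
def next_char_start(buf: bytes, i: int) -> int:
    """Given index i at the start of a UTF-8 character, return the start of the next one."""
    b = buf[i]
    if b < 0x80:     # 0xxxxxxx: 1-byte sequence (ASCII)
        return i + 1
    elif b < 0xE0:   # 110xxxxx: 2-byte sequence
        return i + 2
    elif b < 0xF0:   # 1110xxxx: 3-byte sequence
        return i + 3
    else:            # 11110xxx: 4-byte sequence
        return i + 4

data = "aé€😀".encode("utf-8")
i = 0
while i < len(data):
    j = next_char_start(data, i)
    print(data[i:j].decode("utf-8"))  # prints a, é, €, 😀 one per line
    i = j
```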
Aug 11, 2011 at 15:35 comment added Ian Boyd "I do not understand advocating general use of Utf-8. It is variable length encoded (can not be accessed by index)"
Aug 11, 2011 at 14:58 comment added Kerrek SB @tchrist: You're right, UCS-2 isn't an encoding, it's a subset. In that sense, all encodings for Unicode must by definition be able to represent all Unicode code points. Fair point.
Aug 11, 2011 at 14:33 comment added tchrist @Kerrek: Incorrect: UCS-2 is not a valid Unicode encoding. All UTF-* encodings by definition can represent any Unicode code point that is legal for interchange. UCS-2 can represent far fewer than that, plus a few (the surrogates) that are not legal for interchange. Repeat: UCS-2 is not a valid Unicode encoding, any more so than ASCII is.
Jun 9, 2011 at 11:34 comment added Kerrek SB @Malcolm: Not all encodings cover all code points. UCS-2 is, if you will, the fixed-size subset of UTF-16; it only covers the BMP.
Jan 23, 2011 at 14:16 comment added Malcolm Exactly, all the encodings cover all the code points; and as for running out of available code points, I don't see how that could happen in the foreseeable future. Most supplementary planes are still unused, and even the used ones aren't full yet. So given the total sizes of the known writing systems that remain, it is very possible that most planes will never be used, unless code points start being assigned to something other than writing systems. By the way, UTF-8 can theoretically use sequences of up to 6 bytes, so it could represent even more code points than UTF-32 (which is restricted to U+10FFFF), but what's the point?
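Back-of-envelope arithmetic behind these capacity claims, using standard Unicode facts rather than anything stated in the post: 17 planes of 65,536 code points give 1,114,112 in total, a 4-byte UTF-8 sequence carries 21 payload bits, and the original (pre-RFC 3629) 6-byte form carried 31:

```python
def payload_bits(n: int) -> int:
    # A 1-byte sequence carries 7 bits; an n-byte sequence (n >= 2) carries
    # (7 - n) bits in the lead byte plus 6 bits per continuation byte.
    return 7 if n == 1 else (7 - n) + 6 * (n - 1)

print(17 * 0x10000)     # 1114112 code points: U+0000..U+10FFFF
print(payload_bits(4))  # 21 bits, enough for U+10FFFF
print(payload_bits(6))  # 31 bits, up to U+7FFFFFFF in the original 6-byte scheme
```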
Jan 21, 2011 at 15:06 comment added Artyom Note: UTF-16 does cover all of Unicode, since the Unicode Consortium decided that U+10FFFF is the top of the Unicode range, defined UTF-8 with a maximum length of 4 bytes, and explicitly excluded the range 0xD800-0xDFFF from the valid code points; that range is used to form surrogate pairs. So any valid Unicode text can be represented with any one of these encodings. As for future growth: it does not seem that roughly 1 million code points could ever be too few, even in the far future.
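A sketch of the surrogate-pair construction Artyom mentions, following the standard UTF-16 algorithm; the helper name is illustrative. A supplementary code point is shifted down by 0x10000 and its 20 remaining bits are split across the two reserved 10-bit surrogate ranges:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (v >> 10)         # high surrogate takes the top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # low surrogate takes the bottom 10 bits
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00'] for U+1F600
```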
Jan 21, 2011 at 12:06 history answered Pavel Machyniak CC BY-SA 2.5