15 events
Apr 1, 2014 at 16:44 comment added user7043 @DonalFellows Could you give an example? Note that I encourage constant time indexing, but with byte indices rather than code point indices. What problem requires constant time access to the ith code point, yet not to the ith grapheme cluster, and can't make do with the nth byte? UTF-8 is no panacea for bad algorithms, but I don't know any problems where it prevents good algorithms.
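A minimal Python sketch of the byte-indexing approach user7043 describes above; the sample string and offsets are illustrative assumptions, not anything from the thread:

```python
s = "naïve"
b = s.encode("utf-8")            # 6 bytes for 5 code points

# Slicing by byte offset is O(1); b[2:4] is the two-byte encoding of 'ï'.
print(b[2:4].decode("utf-8"))    # ï

# UTF-8 is self-synchronizing: continuation bytes match 0b10xxxxxx, so
# from any byte you can back up to the start of its code point.
i = 3                            # lands mid-way through 'ï'
while b[i] & 0xC0 == 0x80:
    i -= 1
print(i)                         # 2, the first byte of 'ï'
```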
Apr 1, 2014 at 13:23 comment added Donal Fellows The key things: Use UTF-8 for external encodings (including database content). You can use other encodings internally to a program (and for some algorithms you get a substantive performance boost if you do so). Normalising your strings (to either NFC or NFD, but not both) can be a very good idea.
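A short Python illustration of the normalisation point, using the standard unicodedata module; the strings are made-up examples:

```python
import unicodedata

s1 = "café"            # 'é' as the precomposed code point U+00E9
s2 = "cafe\u0301"      # 'e' followed by combining acute accent U+0301

print(s1 == s2)        # False: equal to a reader, unequal code point by code point

# Normalising both to the same form (NFC here; NFD works equally well)
# makes comparisons behave as users expect.
nfc = unicodedata.normalize
print(nfc("NFC", s1) == nfc("NFC", s2))            # True
print(len(nfc("NFC", s2)), len(nfc("NFD", s1)))    # 4 5
```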
Apr 1, 2014 at 13:16 comment added Donal Fellows @delnan You oversimplify. There's a surprisingly large number of algorithms that require indexing into a string at arbitrary offsets and which don't have an easy transformation into streaming form. I say this because I maintain a library where we had to fix this; users were complaining that their code was terribly slow with large strings (yeah, because their O(N) code was now O(N²)!) Going round telling users “you're holding it wrong” when things used to work is just a way to make people upset with you.
Apr 1, 2014 at 12:55 comment added david.pfx @BartvanIngenSchenau: I agree entirely; they're easily confused. Sometimes it means the same as a code point (that's what most programmers would identify as a character in a program: whatever fits in a 32-bit wchar_t).
Apr 1, 2014 at 10:53 comment added Bart van Ingen Schenau @david.pfx: The meaning of 'character' is context dependent. Sometimes it means the same as a codepoint, but sometimes it means the same as a grapheme cluster (that is, what most non-programmers would identify as a character in displayed/printed text, i.e. a base character combined with all its diacritical marks)
Apr 1, 2014 at 10:43 comment added david.pfx @BartvanIngenSchenau: Ah, I see what you mean. According to en.wikipedia.org/wiki/Universal_Character_Set_characters there are 249,764 assigned code points, and the terms code point and character are more or less interchangeable. You were talking about 'characters including composed characters', of which there would seem to be arbitrarily many. Obviously the former can fit in 32 bits and the latter cannot.
Apr 1, 2014 at 10:10 comment added dj bazzie wazzie @Raphael Miedl: Thanks, that is exactly what I'm looking for. It's a character sequence used to represent another character. Bart's post implied that it was at the data level and that there are characters in the 21-bit Unicode range that need multiple 32-bit code points to represent one character. UTF-32 is therefore a fixed-width Unicode encoding; that was my point.
Apr 1, 2014 at 6:40 comment added Bart van Ingen Schenau @david.pfx: The Zalgo text in the link from @RaphaelMiedl is an extreme example, but that was exactly what I was referring to.
Apr 1, 2014 at 6:37 comment added user7043 -1 for the constant time access myth. Almost all string processing can be done equally fast and conveniently using any arbitrary unit (e.g. bytes or code units) for indexing and sequential iteration (from start and end). And as others point out, UTF-32 only gets you O(1) access to code points, but another important (I'd say more important) notion of character is the grapheme cluster, and in that regard UTF-32 gets you nowhere. See also: utf8everywhere.org/#myths
Apr 1, 2014 at 0:29 comment added AliciaBytes @djbazziewazzie actually Unicode allows unlimited use of combining characters, which destroys any claim of guaranteed O(1) access, even with UTF-32. Look at the interesting question: How does Zalgo text work? for one case where it's being used. Whilst one could argue whether such usage is really useful, it's still valid according to the specification (at least as far as I know) and therefore O(1) access to UTF-32 is nothing but a myth if you wanna stay universal.
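A small Python demonstration of the unlimited-combining-marks point (the Zalgo effect); the particular marks chosen are arbitrary:

```python
import unicodedata

# One user-perceived character: a base letter plus a stack of combining
# marks. Unicode imposes no hard limit on how many can be attached.
zalgo = "Z" + "\u0301\u0316\u035B" * 4        # 13 code points, 1 grapheme
print(len(zalgo))                             # 13, even counted in UTF-32 units
print(all(unicodedata.combining(c) for c in zalgo[1:]))   # True
```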
Apr 1, 2014 at 0:13 comment added david.pfx @BartvanIngenSchenau: I think that is no longer correct. Since RFC 3629 in 2003, UTF-8/16/32 are all limited to 4 bytes per code point.
Mar 31, 2014 at 20:11 vote accept Electric Coffee
Mar 31, 2014 at 10:42 comment added dj bazzie wazzie I think you mean UTF-16, Bart. UTF-32 can store all Unicode values in its 32-bit integer. There is, however, a composed and a decomposed notation for special characters, which means I can store a character and its diacritical mark as two separate values. But that's part of Unicode rather than the Unicode data encoding.
Mar 31, 2014 at 9:27 comment added Bart van Ingen Schenau Depending on the kind of processing, even UTF-32 doesn't provide a fixed-width encoding. For example, accented characters in non-Latin scripts (and uncommon accented characters in Latin scripts) are represented by a sequence of multiple Unicode codepoints (multiple UTF-32 'characters').
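To make the point above concrete, a Python example with a character that has no precomposed form, so it remains two code points (two UTF-32 units) even after NFC normalisation; 'q' with dot above is one commonly cited example of such a character:

```python
import unicodedata

s = "q\u0307"    # 'q' + COMBINING DOT ABOVE: no precomposed code point exists

print(len(unicodedata.normalize("NFC", s)))    # 2: NFC cannot compose it
print(len(s.encode("utf-32-be")) // 4)         # 2 UTF-32 code units for 1 'character'
```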
Mar 31, 2014 at 8:54 history answered Donal Fellows CC BY-SA 3.0