10 events
when what by license comment
Sep 11, 2023 at 2:02 comment added Jim Balter You said that UTF-8, not UCS-4 aka UTF-32, is "trivially compatible with Latin-1". Again, that is simply wrong. "you can skip the mapping step and just perform mechanical conversion between the 8-bit values and UTF-8" -- a mapping is a conversion and all such things are "mechanical". Sure, a Latin1 character can be converted to a code point by a byte-to-long operation, but this is irrelevant and I doubt that many implementations take this shortcut/special case. For example, on Windows, Latin1 is just one of numerous code pages.
Sep 10, 2023 at 18:10 comment added Adrian McCarthy @Jim Balter: The ability to extend UTF-8 is not limited by UTF-16. As others have pointed out, Unicode is (among other things) a character repertoire and UTF-8 and -16 are ways we represent a series of characters from that repertoire. My thesis: Growth pressure could exhaust Unicode's 21-bit code point space. If that happens UTF-8 scheme can be extended (by removing artificial limitations) to handle a larger code point space while remaining compatible with today's UTF-8 data. You're right that UTF-16 cannot, but that—to me—doesn't seem relevant to the question nor my thesis.
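As an illustration of the extension described in this comment, here is a minimal sketch of an encoder that follows the lead-byte pattern of UTF-8 past four bytes, in the style of the 5- and 6-byte forms that the earlier UTF-8 definition (RFC 2279) allowed before RFC 3629 limited UTF-8 to U+10FFFF. The function name `encode_extended` is hypothetical, and the sketch does not reject surrogates or overlong forms.

```python
def encode_extended(cp: int) -> bytes:
    """Encode an integer as UTF-8, allowing 5- and 6-byte forms (up to 31 bits).

    Hypothetical helper for illustration; today's UTF-8 stops at 4 bytes.
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])  # ASCII: a single byte
    # (total bytes, exclusive upper limit, lead-byte marker)
    forms = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),   # beyond today's UTF-8
        (6, 0x80000000, 0xFC),  # beyond today's UTF-8
    )
    for nbytes, limit, lead in forms:
        if cp < limit:
            tail = []
            for _ in range(nbytes - 1):
                tail.append(0x80 | (cp & 0x3F))  # low 6 bits per continuation byte
                cp >>= 6
            return bytes([lead | cp] + tail[::-1])
    raise ValueError("value does not fit in 31 bits")
```

For values up to U+10FFFF this produces the same bytes as the current encoding, which is the backward compatibility the comment refers to.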
Sep 10, 2023 at 17:43 comment added Adrian McCarthy @Jim Balter: Sorry I wasn't clearer. To translate an 8-bit encoding into UTF-8 you first have to map the 8-bit values to the corresponding Unicode code points before (or as) you can convert those code points to UTF-8. Since the first 256 code points of Unicode are identical to Latin-1, you can skip the mapping step and just perform mechanical conversion between the 8-bit values and UTF-8. That's what I intended when I said trivial. I probably shouldn't have even mentioned Latin-1, but I expected somebody would nit-pick if I hadn't.
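The conversion described in this comment can be sketched as follows. This is a minimal illustration, assuming the input is ISO 8859-1 (not Windows-1252, which assigns different characters to bytes 0x80 to 0x9F); the function name `latin1_to_utf8` is hypothetical.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert ISO 8859-1 bytes to UTF-8 without a lookup table.

    Each byte value is used directly as the Unicode code point.
    """
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                  # ASCII range: byte is unchanged
        else:
            out.append(0xC0 | (b >> 6))    # lead byte carries the top 2 bits
            out.append(0x80 | (b & 0x3F))  # continuation byte carries the low 6 bits
    return bytes(out)
```

Note that the output bytes differ from the input for every value at or above 0x80, so the two encodings are not byte-compatible; only the code point numbering is shared.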
Sep 9, 2023 at 13:04 comment added Jim Balter "UTF-8 can be extended further and still remain backward compatible with itself, by adding 5- and 6-byte encodings." -- That can't happen because those code points would not be representable in UTF-16, which is a very widely used encoding. From en.wikipedia.org/wiki/UTF-16 "UTF-16 will never be extended to support a larger number of code points or to support the code points that were replaced by surrogates, as this would violate the Unicode Stability Policy with respect to general category or surrogate code points."
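The UTF-16 ceiling mentioned in this comment follows from surrogate-pair arithmetic. A minimal sketch, with hypothetical helper names `to_surrogates` and `from_surrogates`:

```python
HIGH_BASE = 0xD800  # first high (lead) surrogate
LOW_BASE = 0xDC00   # first low (trail) surrogate

def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000  # 20-bit value
    return HIGH_BASE + (offset >> 10), LOW_BASE + (offset & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    """Combine a surrogate pair back into a code point."""
    return 0x10000 + ((high - HIGH_BASE) << 10) + (low - LOW_BASE)

# The largest pair (0xDBFF, 0xDFFF) gives the largest code point UTF-16 can express.
MAX_UTF16_CODE_POINT = from_surrogates(0xDBFF, 0xDFFF)
```

Each surrogate carries 10 bits, so a pair covers 2^20 code points above U+FFFF, which is where the U+10FFFF limit comes from.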
Sep 9, 2023 at 12:59 comment added Jim Balter @JoelFan The explanation is that Adrian is simply wrong. Unicode of course includes those characters, but their UTF-8 encodings are quite different from the Latin1 encodings (and people often see this when the wrong encoding is used).
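A short illustration of the difference this comment points to, using Python's built-in codecs. The variable names are only for this example.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = text.encode("latin-1")  # one byte: 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9

# Reading UTF-8 bytes as Latin-1 yields two unrelated characters (mojibake).
misread = utf8_bytes.decode("latin-1")

# Reading the Latin-1 byte as UTF-8 fails, because 0xE9 alone is an
# incomplete multi-byte sequence.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```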
Mar 16, 2023 at 19:11 comment added JoelFan Can you explain what you mean by "trivially compatible with Latin-1"?
Jun 16, 2020 at 11:01 comment added TRiG There are already problems representing some human languages as linear streams of code points. Look up the wrangles about representing Sutton SignWriting in Unicode.
Jun 15, 2020 at 21:03 comment added supercat "It's not too hard to imagine at least some styling creeping back into the textual representation." It already has. Reading the sentence 'The Hebrew words for "one", "two" and "three" are "אחד", "שתיים", and "שלוש", respectively', would you guess that the Hebrew word for "one" is "אחד", and the word for "two" is "שתיים"?
Jun 15, 2020 at 17:32 review First posts
Jun 15, 2020 at 20:57
Jun 15, 2020 at 17:28 history answered Adrian McCarthy CC BY-SA 4.0