Timeline for Is UTF-8 the final character encoding for all future time?
Current License: CC BY-SA 4.0
10 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Sep 11, 2023 at 2:02 | comment | added | Jim Balter | You said that UTF-8, not UCS-4 aka UTF-32, is "trivially compatible with Latin-1". Again, that is simply wrong. "you can skip the mapping step and just perform mechanical conversion between the 8-bit values and UTF-8" -- a mapping is a conversion and all such things are "mechanical". Sure, a Latin1 character can be converted to a code point by a byte-to-long operation, but this is irrelevant and I doubt that many implementations take this shortcut/special case. For example, on Windows, Latin1 is just one of numerous code pages. | |
| Sep 10, 2023 at 18:10 | comment | added | Adrian McCarthy | @Jim Balter: The ability to extend UTF-8 is not limited by UTF-16. As others have pointed out, Unicode is (among other things) a character repertoire, and UTF-8 and UTF-16 are ways we represent a series of characters from that repertoire. My thesis: Growth pressure could exhaust Unicode's 21-bit code point space. If that happens, the UTF-8 scheme can be extended (by removing artificial limitations) to handle a larger code point space while remaining compatible with today's UTF-8 data. You're right that UTF-16 cannot, but that, to me, doesn't seem relevant to the question or to my thesis. | |
| Sep 10, 2023 at 17:43 | comment | added | Adrian McCarthy | @Jim Balter: Sorry I wasn't clearer. To translate an 8-bit encoding into UTF-8 you first have to map the 8-bit values to the corresponding Unicode code points before (or as) you can convert those code points to UTF-8. Since the first 256 code points of Unicode are identical to Latin-1, you can skip the mapping step and just perform mechanical conversion between the 8-bit values and UTF-8. That's what I intended when I said trivial. I probably shouldn't have even mentioned Latin-1, but I expected somebody would nit-pick if I hadn't. | |
| Sep 9, 2023 at 13:04 | comment | added | Jim Balter | "UTF-8 can be extended further and still remain backward compatible with itself, by adding 5- and 6-byte encodings." -- That can't happen because those code points would not be representable in UTF-16, which is a very widely used encoding. From en.wikipedia.org/wiki/UTF-16 "UTF-16 will never be extended to support a larger number of code points or to support the code points that were replaced by surrogates, as this would violate the Unicode Stability Policy with respect to general category or surrogate code points." | |
| Sep 9, 2023 at 12:59 | comment | added | Jim Balter | @JoelFan The explanation is that Adrian is simply wrong. Unicode of course includes those characters, but their UTF-8 encodings are quite different from the Latin1 encodings (and people often see this when the wrong encoding is used). | |
| Mar 16, 2023 at 19:11 | comment | added | JoelFan | Can you explain what you mean by "trivially compatible with Latin-1"? | |
| Jun 16, 2020 at 11:01 | comment | added | TRiG | There are already problems representing some human languages as linear streams of code points. Look up the wrangles about representing Sutton SignWriting in Unicode. | |
| Jun 15, 2020 at 21:03 | comment | added | supercat | "It's not too hard to imagine at least some styling creeping back into the textual representation." It already has. Reading the sentence 'The Hebrew words for "one", "two" and "three" are "אחד", "שתיים", and "שלוש", respectively', would you guess that the Hebrew word for "one" is "אחד", and the word for "two" is "שתיים"? | |
| Jun 15, 2020 at 17:32 | review | First posts | |||
| Jun 15, 2020 at 20:57 | |||||
| Jun 15, 2020 at 17:28 | history | answered | Adrian McCarthy | CC BY-SA 4.0 |
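
The Latin-1 disagreement in the comments above can be made concrete with a small sketch. It assumes only what the thread states: the first 256 Unicode code points coincide with Latin-1, so each byte can be used as its own code point and then encoded as UTF-8. The function name `latin1_to_utf8` is hypothetical, chosen for illustration. The sketch also shows the other side of the argument: for bytes at or above 0x80, the resulting UTF-8 bytes differ from the Latin-1 bytes.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert Latin-1 bytes to UTF-8 without a lookup table.

    Hypothetical helper: each Latin-1 byte value is treated directly
    as a Unicode code point (U+0000..U+00FF) and encoded as UTF-8.
    """
    out = bytearray()
    for b in data:
        if b < 0x80:
            # ASCII range: the UTF-8 encoding is the same single byte.
            out.append(b)
        else:
            # U+0080..U+00FF: two-byte form 110xxxxx 10xxxxxx.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)


sample = "café".encode("latin-1")    # 4 bytes, the last one is 0xE9
converted = latin1_to_utf8(sample)   # 5 bytes, 0xE9 becomes 0xC3 0xA9

# The conversion needs no table, but the byte sequences are not identical.
print(sample, converted)
```

Python's built-in `data.decode("latin-1").encode("utf-8")` gives the same result; the explicit loop is there only to show that no per-character mapping table is involved.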