If UTF-32 is UCS-4 restricted to 17 planes (1,114,112 code points, U+0000 through U+10FFFF), which requires only 21 bits, what is the fourth byte doing?

  • To put it bluntly: because it can, and because it's easy to work with 32-bit values in most computers. Commented Feb 12, 2017 at 22:08

1 Answer

The fourth byte is just sitting there, occupying space. It is padding and is always filled with 0s.

In theory, a 21-bit or 24-bit interchange format could have been designed. In practice, those are both quite awkward. Few (if any) modern computers have 21- or 24-bit datatypes. Since 32-bit words are easy to work with, it is quite common to use them to store numeric datatypes whose maxima are considerably less than 2^31 - 1.
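
To make the arithmetic concrete, here is a minimal sketch in C. The names `MAX_CODE_POINT`, `bits_needed` and `top_byte` are made up for this illustration and are not part of any standard API. It only checks that the largest code point fits in 21 bits, so the high-order byte of a valid UTF-32 code unit is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Largest Unicode code point: the last code point of plane 16. */
#define MAX_CODE_POINT 0x10FFFFu

/* Number of bits needed to represent v (returns 0 for v == 0). */
unsigned bits_needed(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* High-order byte of a UTF-32 code unit. The caller must pass a
 * value in the Unicode range; for such values the result is 0. */
uint8_t top_byte(uint32_t code_unit)
{
    assert(code_unit <= MAX_CODE_POINT);
    return (uint8_t)(code_unit >> 24);
}
```

Calling `bits_needed(MAX_CODE_POINT)` gives 21, which leaves 11 of the 32 bits unused in every code unit.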

5 Comments

I understand that a bit-aligned number could be difficult to implement on some platforms, but how is a 4-byte datatype easier to work with than a 3-byte datatype? sizeof(char32_t) could be 3, and int32_t should be used to store large integral numbers, no?
(Characters are integral numbers.) Suppose it were 3 bytes (so that it would really be char24_t :) ). What would its alignment requirement be? 3 is not a valid answer; no hardware has 3-byte alignment. But if its alignment is 2 or 4, how do you arrange a vector of them so that they are all aligned? If the alignment is one, what happens on hardware which can't do unaligned loads? Hardware tends not to have 3-byte loads. How do you get a char24_t into a register if it occupies the last three bytes of a page and the next page would page fault if accessed?
AFAIK memory alignment matters for SSE instructions, which perform multiple numeric operations in a single cycle. Strings have no such need, only sequential or random-access reads and writes. An aligned load might give a small speedup; I can see the benefit in, e.g., tokenizing a short string (at the price of 25% of the memory). But yes, it is a reason.
The point of wide characters is that you access them as single integers, not as a string of bytes. So you would normally expect to be able to load a single 21-bit character code as a single atomic load.
Characters are semantically not integers; we treat their bytes differently and use different instructions for them. Integers can be multiplied, so a large datatype is reasonable there. Characters are meant to be accessed as a string with frequent sequential operations. To replace a character, we just need fixed width, not alignment. I understand that using a 32-bit register may take one extra operation to load a 24-bit memory location. But with longer strings (which are often copied), the memory tradeoff for this micro-optimization is often too much. I accepted your answer.
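
The trade-off argued over in these comments can be sketched in C. The packed 24-bit layout below is hypothetical (no such standard encoding exists), and the function names are invented for this example. It assumes little-endian byte order within each 3-byte unit. The UTF-32 accessor is a single aligned load, while the packed accessor assembles the value from three byte loads so that it never reads past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UTF-32: each character is one naturally aligned 32-bit word. */
uint32_t utf32_get(const uint32_t *s, size_t i)
{
    return s[i];
}

/* Hypothetical packed 24-bit format: character i occupies bytes
 * 3*i .. 3*i+2. Reading uses three byte loads plus shifts, so no
 * byte outside the unit is ever touched. */
uint32_t packed24_get(const uint8_t *s, size_t i)
{
    const uint8_t *p = s + 3 * i;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16);
}

/* Store a code point into the hypothetical packed format. */
void packed24_set(uint8_t *s, size_t i, uint32_t cp)
{
    assert(cp <= 0x10FFFFu); /* must be in the Unicode range */
    uint8_t *p = s + 3 * i;
    p[0] = (uint8_t)(cp & 0xFFu);
    p[1] = (uint8_t)((cp >> 8) & 0xFFu);
    p[2] = (uint8_t)((cp >> 16) & 0xFFu);
}
```

Both forms give fixed-width random access. The packed form saves one byte per character at the cost of extra loads and shifts on every access, and whether that is worthwhile depends on the workload.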
