Why does UTF-32 use four bytes?

Question

If UTF-32 is UCS-4 restricted to 17 planes (1,114,112 code points, the highest being U+10FFFF = 1114111), which requires only 21 bits, what is the fourth byte doing?

To put it bluntly: because it can, and because it's easy to work with 32-bit values in most computers.
– Daniel Kamil Kozar, Commented Feb 12, 2017 at 22:08

rici · Accepted Answer · 2017-02-12 22:05:41Z

The fourth byte is just sitting there, occupying space (which is filled with 0s).

In theory, a 21-bit or 24-bit interchange format could have been designed. In practice, those are both quite awkward. Few (if any) modern computers have 21- or 24-bit datatypes. Since 32-bit words are easy to work with, it is quite common to use them to store numeric datatypes whose maxima are considerably less than 2³¹-1.
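The arithmetic behind this can be checked directly. The following is a minimal sketch, not part of the original answer; the helper name `utf32_be_bytes` and the sample code points are chosen only for illustration. It shows that the largest code point needs 21 bits and that the first (most significant) byte of a big-endian UTF-32 code unit is always zero.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of plane 16


def utf32_be_bytes(code_point: int) -> bytes:
    """Encode one code point as big-endian UTF-32 without a byte order mark."""
    return chr(code_point).encode("utf-32-be")


# Number of bits needed to represent the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Most significant byte of a few sample code points (illustrative choices).
samples = (0x41, 0x20AC, 0x1F600, MAX_CODE_POINT)
top_bytes = {utf32_be_bytes(cp)[0] for cp in samples}

print(bits_needed)                           # 21
print(utf32_be_bytes(MAX_CODE_POINT).hex())  # 0010ffff
print(top_bytes)                             # {0}
```

The three low bytes carry all 21 significant bits; the remaining 11 bits of the 32-bit unit are padding.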



5 Comments

Jan Turoň Over a year ago

I understand that a bit-aligned number could be difficult to implement on some platforms, but how is a 4-byte datatype easier to work with than a 3-byte datatype? sizeof(char32_t) could be 3, and int32_t should be used to store large integral numbers, no?

rici Over a year ago

(Characters are integral numbers.) Suppose it were 3 bytes (so that it would really be char24_t :) ). What would its alignment requirement be? 3 is not a valid answer; no hardware has 3-byte alignment. But if its alignment is 2 or 4, how do you arrange a vector of them so that they are all aligned? If the alignment is one, what happens on hardware which can't do unaligned loads? Hardware tends not to have 3-byte loads. How do you get a char24_t into a register if it occupies the last three bytes of a page and the next page would page fault if accessed?
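The alignment point can be illustrated with a small sketch. It assumes a hypothetical 3-byte little-endian layout; the names `pack24`, `load24`, and `load32` are made up for this example, and no such Unicode encoding form exists. Python hides the hardware details, so the sketch only shows the shape of the problem: a 3-byte element has to be assembled from separate byte reads and lands at offsets that are not multiples of 4, while a UTF-32 code unit is read as one 4-byte integer.

```python
import struct


def pack24(code_points):
    """Hypothetical 3-bytes-per-character layout (little-endian); not a real encoding."""
    return b"".join(cp.to_bytes(3, "little") for cp in code_points)


def load24(buf, index):
    """Assemble one character from three separate byte reads plus shifts."""
    base = 3 * index
    return buf[base] | (buf[base + 1] << 8) | (buf[base + 2] << 16)


def load32(buf, index):
    """Read one UTF-32LE code unit as a single 4-byte unsigned integer."""
    return struct.unpack_from("<I", buf, 4 * index)[0]


text = "a\u00e9\u20ac\U0001F600"  # four characters from different ranges
code_points = [ord(c) for c in text]

packed24 = pack24(code_points)
utf32 = text.encode("utf-32-le")  # no byte order mark

# Offset of each 3-byte element relative to a 4-byte boundary.
offsets_mod_4 = [(3 * i) % 4 for i in range(len(code_points))]

print(len(packed24), len(utf32))  # 12 16
print(offsets_mod_4)              # [0, 3, 2, 1]
```

Only every fourth element of the packed layout starts on a 4-byte boundary, which is the vector-of-elements problem raised in the comment; the price of avoiding it is 4 bytes of storage where 3 would carry the same information.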

Jan Turoň Over a year ago

AFAIK memory alignment is used by SSE instructions to perform multiple numeric operations in a single cycle. There is no such need with strings, only sequential or random-access reads and writes. An aligned load might give a small speed-up; I can see the benefit e.g. in tokenizing short strings (at a price of 25% of the memory). But yes, it is a reason.

rici Over a year ago

The point of wide characters is that you access them as single integers, not as a string of bytes. So you would normally expect to be able to load a single 21-bit character code as a single atomic load.

Jan Turoň Over a year ago

Characters are semantically not integers; we treat their bytes differently and use different instructions for them. Integers can be multiplied, so a large datatype is reasonable there. Characters are meant to be accessed as a string with frequent sequential operations. To replace a character, we just need fixed width, not alignment. I understand that using a 32-bit register may take one extra operation to load a 24-bit memory space. But with longer strings (often copied), the memory tradeoff for this micro-optimization is often too much. I accepted your answer.
