Length of a single character encoded in UTF-32

Question

Wikipedia says that UTF-32 uses 32 bits per character, so why does this give me a length of 64 bits?

>>> from bitstring import Bits  # third-party bitstring package
>>> Bits(bytes='a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes='a'.encode('utf-32')).bin)
64

UTF-32 is supposed to be a fixed-length, 4-byte encoding. As I understand it, every character should be represented in exactly 32 bits, yet the code above outputs 64. Why?

Martijn Pieters · Accepted Answer · 2017-10-05 07:02:46Z

Encoding to UTF-32 usually includes a Byte Order Mark (BOM), so your output contains two codepoints encoded to UTF-32: the BOM followed by 'a'. The BOM is usually required as it lets the decoder know whether the data was encoded in little-endian or big-endian order. The BOM is really just the U+FEFF ZERO WIDTH NO-BREAK SPACE codepoint, which is encoded to '11111111111111100000000000000000' (little-endian) in your example.
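To see the two parts separately, the encoded bytes can be split after the first 4 bytes. This is a minimal sketch using only the standard library; the variable names are illustrative, and it assumes the plain 'utf-32' codec writes the BOM in the machine's native byte order, which is what `codecs.BOM_UTF32` represents:

```python
import codecs
import sys

encoded = 'a'.encode('utf-32')

# First 4 bytes: the BOM. Remaining 4 bytes: the character itself.
bom, payload = encoded[:4], encoded[4:]

# Pick the BOM-less codec that matches this machine's byte order.
native_codec = 'utf-32-le' if sys.byteorder == 'little' else 'utf-32-be'

print(len(encoded) * 8)                       # 64
print(bom == codecs.BOM_UTF32)                # True
print(payload == 'a'.encode(native_codec))    # True
print(encoded.decode('utf-32'))               # a (the decoder consumes the BOM)
```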

Encode to one of the two endian-specific variants Python provides ('utf-32-le' or 'utf-32-be') to get a single character:

>>> Bits(bytes='a'.encode('utf-32-le')).bin
'01100001000000000000000000000000'
>>> len(Bits(bytes='a'.encode('utf-32-le')).bin)
32

The -le and -be variants let you encode or decode UTF-32 without a BOM, because you explicitly set the byte order.
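As a quick illustration of the two variants, the sketch below encodes the same character with each and prints the bytes as hex. It uses only built-in string and bytes methods; the variable names are illustrative:

```python
le = 'a'.encode('utf-32-le')
be = 'a'.encode('utf-32-be')

print(le.hex())  # 61000000
print(be.hex())  # 00000061

# Both are exactly 4 bytes: no BOM is written.
print(len(le), len(be))  # 4 4

# Decoding must name the same byte order, since no BOM is present to detect it.
print(le.decode('utf-32-le'), be.decode('utf-32-be'))  # a a
```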

Had you encoded more than one character, you'd have noticed that there are always 4 bytes more than the number of characters would require:

>>> len('abcd'.encode('utf-32'))  # (BOM + 4 chars) * 4 bytes == 20 bytes
20
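The same arithmetic can be checked for a few strings, including characters outside ASCII. This is a small sketch with arbitrarily chosen sample strings; it relies on `len()` counting codepoints in Python 3:

```python
samples = ['a', 'abcd', 'h\u00e9llo', '\U0001F600']

for text in samples:
    with_bom = text.encode('utf-32')         # BOM + codepoints
    without_bom = text.encode('utf-32-le')   # codepoints only

    # Every codepoint takes 4 bytes; the BOM adds 4 more.
    assert len(without_bom) == 4 * len(text)
    assert len(with_bom) == 4 * (len(text) + 1)
    print(len(text), len(without_bom), len(with_bom))
```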

So the BOM is the same length in bits as any other character in the encoding?
@BeshalJaenal the BOM is just another codepoint. So in UTF-32 it encodes to 32 bits, just like any other codepoint.
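One way to check this is to encode U+FEFF directly and compare the result with the BOM constants in the standard `codecs` module. A minimal sketch; the variable names are illustrative:

```python
import codecs

bom_le = '\ufeff'.encode('utf-32-le')
bom_be = '\ufeff'.encode('utf-32-be')

# Encoding the codepoint yields exactly the BOM byte sequences.
print(bom_le == codecs.BOM_UTF32_LE)  # True
print(bom_be == codecs.BOM_UTF32_BE)  # True

# Render the little-endian bytes as a bit string, 8 bits per byte.
bits = ''.join(f'{byte:08b}' for byte in bom_le)
print(bits)       # 11111111111111100000000000000000
print(len(bits))  # 32
```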
