
Wikipedia tells me that UTF-32 uses 32 bits per character, so why does this give me a length of 64 bits?

>>> from bitstring import Bits
>>> Bits(bytes='a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes='a'.encode('utf-32')).bin)
64

UTF-32 is supposed to be a fixed-length, 4-byte encoding. As I understand it, every character is represented in exactly 32 bits, yet the code above outputs 64. Why?

1 Answer

Encoding to UTF-32 usually includes a byte order mark (BOM), so your output contains two code points encoded to UTF-32, not one. The BOM is usually needed because it lets the decoder know whether the data was encoded in little-endian or big-endian order. It is really just the code point U+FEFF ZERO WIDTH NO-BREAK SPACE, which is encoded to '11111111111111100000000000000000' (little-endian) in your example; the remaining 32 bits are the 'a'.
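To see the two halves separately, here is a small sketch that splits the encoded bytes into the BOM and the character. It uses only the standard library; the variable names are my own, and it assumes (as CPython does) that the plain 'utf-32' codec writes the BOM in the machine's native byte order.

```python
import codecs
import sys

# The generic codec emits a BOM followed by the character itself.
encoded = 'a'.encode('utf-32')

# Each UTF-32 code unit is 4 bytes, so the first 4 bytes are the BOM.
bom, payload = encoded[:4], encoded[4:]

# codecs.BOM_UTF32 is the BOM in the platform's native byte order.
print(bom == codecs.BOM_UTF32)   # True
print(sys.byteorder)             # 'little' or 'big', depending on the machine
print(bom.hex(), payload.hex())  # BOM bytes, then the bytes for 'a'
print(len(encoded) * 8)          # 64
```

On a little-endian machine the BOM bytes come out as ff fe 00 00, which matches the first 32 bits of the output in the question.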

Encode to one of the two endian-specific variants Python provides ('utf-32-le' or 'utf-32-be') to get a single character:

>>> Bits(bytes='a'.encode('utf-32-le')).bin
'01100001000000000000000000000000'
>>> len(Bits(bytes='a'.encode('utf-32-le')).bin)
32

The -le and -be variants let you encode or decode UTF-32 without a BOM, because you explicitly set the byte order.
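As a rough illustration of how the variants behave on decoding, the sketch below round-trips 'a' through both byte orders. The variable names are illustrative only; the codecs and BOM constants are from the standard library.

```python
import codecs

text = 'a'
le = text.encode('utf-32-le')  # no BOM, least significant byte first
be = text.encode('utf-32-be')  # no BOM, most significant byte first

print(le.hex())  # 61000000
print(be.hex())  # 00000061

# With the byte order stated explicitly, no BOM is needed to decode.
assert le.decode('utf-32-le') == text
assert be.decode('utf-32-be') == text

# The generic codec reads a leading BOM to pick the byte order, then drops it.
with_bom = codecs.BOM_UTF32_BE + be
assert with_bom.decode('utf-32') == text

# An endian-specific codec does not strip a BOM: it comes back as U+FEFF.
kept = (codecs.BOM_UTF32_LE + le).decode('utf-32-le')
assert kept == '\ufeff' + text
```

The last case is worth keeping in mind: if data already carries a BOM, decoding it with 'utf-32-le' or 'utf-32-be' leaves a stray U+FEFF at the start of the string.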

Had you encoded more than one character, you'd have noticed that there are always 4 bytes more than the number of characters would require:

>>> len('abcd'.encode('utf-32'))  # (BOM + 4 chars) * 4 bytes == 20 bytes
20
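The same arithmetic can be written as a small helper and checked against the real encoder. This is only a sketch: utf32_length is a hypothetical function of my own, not part of Python, and it relies on len() of a Python 3 str counting code points.

```python
def utf32_length(text, with_bom=True):
    """Expected size in bytes of text encoded as UTF-32 (hypothetical helper)."""
    # Every code point takes 4 bytes; the BOM counts as one extra code point.
    code_points = len(text)
    return 4 * (code_points + (1 if with_bom else 0))

samples = ['a', 'abcd', 'héllo', '😀']
for s in samples:
    # Generic codec: BOM included.
    assert len(s.encode('utf-32')) == utf32_length(s)
    # Endian-specific codec: no BOM.
    assert len(s.encode('utf-32-le')) == utf32_length(s, with_bom=False)

print(utf32_length('abcd'))                  # 20
print(utf32_length('abcd', with_bom=False))  # 16
```

Note that the emoji also takes exactly 4 bytes, which is the fixed-width property the question was asking about.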

2 Comments

So the BOM has the same length in bits as any other character in the character set?
@BeshalJaenal the BOM is just another codepoint. So in UTF-32 it encodes to 32 bits, just like any other codepoint.
