Python length of unicode string confusion

Question

There's been quite some help around this already, but I am still confused.

I have a unicode string like this:

title = u'😉test'
title_length = len(title)  # 5

But! I need len(title) to be 6. The clients expect it to be 6 because they seem to count in a different way than I do on the backend.

As a workaround I have written this little helper, but I am sure it can be improved (with enough knowledge about encodings) or perhaps it's even wrong.

title_length = len(title) + repr(title).count('\\U') #6

1. Is there a better way of getting the length to be 6? :-)

I assume that I (Python) am counting the number of Unicode characters, which is 5. Are the clients counting the number of bytes?

2. Would my logic break for other Unicode characters, for example ones that need 4 bytes?

Running Python 2.7 ucs4.

When I tried running those two lines, it showed the length as 6.
– ssundarraj, Commented Jun 11, 2015 at 8:46
@ssundarraj: see my answer; you are running a Python 2 UCS2 build. Use Python 3.3 or up, or get yourself a UCS4 build.
– Martijn Pieters, Commented Jun 11, 2015 at 8:55

Martijn Pieters · Accepted Answer · 2015-06-11 09:29:30Z

You have 5 codepoints. One of those codepoints is outside of the Basic Multilingual Plane, which means the UTF-16 encoding for that codepoint has to use two code units (a surrogate pair) for the character.
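As a rough illustration of why this one codepoint needs two code units, the surrogate pair can be worked out by hand. This is only a sketch; the helper name `to_utf16_units` is made up for this example and is not part of any library.

```python
def to_utf16_units(codepoint):
    """Return the UTF-16 code units (as ints) for a single code point."""
    if codepoint < 0x10000:
        # Inside the Basic Multilingual Plane: one code unit is enough.
        return [codepoint]
    # Outside the BMP: split the 20-bit offset into a surrogate pair.
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]

winking_face = 0x1F609  # code point of the emoji in the question's title
print([hex(u) for u in to_utf16_units(winking_face)])  # ['0xd83d', '0xde09']
print(len(to_utf16_units(ord(u't'))))                  # 1
```

So the emoji contributes 2 code units and each of the four letters contributes 1, which gives the 6 the clients expect.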

In other words, the client is relying on an implementation detail, and is doing something wrong. They should be counting codepoints, not code units. There are several platforms where this happens quite regularly; Python 2 UCS2 builds are one such, but Java developers often forget about the difference, as do Windows APIs.

You can encode your text to UTF-16 and divide the number of bytes by two (each UTF-16 code unit is 2 bytes). Pick the utf-16-le or utf-16-be variant to not include a BOM in the length:

title = u'😉test'
len_in_codeunits = len(title.encode('utf-16-le')) // 2
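A minimal sketch of the same idea wrapped in a helper, which also shows why the BOM matters. The function name `utf16_length` is hypothetical, and the title is written with a `\U` escape so the snippet does not depend on the source file encoding.

```python
def utf16_length(text):
    """Count UTF-16 code units in a unicode string."""
    # 'utf-16-le' emits no BOM, and every code unit is exactly 2 bytes.
    return len(text.encode('utf-16-le')) // 2

title = u'\U0001F609test'  # same text as the title in the question

print(utf16_length(title))               # 6
print(len(title.encode('utf-16')) // 2)  # 7: plain 'utf-16' prepends a BOM
```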

If you are using Python 2 (and judging by the u prefix to the string you may well be), take into account that there are 2 different flavours of Python 2. Depending on a build-time configuration switch you'll either have a UCS-2 or a UCS-4 build; the former uses surrogates internally too, and your title value length will be 6 there as well. See "Python returns length of 2 for single Unicode character string".
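If you are unsure which flavour you have, one way to check is `sys.maxunicode`, which is 0xFFFF on a UCS-2 (narrow) build and 0x10FFFF on a UCS-4 (wide) build and on Python 3.3 or later. The helper name `is_narrow_build` below is made up for this sketch.

```python
import sys

def is_narrow_build():
    """True on a Python 2 UCS-2 ("narrow") build, False otherwise."""
    # Narrow builds cap sys.maxunicode at 0xFFFF; wide builds and
    # Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF

title = u'\U0001F609test'
# On a narrow build len() counts UTF-16 code units, otherwise codepoints.
expected_len = 6 if is_narrow_build() else 5
print(is_narrow_build())
print(len(title) == expected_len)
```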

Clients are indeed Java. How did you know they are counting UTF-16 surrogate pairs? Couldn't it be UTF-8 or UTF-32 too? Can I be sure they are always counting 2 code units, or could it be more depending on the codepoint? Your method of counting indeed looks more elegant. :-) Thanks a lot for this great explanation!
The counts would be wildly different if they were counting code units in a different UTF codec (8 in UTF-8 and 5 in UTF-32). Yes, UTF-16 always uses either one or two code units per codepoint; see the Wikipedia link in my answer. Java code can be fixed; see JSR-204 and the codePointCount() method.
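The three counts mentioned in the comment above can be checked with a short sketch like this one (the variable names are only illustrative):

```python
title = u'\U0001F609test'  # same text as the title in the question

utf8_units = len(title.encode('utf-8'))            # 1-byte code units
utf16_units = len(title.encode('utf-16-le')) // 2  # 2-byte code units
utf32_units = len(title.encode('utf-32-le')) // 4  # 4-byte code units

print('%d %d %d' % (utf8_units, utf16_units, utf32_units))  # 8 6 5
```

Only the UTF-16 count matches the 6 that the clients report.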
