len() with unicode strings

Question

If I do:

    print "\xE2\x82\xAC"
    print len("€")
    print len(u"€")

I get:

    €
    3
    1

But if I do:

    print '\xf0\xa4\xad\xa2'
    print len("𤭢")
    print len(u"𤭢")

I get:

    𤭢
    4
    2

In the second example, len() returned 2 instead of 1 for the one-character Unicode string u"𤭢".

Can someone explain to me why this is the case?

Karol S · Accepted Answer · 2014-07-19 23:39:34Z

Python 2 can use UTF-16 as the internal encoding for unicode objects (a so-called "narrow" build), which means 𤭢 (U+24B62) is stored as two surrogates: D852 DF62. In this case, len returns the number of UTF-16 code units, not the number of actual Unicode code points.
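To make the surrogate arithmetic concrete, here is a minimal sketch, assuming Python 3 for the demonstration. The helper names surrogate_pair and utf16_units are hypothetical and only illustrate how a code point above U+FFFF splits into two UTF-16 code units.

```python
import struct

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

print([hex(u) for u in surrogate_pair(0x24B62)])      # ['0xd852', '0xdf62']
print([hex(u) for u in utf16_units("\U00024b62")])    # ['0xd852', '0xdf62']
```

The length of the utf16_units list (2 here) is what len reports on a narrow build.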

Python 2 can also be compiled to use UTF-32 for unicode objects (a so-called "wide" build), which means most unicode objects take twice as much memory, but then len(u'𤭢') == 1.
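One way to tell which kind of build is running is to look at sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF otherwise. The function name unicode_build below is a hypothetical helper for illustration, not part of any library.

```python
import sys

def unicode_build():
    """Return 'narrow' when code points above U+FFFF need surrogate pairs."""
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(unicode_build())
print(hex(sys.maxunicode))
```

On Python 3.3 and later the narrow/wide distinction no longer exists, and sys.maxunicode is always 0x10FFFF.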

Since Python 3.3, str objects switch on demand between ISO-8859-1, UCS-2 and UTF-32 (UCS-4) storage, depending on the widest character in the string, so you'd never encounter this problem: len('𤭢') == 1.

str in Python 3.0 to 3.2 behaves the same as unicode in Python 2, so the result there also depends on whether the build is narrow or wide.
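The numbers from the question can be reproduced on Python 3.3 or later by counting code points, UTF-8 bytes, and UTF-16 code units separately. This is a small sketch under that version assumption; counts is a hypothetical helper name.

```python
euro = "\u20ac"      # €
han = "\U00024b62"   # 𤭢

def counts(s):
    """Return (code points, UTF-8 bytes, UTF-16 code units) for s."""
    return (len(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")) // 2)

print(counts(euro))  # (1, 3, 1)
print(counts(han))   # (1, 4, 2)
```

The byte counts 3 and 4 match len of the Python 2 byte strings in the question, and the 2 matches len(u"𤭢") on a narrow build.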

How can I loop through a Unicode string that contains this kind of character? Something like u"𤭢𤭢𤭢𤭢𤭢𤭢".
@lessthanl0l: Try something like this: stackoverflow.com/questions/7494064/…
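As a rough sketch of one possible approach (not necessarily the one in the linked question), a generator can join each high surrogate with the low surrogate that follows it, so iteration yields one item per code point on both narrow and wide builds. The name code_points is hypothetical.

```python
def code_points(text):
    """Yield one string per code point, joining UTF-16 surrogate pairs."""
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        is_high = 0xD800 <= ord(ch) <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            # High surrogate followed by a low surrogate: emit both as one item.
            yield text[i:i + 2]
            i += 2
        else:
            yield ch
            i += 1

s = u"\U00024b62" * 6
print(len(list(code_points(s))))  # 6
```

On a wide build or Python 3.3+, no surrogates are present in s, so the generator simply yields each character unchanged.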
