If I do:

    print "\xE2\x82\xAC"
    print len("€")
    print len(u"€")

I get:

    €
    3
    1

But if I do:

    print '\xf0\xa4\xad\xa2'
    print len("𤭢")
    print len(u"𤭢")

I get:

    𤭢
    4
    2

In the second example, len() returned 2 instead of 1 for the one-character unicode string u"𤭢".

Can someone explain to me why this is the case?

1 Answer

Python 2 can use UTF-16 as the internal encoding for unicode objects (a so-called "narrow" build), which means 𤭢 (U+24B62) is stored as two surrogates: D852 DF62. In this case, len returns the number of UTF-16 code units, not the number of Unicode code points.
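As a rough illustration of where D852 DF62 comes from, here is a minimal sketch of the standard UTF-16 surrogate calculation. The helper name to_surrogate_pair is made up for this example and is not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above U+FFFF")
    offset = code_point - 0x10000          # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


# U+24B62 is the character from the question.
high, low = to_surrogate_pair(0x24B62)
print("%04X %04X" % (high, low))  # D852 DF62
```

On a narrow build, these two code units are what len counts.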

Python 2 can also be compiled with UTF-32 as the internal encoding for unicode objects (a so-called "wide" build), in which case most unicode objects take twice as much memory, but len(u'𤭢') == 1.

Since 3.3, Python 3's str objects switch on demand between ISO-8859-1, UCS-2 and UTF-32 storage, depending on the widest character in the string, so you'd never encounter this problem: len('𤭢') == 1.

str in Python 3.0 to 3.2 is the same as unicode in Python 2.
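To find out which kind of build you are running, one option is to look at sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF on a wide build (and on Python 3.3 and later). A small sketch, where is_narrow_build is a name chosen just for this example:

```python
import sys


def is_narrow_build():
    """Return True when unicode strings are stored as UTF-16 code units."""
    return sys.maxunicode == 0xFFFF


# The character from the question, written as an escape so that the
# source file needs no encoding declaration.
sample = u"\U00024B62"
print("narrow build: %s, len: %d" % (is_narrow_build(), len(sample)))
```

The reported length should be 2 on a narrow build and 1 otherwise.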

2 Comments

How can I loop through a unicode string that contains characters encoded like this? Something like u"𤭢𤭢𤭢𤭢𤭢𤭢".
@lessthanl0l: Try something like this: stackoverflow.com/questions/7494064/…
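One possible way to loop over such a string, not necessarily the approach taken in the linked question, is to join surrogate pairs back into code points by hand. This is only a sketch; iter_code_points is a hypothetical helper, and it passes unpaired surrogates through unchanged.

```python
def iter_code_points(text):
    """Yield integer code points, joining UTF-16 surrogate pairs if present.

    Hypothetical helper for illustration; lone surrogates are yielded as-is.
    """
    i = 0
    while i < len(text):
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            following = ord(text[i + 1])
            if 0xDC00 <= following <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (following - 0xDC00)
                i += 2
                continue
        yield unit
        i += 1


# Six copies of the character from the question (U+24B62).
text = u"\U00024B62" * 6
for code_point in iter_code_points(text):
    print("U+%04X" % code_point)
```

On a wide build or on Python 3.3+, the surrogate branch is never taken for this input and each character is yielded directly, so the same loop works on both kinds of build.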
