Python: different lengths for the same unicode string

Question

I found something really weird about unicode. As I understand it, u"" + "string" produces a unicode object, so why are the lengths different?

    print len(u''+'New York\u200b')
    14
    print type(u''+'New York\u200b')
    <type 'unicode'>
    print len(u'New York\u200b')
    9
    print type(u'New York\u200b')
    <type 'unicode'>

I also tried to get rid of \u200b, which I think is a unicode character:

    text = u'New York\u200b'
    print text.encode('ascii', errors='ignore')
    New York
    text = u''+'New York\u200b'
    print text.encode('ascii', errors='ignore')
    New York\u200b

I got different results here too, and I am really confused! By the way, I am using Python 2.7; is it time to switch to 3.3? Thanks in advance!

In u''+'New York\u200b', 'New York\u200b' is not a unicode literal, so the \u200b escape is not interpreted. This is inconsistent with your second result, though.
– njzk2, Commented Dec 30, 2013 at 18:42

Bakuriu · Accepted Answer · 2013-12-30 19:03:07Z

    >>> (u''+'New York\u200b').encode('utf-8')
    'New York\\u200b'

As you can see, since 'New York\u200b' is not a unicode literal, the \u escape doesn't have any special meaning and is interpreted literally, i.e. as the sequence of ASCII characters \ u 2 0 0 b, hence the string has length 14 (8 characters for "New York" plus 6). Concatenating with u'' only converts the string to unicode; it does not cause a re-interpretation of the contents. Putting the u before the literal makes Python interpret \u200b as an escape, hence as a single character, hence the string has length 9.
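The length difference can be reproduced with a short sketch. It is written for Python 3, where every plain string literal is already unicode, so a bytes literal with a doubled backslash stands in for the Python 2 str literal; the variable names are only illustrative.

```python
# Python 3 sketch: a bytes literal stands in for the Python 2 str literal.
# The backslash is doubled so the six characters \ u 2 0 0 b are stored
# explicitly, which is what Python 2's plain 'New York\u200b' contains.
byte_literal = b'New York\\u200b'
unicode_literal = u'New York\u200b'

# Decoding as ASCII mirrors the implicit conversion that u'' + str
# performs in Python 2; the contents are not re-interpreted.
concatenated = u'' + byte_literal.decode('ascii')

print(len(concatenated))     # 14: "New York" (8) plus \ u 2 0 0 b (6)
print(len(unicode_literal))  # 9: "New York" (8) plus one zero-width space
print(type(concatenated) is type(unicode_literal))  # True: same type
```

Both values have the same type, but only the second one ever contained the U+200B character.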

In your second example:

    text = u''+'New York\u200b'
    print text.encode('ascii', errors='ignore')
    New York\u200b

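Here .encode does not modify the characters in the string; it only converts from unicode to str. All 14 characters, including the literal \ u 2 0 0 b, are already ASCII, so errors='ignore' has nothing to drop.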
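As a rough illustration of why errors='ignore' behaves differently for the two strings, here is a Python 3 sketch. The names invisible and spelled_out are made up for the example, and the bytes literal again stands in for the Python 2 str literal.

```python
invisible = u'New York\u200b'                     # 9 characters, last one is U+200B
spelled_out = b'New York\\u200b'.decode('ascii')  # 14 ASCII characters

# errors='ignore' drops only the characters that ASCII cannot represent.
stripped = invisible.encode('ascii', errors='ignore')
unchanged = spelled_out.encode('ascii', errors='ignore')

print(stripped)   # b'New York'
print(unchanged)  # b'New York\\u200b'
```

Only the first string contains a non-ASCII character, so only the first result gets shorter.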

It's probably clearer if you print the contents of the two strings:

    >>> print(u'New York\u200b')  # note: \u200b interpreted as unicode character
    New York
    >>> print(b'New York\u200b'.decode('ascii'))
    New York\u200b

Or, since the zero-width space is invisible when printed, try a visible character such as code point 9731 (U+2603, the snowman):

    >>> print(u'New York\u2603')
    New York☃
    >>> print(b'New York\u2603'.decode('ascii'))
    New York\u2603

Thanks so much! So u'' only interprets escapes inside its own quotes, not in the str it is concatenated with, and the whole result still ends up with type unicode?
@amstree Yes. When you concatenate two strings, Python does not interpret the escapes. The escapes are interpreted only when creating the string literals. Concatenation operations treat all characters the same. If you want to interpret the contents of a string you should use the unicode-escape encoding. For example: b'\u2603'.decode('unicode-escape') is u'\u2603' (or '☃') while b'\u2603'.decode('ascii') is the string u'\\u2603'. The former is a one-character string; the latter is a 6-character string made of the characters \ u 2 6 0 3.
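The unicode-escape comparison from the comment above can be checked with a small sketch. It targets Python 3 and doubles the backslash so the six bytes are stored explicitly; the names raw, interpreted and verbatim are only illustrative.

```python
raw = b'\\u2603'  # six ASCII bytes: \ u 2 6 0 3

interpreted = raw.decode('unicode-escape')  # escape is processed
verbatim = raw.decode('ascii')              # bytes are taken as-is

print(len(interpreted))          # 1
print(len(verbatim))             # 6
print(interpreted == u'\u2603')  # True
```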

Joy Rê · Accepted Answer · 2013-12-30 18:51:54Z

'New York\u200b' is a non-unicode string of length 14.
(You append it to the u'' string, but the literal itself is not unicode, so its \u escape is never interpreted.)
u'New York\u200b' is a unicode string of length 9.
