Python: how to convert string with \unnnn escapes to Unicode string? [duplicate]

Question

I am using Python, and my code needs to convert a string that represents Unicode characters as \u1234 escapes back into the original string.

Here is the string that I got from other code:

\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5

I need to convert it back to the original string. How can I do that?

Can you please explain why you want to convert to a string? Because that cannot be done, but you can work around it by treating the unicode string as a unicode string.
– C0deH4cker, Commented Jul 7, 2012 at 14:25
How do I do that? Imagine someone passes me a variable a = '\u6b22\u8fce\u63d0\u4ea4\u5fae' and asks me to convert it to the original UTF string (Far East characters).
– Bin Chen, Commented Jul 7, 2012 at 14:26
Where did that string come from? There are many, many different syntaxes that use \u escapes, and you need to choose the right one to avoid inconsistent results with any other escapes that are in there. JSON is one common possibility, but if that's what you've got you will need to use a JSON decoder rather than unicode-escape, which is specific to Python Unicode string literals.
– bobince, Commented Jul 8, 2012 at 8:25
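To illustrate the JSON case mentioned in the comment above, here is a minimal Python 3 sketch. It assumes the text really is the body of a JSON string (something the question does not confirm) and that it contains no unescaped double quotes or raw control characters. The variable names are made up for the example.

```python
import json

# Assumption: the escaped text came out of a JSON document.
# The raw string keeps the backslashes, as if read from a file or socket.
escaped = r'\u6b22\u8fce\u63d0\u4ea4\u5fae'

# Wrap the text in double quotes so it forms a valid JSON string literal,
# then let the JSON decoder apply JSON's own escape rules.
decoded = json.loads('"' + escaped + '"')
print(decoded)

# JSON writes characters outside the Basic Multilingual Plane as a
# surrogate pair; the JSON decoder joins the pair into one character.
emoji = json.loads(r'"\ud83d\ude00"')
print(len(emoji))
```

If the text came from some other syntax (a Java properties file, JavaScript source, and so on), a decoder for that syntax would be the safer choice.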

Mark Tolonen · Accepted Answer · 2012-07-07 16:43:53Z

I think this is what you want. The input isn't really a UTF-8 byte string (well, technically it is, but only because ASCII is a subset of UTF-8).

>>> s='\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'
>>> print s.decode('unicode-escape')
欢迎提交微博搜索使用反馈，请直接

FYI, this is UTF-8:

>>> s.decode('unicode-escape').encode('utf8')

'\xe6\xac\xa2\xe8\xbf\x8e\xe6\x8f\x90\xe4\xba\xa4\xe5\xbe\xae\xe5\x8d\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe4\xbd\xbf\xe7\x94\xa8\xe5\x8f\x8d\xe9\xa6\x88\xef\xbc\x8c\xe8\xaf\xb7\xe7\x9b\xb4\xe6\x8e\xa5'
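The session above is Python 2, where a plain '...' literal is a byte string with a .decode() method. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the input is a str containing only ASCII characters (literal backslashes, not real Chinese characters), otherwise the encode('ascii') step raises UnicodeEncodeError.

```python
# Python 3 sketch. The raw string stands in for data received at runtime:
# it holds literal backslash-u sequences, six characters per escape.
s = r'\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22'

# str has no .decode() in Python 3, so go through bytes first.
# Assumption: s is pure ASCII, so encoding it cannot fail.
text = s.encode('ascii').decode('unicode_escape')
print(text)

# Encoding the decoded text gives the actual UTF-8 bytes.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```

Note that unicode_escape also interprets other Python escapes such as \n and \x41, which may or may not be what you want for data from another source.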

Yes, it was the first line with a u in front. I deleted one but not the other in my edit.

Joey · Accepted Answer · 2012-07-07 15:16:37Z

If I understand the question, we have a plain byte string that contains Unicode escapes, something like this:

a = '\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'
In [122]: a
Out[122]: '\\u6b22\\u8fce\\u63d0\\u4ea4\\u5fae\\u535a\\u641c\\u7d22\\u4f7f\\u7528\\u53cd\\u9988\\uff0c\\u8bf7\\u76f4\\u63a5'

So we need to manually parse the unicode values from the string, using the Unicode code points:

\u6b22 => unichr(0x6b22) # 欢

or finally:

print "".join([unichr(int('0x'+a[i+2:i+6], 16)) for i in range(0, len(a), 6)])
欢迎提交微博搜索使用反馈，请直接
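The slicing above assumes the string consists of nothing but six-character \uXXXX groups. If other text may be mixed in, the same manual idea can be sketched with a regular expression. This is a Python 3 variant (chr instead of unichr), and the helper name decode_u_escapes is hypothetical, not from any library.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    # Replace every backslash-u escape with the character for that code
    # point and leave all other text untouched.
    # Limitations of this sketch: surrogate pairs are not joined, and an
    # escaped backslash in front of a 'u' is not treated specially.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

a = r'\u6b22\u8fce\u63d0\u4ea4\u5fae'
print(decode_u_escapes(a))

# Surrounding plain text survives unchanged.
print(decode_u_escapes(r'id=42: \u6b22\u8fce!'))
```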

Surya Kasturi · Accepted Answer · 2012-07-07 14:33:07Z

Mark Pilgrim explains this in his book Dive Into Python. Take a look:

http://www.diveintopython.net/xml_processing/unicode.html

>>> s = u"\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5"
>>> print s.encode("utf-8")
欢迎提交微博搜索使用反馈，请直接

The string s that is passed to my code doesn't have u'' in front of it; it's a variable. Try replacing the string literal with a variable b and you will find that your solution can't work syntactically.
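The point of this comment can be shown with a small Python 3 sketch: an escape inside a literal is resolved by the parser when the source is compiled, while a string that arrives at runtime still holds real backslashes and has to be converted explicitly. The raw string below is only a stand-in for such runtime data.

```python
# The parser resolves the escapes in a normal literal at compile time.
literal = '\u6b22\u8fce'

# A raw string imitates data received at runtime: the backslashes are real.
runtime = r'\u6b22\u8fce'

print(len(literal))   # 2
print(len(runtime))   # 12

# A literal prefix cannot be applied to a variable, so the value is
# decoded instead (assumes the value is pure ASCII).
converted = runtime.encode('ascii').decode('unicode_escape')
print(converted == literal)
```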
