4

I am using Python, and my code needs to convert a string in which Unicode characters are represented as \u1234 escapes back into the original string.

Here is the string that I got from other code:

\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5 

I need to convert it back to the original string. How can I do that?

3
  • Can you please explain why you want to convert to a string? Because that cannot be done, but you can work around it by treating the unicode string as a unicode string. Commented Jul 7, 2012 at 14:25
  • How do I do that? Imagine someone passes me a variable a = '\u6b22\u8fce\u63d0\u4ea4\u5fae' and asks me to convert it to the original UTF string (Far East characters). Commented Jul 7, 2012 at 14:26
  • Where did that string come from? There are many, many different syntaxes that use \u escapes, and you need to choose the right one to avoid inconsistent results with any other escapes that are in there. JSON is one common possibility, but if that's what you've got you will need to use a JSON decoder rather than unicode-escape which is specific to Python Unicode string literals. Commented Jul 8, 2012 at 8:25
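If the escaped text does come from JSON, as the last comment suggests it might, a JSON decoder can do the conversion. The following is a minimal Python 3 sketch, not something taken from the thread: the helper name `decode_json_escapes` is hypothetical, and it assumes the input contains no raw double quotes or control characters, since the text is wrapped in quotes before parsing.

```python
import json

def decode_json_escapes(escaped):
    # Hypothetical helper: wrap the text in double quotes so that it forms
    # a JSON string literal, then let the JSON parser resolve the escapes.
    # Assumes the input has no unescaped double quotes or control characters.
    return json.loads('"' + escaped + '"')

# A raw string keeps the backslashes literal, as if read from a file or socket.
raw = r'\u6b22\u8fce\u63d0\u4ea4\u5fae'
decoded = decode_json_escapes(raw)
print(decoded)
```

If the whole JSON document is available, parsing that document directly with `json.loads` is likely the safer route, because no manual quoting is needed.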

3 Answers 3

17

I think this is what you want. Your input isn't really a UTF-8 byte string (well, technically it is, but only because ASCII is a subset of UTF-8).

>>> s='\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'
>>> print s.decode('unicode-escape')
欢迎提交微博搜索使用反馈，请直接

FYI, this is UTF-8:

>>> s.decode('unicode-escape').encode('utf8') 

'\xe6\xac\xa2\xe8\xbf\x8e\xe6\x8f\x90\xe4\xba\xa4\xe5\xbe\xae\xe5\x8d\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe4\xbd\xbf\xe7\x94\xa8\xe5\x8f\x8d\xe9\xa6\x88\xef\xbc\x8c\xe8\xaf\xb7\xe7\x9b\xb4\xe6\x8e\xa5'
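The session above is Python 2. For readers on Python 3, where `str` has no `.decode()` method, a rough equivalent might look like the sketch below. It assumes the input is a `str` containing only ASCII characters (the literal backslash-u sequences), and the variable names are illustrative.

```python
# Python 3 sketch: the input is a str holding literal backslash-u sequences.
raw = r'\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a'

# Go through bytes first, because only bytes can be decoded.
# Encoding as ASCII assumes the input contains nothing but ASCII characters.
text = raw.encode('ascii').decode('unicode_escape')

# Encoding the decoded text gives the actual UTF-8 bytes.
utf8_bytes = text.encode('utf-8')

print(text)
print(utf8_bytes)
```

As in the Python 2 version, `unicode_escape` follows the rules for Python string literals, so any other backslash escapes in the input are interpreted as well.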


2 Comments

Isn't there output missing from the second line?
Yes, it was the first line with a u in front. I deleted one but not the other in my edit.
2

If I understand the question, we have a simple byte string, having Unicode escaping in it, or something like that:

a = '\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'

In [122]: a
Out[122]: '\\u6b22\\u8fce\\u63d0\\u4ea4\\u5fae\\u535a\\u641c\\u7d22\\u4f7f\\u7528\\u53cd\\u9988\\uff0c\\u8bf7\\u76f4\\u63a5'

So we need to manually parse the unicode values from the string, using the Unicode code points:

\u6b22 => unichr(0x6b22) # 欢 

or finally:

print "".join([unichr(int('0x'+a[i+2:i+6], 16)) for i in range(0, len(a), 6)])
欢迎提交微博搜索使用反馈，请直接
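The slicing above relies on the string consisting of nothing but six-character \uXXXX groups. When escapes may be mixed with ordinary text, one alternative is a regular-expression substitution. This is a Python 3 sketch of that idea rather than the answer's own method; `ESCAPE` and `unescape_u` are hypothetical names, and the sketch does not combine surrogate pairs or treat a doubled backslash specially.

```python
import re

# Match a literal backslash, the letter u, and exactly four hex digits.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_u(text):
    # Replace each match with the character for that code point
    # (chr in Python 3 plays the role of unichr in Python 2).
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Escapes mixed with ordinary ASCII text, which fixed-width slicing
# would not handle.
mixed = r'ID 42: \u6b22\u8fce!'
print(unescape_u(mixed))
```

Text outside the matched escapes passes through unchanged, so the input does not need to be a multiple of six characters long.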


-1

Mark Pilgrim explains this in his book. Take a look:

http://www.diveintopython.net/xml_processing/unicode.html

>>> s = u"\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5"
>>> print s.encode("utf-8")
欢迎提交微博搜索使用反馈，请直接

1 Comment

The string s that is passed to my code doesn't have u'' in front of it; it's a variable. Try replacing the string literal with a variable b and you will find that your solution can't work syntactically.
