1

I want to convert Chinese characters to the Unicode escape format, like '\uXXXX', but when I use str.encode('utf-16be'), I get this instead:

b'\xOO\xOO' 

So I wrote some code to do the conversion, shown below:

data="index=索引?"
print(data.encode('UTF-16LE'))

def convert(s):
    returnCode=[]
    temp=''
    for n in s.encode('utf-16be'):
        if temp=='':
            if str.replace(hex(n),'0x','')=='0':
                temp='00'
                continue
            temp+=str.replace(hex(n),'0x','')
        else:
            returnCode.append(temp+str.replace(hex(n),'0x',''))
            temp=''
    return returnCode

print(convert(data))

Can someone suggest how to do this conversion in Python 3.x?

4
  • What is the encoding of the file you define the string in? Commented Nov 26, 2013 at 9:07
  • Not sure what the problem is. UTF-16LE isn't Unicode, but it's what Microsoft calls "Unicode". Describe your goal, not your process. Commented Nov 26, 2013 at 9:09
  • "index=索引?".encode('utf-16be') gives b'\x00i\x00n\x00d\x00e\x00x\x00=}"_\x15\x00?' . What output did you want instead? Commented Nov 26, 2013 at 9:15
  • I want to convert the characters to the format '\uXXXX', like this: index=\u0069\u006e\u0064\u0065\u0078\u003d\u7d22\u5f15\u003f Commented Nov 27, 2013 at 1:36

2 Answers

5

I'm not sure I understand you correctly.

Unicode is like a type. In Python 3, all strings are Unicode, so when you write data = "index=索引?", data is already a Unicode string. If you only want an alternative representation for display, you could use:

def display_unicode(data):
    return "".join(["\\u%s" % hex(ord(l))[2:].zfill(4) for l in data])

>>> data = "index=索引?"
>>> print(display_unicode(data))
\u0069\u006e\u0064\u0065\u0078\u003d\u7d22\u5f15\u003f

Note that the resulting string now contains literal backslashes and hex digits, not the original Unicode characters.
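A slightly more defensive sketch of the same idea, in case the input may contain characters outside the Basic Multilingual Plane (where a four-digit \uXXXX escape is not enough). The function name to_unicode_escapes is hypothetical, and switching to the eight-digit \UXXXXXXXX form for such characters is an assumption about what you would want there:

```python
def to_unicode_escapes(text):
    """Return text with every character written as a \\uXXXX escape.

    Code points above U+FFFF are written as \\UXXXXXXXX (an assumption;
    adjust if your target format expects surrogate pairs instead).
    """
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFFFF:
            # Zero-pad to four hex digits.
            parts.append("\\u{:04x}".format(code_point))
        else:
            # Zero-pad to eight hex digits for non-BMP characters.
            parts.append("\\U{:08x}".format(code_point))
    return "".join(parts)


escaped = to_unicode_escapes("index=索引?")
print(escaped)
```

For the sample string this should print the same escapes as display_unicode; the two only differ for characters above U+FFFF.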

There are also other alternatives:

>>> data.encode('ascii', 'backslashreplace')
b'index=\\u7d22\\u5f15?'
>>> data.encode('unicode_escape')
b'index=\\u7d22\\u5f15?'
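Both calls return bytes and leave ASCII characters unescaped. If you need a str rather than bytes, or want to reverse the escaping later, something like this should work (a small sketch; the variable names are only for illustration):

```python
data = "index=索引?"

# backslashreplace only escapes what the ASCII codec cannot represent.
escaped_bytes = data.encode("ascii", "backslashreplace")

# The result is pure ASCII, so decoding it gives a str with literal backslashes.
escaped_text = escaped_bytes.decode("ascii")
print(escaped_text)

# The unicode_escape codec can turn the escapes back into characters.
restored = escaped_bytes.decode("unicode_escape")
print(restored == data)
```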

3 Comments

OP is almost certainly using Python 3 - see print being used as a function, and a b'' literal. Also, encoding of text files doesn't necessarily follow $LANG - IDEs and text editors often let you set it in their configuration, and have their own defaults.
I use Python 3.3; the default encoding is UTF-8.
Sorry, I didn't read the question correctly. Doesn't data.encode('ascii', 'backslashreplace') do the trick?
1

Try decoding first, like this: s.decode('utf-8').encode('utf-16be')
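Note that in Python 3 a str has no decode method, so only the encode step applies there. If the goal is one \uXXXX escape per UTF-16 code unit, a rough sketch built on encode('utf-16-be') might look like this (utf16_escapes is a hypothetical helper name, and the surrogate-pair output for characters above U+FFFF is an assumption about the desired format):

```python
def utf16_escapes(text):
    """Write each UTF-16 code unit of text as a \\uXXXX escape."""
    raw = text.encode("utf-16-be")
    # Big-endian UTF-16 uses two bytes per code unit, high byte first.
    units = [raw[i:i + 2] for i in range(0, len(raw), 2)]
    # Iterating a 2-byte bytes object yields two ints: (high, low).
    return "".join("\\u{:02x}{:02x}".format(hi, lo) for hi, lo in units)


print(utf16_escapes("index=索引?"))
```

Formatting each byte with {:02x} keeps the leading zeros that plain hex() drops.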

1 Comment

The parens on print imply Python 3.x.
