
I'm trying to encode non-ASCII characters in Python using utf-16-le. Here's the relevant snippet of the code:

import os
import sys

def run():
    print sys.getdefaultencoding()
    reload(sys)
    sys.setdefaultencoding('utf-16-le')
    print sys.getdefaultencoding()
    test_dir = unit_test_utils.get_test_dir("utkarsh")
    dir_name_1 = '東京'
    ....
    ....

if __name__ == '__main__':
    run()

When this code is run, this is the error seen:

# /u/bin/python-qs /root/python/tests/abc.py -c /root/test.conf
  File "/root/python/tests/abc.py", line 27
SyntaxError: Non-ASCII character '\xe6' in file /root/python/tests/abc.py on line 27, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

How can this be fixed? I tried adding this line to the beginning of the file, but to no avail:

# -*- coding: utf-16-le -*- 

The error this time around was:

# /u/bin/python-qs /root/python/tests/abc.py -c /root/test.conf
  File "/root/python/tests/abc.py", line 2
    import os import sys ... ... if __name__ == '__main__': run()
                                                                 ^
SyntaxError: invalid syntax

Edit:

Line 27: dir_name_1 = '東京'

  • Can you include a complete example that produces this error? Commented Apr 21, 2016 at 8:27
  • Is your source code written in UTF-16 encoding? Check with file abc.py. Commented Apr 21, 2016 at 8:29
  • Do not use sys.setdefaultencoding(). You are trying to auto-set broken bones there rather than not break your bones in the first place. Read nedbatchelder.com/text/unipain.html instead and handle Unicode properly. Commented Apr 21, 2016 at 8:36
  • Note that Python source code encoding can't handle anything but single-byte and UTF-8 codecs. UTF-16 and UTF-32 are not supported. Commented Apr 21, 2016 at 8:38
  • If you are handling data, there is no need to declare a source code encoding. That is only needed if you need to specify non-ASCII string literals in your code, but you could just use \xhh or \uhhhh escape sequences in those literals instead (see the sketch after this list). A source code encoding declaration won't help with encoding and decoding data in your program. Commented Apr 21, 2016 at 8:40
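To illustrate that last point, here is a minimal Python 2 sketch (it assumes the two characters from the question, 東 U+6771 and 京 U+4EAC): with escape sequences the source file stays pure ASCII and needs no coding declaration.

# Pure-ASCII source: no coding declaration is needed, because the non-ASCII
# characters only appear as escape sequences, never as raw bytes.
dir_name_u = u'\u6771\u4eac'              # unicode string for U+6771 U+4EAC
dir_bytes = '\xe6\x9d\xb1\xe4\xba\xac'    # the same text as UTF-8 encoded bytes

# Both spellings name the same text, so their UTF-16-LE encodings match.
print dir_name_u.encode('utf-16-le') == dir_bytes.decode('utf8').encode('utf-16-le')  # True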

1 Answer


Almost everything is fine in the code you show. Your source file is encoded in UTF-8 (as stated by your comment on the result of the file command), so the line

dir_name_1 = '東京' 

is in fact (since you are using Python 2.x) equivalent to:

dir_name_1 = '\xe6\x9d\xb1\xe4\xba\xac' # utf8 for 東京 

The only problem is that on line 27 (which you did not originally show) you are doing something with that UTF-8 encoded string, probably trying to convert it (explicitly or implicitly) to unicode without specifying any encoding, so ASCII is taken as the default and the error is then expected, since \xe6 is not in the ASCII range. You should explicitly decode the string with dir_name_1.decode('utf8').
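For example (a minimal sketch, assuming the source file is saved as UTF-8 and declares that encoding, and that UTF-16-LE bytes are ultimately what you want):

# -*- coding: utf8 -*-
dir_name_1 = '東京'                              # a Python 2 str of UTF-8 bytes: '\xe6\x9d\xb1\xe4\xba\xac'
dir_name_u = dir_name_1.decode('utf8')           # explicit decode -> unicode object u'\u6771\u4eac'
dir_name_utf16 = dir_name_u.encode('utf-16-le')  # unicode -> UTF-16-LE bytes
print repr(dir_name_u), repr(dir_name_utf16)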


10 Comments

Line 27: dir_name_1 = '東京'. I've updated the post with this.
How do I get the characters UTF-16 encoded, if that is possible? One of the comments says that UTF-16 and UTF-32 are not supported as source encodings.
Can I add them to the file and read them specifying that the contents are UTF-16 encoded?
@Maddy If the file is UTF-8-encoded, you can get UTF-16-LE encoded bytes this way: dir_name_1 = '東京'; utf16_dirname_1 = dir_name_1.decode('utf8').encode('utf-16-le')
@Maddy: I forgot to tell you to declare the utf8 encoding. The Python file must start with # -*- coding: utf8 -*-, or if the first line declares an interpreter (such as #! /usr/bin/env python), the coding line must be the second one.
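For instance, the top of such a file would look like this (just a sketch of the header lines; PEP 263 only honours a coding declaration on line 1 or 2):

#!/usr/bin/env python
# -*- coding: utf8 -*-
# ... rest of the module, which may now contain literals such as '東京'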
