Python - Python 3.1 can't seem to handle UTF-16 encoded files?

Question

I'm trying to run some code that simply goes through a bunch of files and writes the contents of those that happen to be .txt files into a single output file, removing all the extra whitespace. Here's some simple code that should do the trick:

    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if '.txt' in file:
                f = open(subdir+'/'+file, 'r')
                line = f.readline()
                while line:
                    line2 = line.split()
                    if line2:
                        output_file.write(" ".join(line2)+'\n')
                    line = f.readline()
                f.close()

But instead, I get the following error:

    File "/usr/lib/python3.1/codecs.py", line 300, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 0: unexpected code byte

It turns out these .txt files are all in UTF-16 (according to Firefox, at any rate). I thought Python 3.x was supposed to be able to handle any sort of character encoding?

Best, Georgina

Ok, oneliner: output_file.write(input_file.read().decode('utf-16').replace(u" ", u"").encode('desired encoding'))
– janislaw, Commented Apr 13, 2011 at 10:38

filmor · Accepted Answer · 2011-04-13 05:37:40Z

8

Use open(bla, 'r', encoding="utf-16").
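For illustration, here is a minimal sketch of the loop from the question with the encoding passed explicitly. The function name merge_txt_files, the output_path parameter, and the choice of UTF-8 for the output file are assumptions made for this example, not part of the original code. It also uses endswith('.txt') rather than the substring test, which is a small change in behaviour.

```python
import os


def merge_txt_files(rootdir, output_path, encoding="utf-16"):
    """Write the non-empty lines of every .txt file under rootdir to output_path.

    Runs of whitespace inside each line are collapsed to a single space,
    as in the question's code. output_path should live outside rootdir
    (or not end in .txt), otherwise the walk would pick it up as input.
    """
    # Assumption: the merged output is written as UTF-8.
    with open(output_path, "w", encoding="utf-8") as output_file:
        for subdir, dirs, files in os.walk(rootdir):
            for name in files:
                if not name.endswith(".txt"):
                    continue
                path = os.path.join(subdir, name)
                # The explicit encoding is the actual fix: without it,
                # open() uses the default encoding and fails on the BOM.
                with open(path, "r", encoding=encoding) as f:
                    for line in f:
                        words = line.split()
                        if words:
                            output_file.write(" ".join(words) + "\n")
```

Iterating over the file object replaces the readline() loop, and the with blocks close the files even if decoding fails partway through.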

Georgina Over a year ago

Woops--thanks! Just as you posted this, I discovered this great post: stackoverflow.com/questions/3140010/…

Community · Accepted Answer · 2017-05-23 12:14:16Z

There are various utf-16 encodings.

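utf-16-be   big endian, no BOM
utf-16-le   little endian, no BOM
utf-16      with BOM; encodes in the machine's native byte order (little endian in the session below) and reads the byte order from the BOM when decoding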

Examples:

    Python 3.2 (r32:88452, Feb 20 2011, 11:12:31)
    [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> a = 'a'.encode('utf-16')
    >>> a
    b'\xff\xfea\x00'
    >>> a.decode('utf-16')
    'a'
    >>> a = 'a'.encode('utf-16-le')
    >>> a
    b'a\x00'
    >>> a.decode('utf-16-le')
    'a'
    >>> a = 'a'.encode('utf-16-be')
    >>> a
    b'\x00a'
    >>> a.decode('utf-16-be')
    'a'

You can pass any of these names as the encoding argument to open(), as suggested in @filmor's answer.
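If the files are a mix of encodings, one option is to peek at the first two bytes and check for a UTF-16 byte order mark before choosing a codec. The sketch below is only an illustration: guess_utf16_codec and its default parameter are hypothetical names, and falling back to UTF-8 is an assumption. The byte 0xfe at position 0 in the question's traceback is consistent with the big-endian BOM (b'\xfe\xff').

```python
import codecs


def guess_utf16_codec(path, default="utf-8"):
    """Return 'utf-16' if the file starts with a UTF-16 BOM, else default.

    This is a heuristic only: a UTF-32 little-endian file also begins
    with the bytes FF FE and would be misreported as UTF-16.
    """
    with open(path, "rb") as f:
        head = f.read(2)
    # codecs.BOM_UTF16_LE is b'\xff\xfe', codecs.BOM_UTF16_BE is b'\xfe\xff'.
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        # The plain 'utf-16' codec reads the BOM and picks the byte order.
        return "utf-16"
    return default
```

The result can be passed straight through, for example open(path, 'r', encoding=guess_utf16_codec(path)).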
