3

We're running into a problem (described at http://wiki.python.org/moin/UnicodeDecodeError) -- read the second paragraph, '...Paradoxically...'.

Specifically, we're trying to up-convert a string to unicode and we are receiving a UnicodeDecodeError.

Example:

 >>> unicode('\xab')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)

But of course, this works without any problems:

 >>> unicode(u'\xab')
 u'\xab'

Of course, this code only demonstrates the conversion problem. In our actual code, we are not using string literals and we cannot just prepend the unicode 'u' prefix; instead we are dealing with strings returned from os.walk(), and the file name includes the above value. Since we cannot coerce the value to unicode without calling the unicode() constructor, we're not sure how to proceed.

One really horrible hack that occurs is to write our own str2uni() method, something like:

def str2uni(val):
    r"""brute force coercion of str -> unicode"""
    try:
        return unicode(val)
    except UnicodeDecodeError:
        pass
    # Fall back to mapping each byte to the code point with the same ordinal.
    res = u''
    for ch in val:
        res += unichr(ord(ch))
    return res

But before we do this -- we wanted to see if anyone else had any insight.

UPDATED

I see everyone is getting focused on HOW I got to the example I posted, rather than the result. Sigh -- ok, here's the code that caused me to spend hours reducing the problem to the simplest form I shared above.

for _,_,files in os.walk('/path/to/folder'):
    for fname in files:
        filename = unicode(fname)

That piece of code tosses a UnicodeDecodeError exception when the filename has the following value '3\xab Floppy (A).link'

To see the error for yourself, do the following:

 >>> unicode('3\xab Floppy (A).link')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 1: ordinal not in range(128)

UPDATED

I really appreciate everyone trying to help. And I also appreciate that most people make some pretty simple mistakes related to string/unicode handling. But I'd like to underline the reference to the UnicodeDecodeError exception. We are getting this when calling the unicode() constructor!!!

I believe the underlying cause is described in the aforementioned Wiki article http://wiki.python.org/moin/UnicodeDecodeError. Read from the second paragraph on down about how "Paradoxically, a UnicodeDecodeError may happen when encoding...". The Wiki article very accurately describes what we are experiencing -- but while it elaborates on the causes, it makes no suggestions for resolutions.

As a matter of fact, the third paragraph starts with the following astounding admission "Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided...".

Since I am not used to "can't get there from here" information as a developer, I thought it would be interesting to cast about on Stack Overflow for the experiences of others.

11
  • Can you add the portion of the code where you read the value from os.walk() and get the exception? And also the character giving you the problem, so we can know the encoding? Or is it \xab? Commented Jun 4, 2013 at 12:43
  • How are you calling os.walk()? I think you're confusing Unicode and UTF-8 (and other encodings)... Commented Jun 4, 2013 at 12:44
  • I've updated the question to include the specific call to os.walk(). As for which character is generating the exception -- it is \xab Commented Jun 4, 2013 at 13:25
  • Yes, we can all see it is \xab (in the exception dump), but I would love to know what the name of the file looks like in your file system: 3\xab Floppy (A).link, substituting the \xab with the actual character :) Commented Jun 4, 2013 at 13:32
  • That is the name of the file. The filesystem is FFS on FreeBSD. According to freebsd.org/doc/en_US.ISO8859-1/books/handbook/…, section 24.3.7, "...The FreeBSD fast filesystem (FFS) is 8-bit clean, so it can be used with any single C chars character set. However, character set names are not stored in the filesystem as it is raw 8-bit and does not understand encoding order..." Commented Jun 4, 2013 at 13:37

4 Answers 4

5

I think you're confusing Unicode strings and Unicode encodings (like UTF-8).

os.walk(".") returns the filenames (and directory names etc.) as strings that are encoded in the current codepage. It will silently remove characters that are not present in your current codepage (see this question for a striking example).

Therefore, if your file/directory names contain characters outside of your encoding's range, then you definitely need to use a Unicode string to specify the starting directory, for example by calling os.walk(u"."). Then you don't need to (and shouldn't) call unicode() on the results any longer, because they already are Unicode strings.

If you don't do this, you first need to decode the filenames (as in mystring.decode("cp850")) which will give you a Unicode string:

>>> "\xab".decode("cp850")
u'\xbd'

Then you can encode that into UTF-8 or any other encoding.

>>> _.encode("utf-8")
'\xc2\xbd'
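
The decode and encode steps above can be reproduced in a few lines. This is a minimal sketch written in Python 3 syntax, where `bytes` plays the role of the Python 2 `str` from the question; cp850 is only the example encoding used in this answer, not necessarily the one your filesystem uses.

```python
# Python 3 sketch: bytes here corresponds to the Python 2 str in the question.
raw = b"\xab"

# Decoding with the ASCII codec fails, as in the original traceback.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with an explicit encoding succeeds (cp850 is only an assumption).
text = raw.decode("cp850")

# Once decoded, the text can be re-encoded into any other encoding.
utf8_bytes = text.encode("utf-8")

print(ascii_failed, ascii(text), utf8_bytes)
```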

If you're still confused why unicode("\xab") throws a decoding error, maybe the following explanation helps:

"\xab" is an encoded string. Python has no way of knowing which encoding that is, but before you can convert it to Unicode, it needs to be decoded first. Without any specification from you, unicode() assumes that it is encoded in ASCII, and when it tries to decode it under this assumption, it fails because \xab isn't part of ASCII. So either you need to find out which encoding is being used by your filesystem and call unicode("\xab", encoding="cp850") or whatever, or start with Unicode strings in the first place.


4 Comments

The linked question was an interesting read (about how passing unicode to os.walk() results in a list of unicode results), but unfortunately it did not solve our problem; it just moved the error to within the python2.7/posixpath.py module. Specifically, it reported an error on line 71, in join: path += '/' + b. The exception thrown was once again the UnicodeDecodeError (please notice, it is a DECODE error). And again, I believe this refers to the Wiki article I linked to.
So you called os.walk(u'/path/to/folder'), and removed the call to unicode()? And then which line in your code triggered the new error?
Negative. I was asked to flesh out HOW I received this result from os.walk(), and the code demonstrated it. The code as demonstrated is how we originally ended up with the UnicodeDecodeError exception.
it should not remove any bytes from a filename on POSIX (use Unicode names on Windows if it does it (but it might be just a display issue in cmd.exe)).
4
for fname in files:
    filename = unicode(fname)

The second line will complain if fname is not ASCII. If you want to convert the string to Unicode, instead of unicode(fname) you should do fname.decode('<the encoding here>').

I would suggest an encoding, but you haven't told us which character \xab represents in your .link file name. You can look up the encoding yourself, though, so for now the code keeps a placeholder:

for fname in files:
    filename = fname.decode('<encoding>')

UPDATE: For example, IF the encoding of your filesystem's names is ISO-8859-1, then the \xab char would be "«". To read it into Python you should do:

for fname in files:
    filename = fname.decode('latin1')  # 'latin1' is a synonym for ISO-8859-1
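
To see why the choice of encoding matters, here is a small sketch in Python 3 syntax, where the byte strings coming from the filesystem are `bytes`. The helper name `decode_names` is hypothetical, and both encodings are guesses rather than facts about the asker's filesystem.

```python
def decode_names(raw_names, encoding):
    """Decode a list of byte-string filenames using one explicit encoding."""
    return [name.decode(encoding) for name in raw_names]

# The filename from the question, as raw bytes.
raw_names = [b"3\xab Floppy (A).link"]

as_latin1 = decode_names(raw_names, "latin1")  # byte 0xab -> U+00AB
as_cp850 = decode_names(raw_names, "cp850")    # byte 0xab -> U+00BD

print(as_latin1, as_cp850)
```

The same byte yields « under ISO-8859-1 and ½ under cp850, so a wrong guess produces a valid but incorrect name instead of an error.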

Hope this helps!

3 Comments

We ended up using our str2uni() method as described above (and edited to match our final version)...but the discussion with Paulo Bu was very insightful, and it seemed reasonable to give him the win. Thanks much!
Lol, the victory is actually solving the problem :) I'm glad you managed to make it.
@user590028: The accepted answer should solve your problem as you see it (it does not; there could be bytes that can't be decoded using the filesystem encoding). If you want to reward Paulo Bu for the effort, you could start a bounty.
3

As I understand it your issue is that os.walk(unicode_path) fails to decode some filenames to Unicode. This problem is fixed in Python 3.1+ (see PEP 383: Non-decodable Bytes in System Character Interfaces):

File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs however allow passing arbitrary bytes - whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string.

Windows provides a Unicode API to access the filesystem, so this problem shouldn't occur there.

Python 2.7 (utf-8 filesystem on Linux):

>>> import os
>>> list(os.walk("."))
[('.', [], ['\xc3('])]
>>> list(os.walk(u"."))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/os.py", line 284, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.7/posixpath.py", line 71, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: \
    ordinal not in range(128)

Python 3.3:

>>> import os
>>> list(os.walk(b'.'))
[(b'.', [], [b'\xc3('])]
>>> list(os.walk(u'.'))
[('.', [], ['\udcc3('])]

Your str2uni() function tries (it introduces ambiguous names) to solve the same issue as the "surrogateescape" error handler on Python 3. Use bytestrings for filenames on Python 2 if you are expecting filenames that can't be decoded using sys.getfilesystemencoding().
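
For comparison, this is roughly what the "surrogateescape" handler does on Python 3. The sketch assumes UTF-8 as the filesystem encoding, as in the example session above, and decodes explicitly instead of touching the filesystem.

```python
# Python 3 sketch; UTF-8 stands in for the filesystem encoding (an assumption).
raw = b"\xc3("  # not valid UTF-8: 0xc3 starts a sequence that '(' cannot finish

# The undecodable byte 0xc3 becomes the lone surrogate U+DCC3.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

print(ascii(name), restored == raw)
```

Because the escaped byte lands in a reserved surrogate range, it is not confused with a legitimately decoded character, which is the ambiguity str2uni() introduces.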

Comments

2
'\xab' 

Is a byte, number 171.

u'\xab' 

Is a character, U+00AB Left-pointing double angle quotation mark («).

u'\xab' is a short-hand way of saying u'\u00ab'. It's not the same (not even the same datatype) as the byte '\xab'; it would probably have been clearer to always use the \u syntax in Unicode string literals IMO, but it's too late to fix that now.

To go from bytes to characters is known as a decode operation. To go from characters to bytes is known as an encode operation. For either direction, you need to know which encoding is used to map between the two.
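
A short illustration of both directions, in Python 3 syntax (where `str` is the character type and `bytes` the byte type). The two encodings are chosen only as examples.

```python
char = "\u00ab"  # the character «, U+00AB

# Encode: characters -> bytes. The result depends on the encoding.
latin1_bytes = char.encode("iso-8859-1")
utf8_bytes = char.encode("utf-8")

# Decode: bytes -> characters, using the matching encoding.
round_trip = utf8_bytes.decode("utf-8")

print(latin1_bytes, utf8_bytes, round_trip == char)
```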

>>> unicode('\xab')
UnicodeDecodeError

unicode is a character string, so there is an implicit decode operation when you pass bytes to the unicode() constructor. If you don't tell it which encoding you want you get the default encoding which is often ascii. ASCII doesn't have a meaning for byte 171 so you get an error.

>>> unicode(u'\xab')
u'\xab'

Since u'\xab' (or u'\u00ab') is already a character string, there is no implicit conversion in passing it to the unicode() constructor - you get an unchanged copy.

res = u''
for ch in val:
    res += unichr(ord(ch))
return res

The encoding that maps each input byte to the Unicode character with the same ordinal value is ISO-8859-1. Consequently you could replace this loop with just:

return unicode(val, 'iso-8859-1') 

(However note that if Windows is in the mix, then the encoding you want is probably not that one but the somewhat-similar windows-1252.)
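
The equivalence, and the windows-1252 difference, can be checked directly. This is a sketch in Python 3 syntax, where `chr` replaces the Python 2 `unichr` and iterating over `bytes` yields integers.

```python
# Python 3 sketch of the byte-to-ordinal mapping described above.
all_bytes = bytes(range(256))

# ISO-8859-1 maps every byte to the code point with the same number,
# which is what the unichr(ord(ch)) loop does by hand.
by_codec = all_bytes.decode("iso-8859-1")
by_loop = "".join(chr(b) for b in all_bytes)

# windows-1252 agrees for 0xab but differs in the 0x80-0x9f range.
euro = b"\x80".decode("cp1252")
quote = b"\xab".decode("cp1252")

print(by_codec == by_loop, ascii(euro), ascii(quote))
```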

One really horrible hack that occurs is to write our own str2uni() method

This isn't generally a good idea. UnicodeErrors are Python telling you you've misunderstood something about string types; ignoring that error instead of fixing it at source means you're more likely to hide subtle failures that will bite you later.

filename = unicode(fname) 

So this would be better replaced with: filename = unicode(fname, 'iso-8859-1') if you know your filesystem is using ISO-8859-1 filenames. If your system locales are set up correctly then it should be possible to find out the encoding your filesystem is using, and go straight to that:

filename = unicode(fname, sys.getfilesystemencoding()) 

Though actually if it is set up correctly, you can skip all the encode/decode fuss by asking Python to treat filesystem paths as native Unicode instead of byte strings. You do that by passing a Unicode character string into the os filename interfaces:

for _,_,files in os.walk(u'/path/to/folder'):  # note u'' string
    for fname in files:
        filename = fname  # nothing more to do!
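
In Python 3 the same idea holds: the type of the path you pass in decides the type of the names you get back. The sketch below uses a throwaway temporary directory and an ASCII filename, both assumptions made so that it runs anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Create one file so that os.walk() has something to report.
    open(os.path.join(top, "plain.txt"), "w").close()

    # A text path yields text names.
    text_names = [f for _, _, files in os.walk(top) for f in files]

    # A bytes path yields raw byte names, with no decoding attempted.
    byte_names = [f for _, _, files in os.walk(os.fsencode(top)) for f in files]

print(text_names, byte_names)
```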

PS. The character in 3″ Floppy should really be U+2033 Double Prime, but there is no encoding for that in ISO-8859-1. Better in the long term to use UTF-8 filesystem encoding so you can include any character.

2 Comments

Indeed - in that longer term one would want to be using Python 3, where surrogateescape works around the problem.
