We're running into a problem (which is described http://wiki.python.org/moin/UnicodeDecodeError) -- read the second paragraph '...Paradoxically...'.
Specifically, we're trying to up-convert a string to unicode and we are receiving a UnicodeDecodeError.
Example:
>>> unicode('\xab') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128) But of course, this works without any problems
>>> unicode(u'\xab') u'\xab' Of course, this code is to demonstrate the conversion problem. In our actual code, we are not using string literals and we can cannot just pre-pend the unicode 'u' prefix, but instead we are dealing with strings returned from an os.walk(), and the file name includes the above value. Since we cannot coerce the value to a unicode without calling unicode() constructor, we're not sure how to proceed.
One really horrible hack that occurs is to write our own str2uni() method, something like:
def str2uni(val): r"""brute force coersion of str -> unicode""" try: return unicode(src) except UnicodeDecodeError: pass res = u'' for ch in val: res += unichr(ord(ch)) return res But before we do this -- wanted to see if anyone else had any insight?
UPDATED
I see everyone is getting focused on HOW I got to the example I posted, rather than the result. Sigh -- ok, here's the code that caused me to spend hours reducing the problem to the simplest form I shared above.
for _,_,files in os.walk('/path/to/folder'): for fname in files: filename = unicode(fname) That piece of code tosses a UnicodeDecodeError exception when the filename has the following value '3\xab Floppy (A).link'
To see the error for yourself, do the following:
>>> unicode('3\xab Floppy (A).link') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 1: ordinal not in range(128) UPDATED
I really appreciate everyone trying to help. And I also appreciate that most people make some pretty simple mistakes related to string/unicode handling. But I'd like to underline the reference to the UnicodeDecodeError exception. We are getting this when calling the unicode() constructor!!!
I believe the underlying cause is described in the aforementioned Wiki article http://wiki.python.org/moin/UnicodeDecodeError. Read from the second paragraph on down about how "Paradoxically, a UnicodeDecodeError may happen when encoding...". The Wiki article very accurately describes what we are experiencing -- but while it elaborates on the cuases, it makes no suggestions for resolutions.
As a matter of fact, the third paragraph starts with the following astounding admission "Unlike a similar case with UnicodeEncodeError, such a failure cannot be always avoided...".
Since I am not used to "cant get there from here" information as a developer, I thought it would be interested to cast about on Stack Overflow for the experiences of others.
\xab?os.walk()? I think you're confusing Unicode and UTF-8 (and other encodings)...3\xab Floppy (A).linksubstituting the\xabwith the actual character :)