How do I properly work with unicode characters in python to keep from getting errors?

Question

I'm working on a Python plugin for Google Quick Search Box, and it's doing some odd things with non-ASCII characters. The code seems to work fine up until I try constructing a string containing the non-ASCII characters (ü has been my test character). I am using the following code snippet for the construction, with new_task as the variable that is being input from GQSB.

the_sig = ("%sapi_key%sauth_token%smethod%sname%sparse%stimeline%s" % (api_secret, api_key, the_token, method, new_task, doParse, timeline))

It's giving me this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

If I am understanding correctly, this is because I am trying to combine a unicode character with an ascii string. Everything I could find told me to declare the encoding at the top with this:

# -*- coding: iso-8859-15 -*-

Which I have. And when I pull the code snippet that constructs the string into a new script, it works just fine. But for some reason, in the context of the rest of the code, it fails every time. The only thing I can think of is that it is because it's inside its own class, but that doesn't make any sense to me.

The full code can be found on GitHub here

Thanks in advance for any help. I am stumped on this one.

Max Shawabkeh · Accepted Answer · 2010-02-10 17:57:18Z

There are a few things you should do to fix this.

Convert all string literals that contain non-ASCII characters to Unicode literals. Example: u'über'.
Do intermediate processing on Unicode. In other words, if you receive an encoded string (no matter the encoding), decode it to Unicode before working on it. Example:
```
s = utf8_string.decode('utf8') + latin1_string.decode('latin1') 
```
When outputting the string or sending it somewhere, encode it with an encoding that your receiver understands. Example: send(s.encode('utf8')).

Complete example:

```
input1 = get_possibly_nonascii_input().decode('iso-8859-1')
input2 = get_possibly_nonascii_input().decode('iso-8859-1')
input3 = u'üvw'
s = u'%s -> %s' % (input3, (input1 + input2).upper())
send_output(s.encode('utf8'))
```
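To see why the decode step matters, here is a minimal sketch that reproduces the failure from the question. Python 2 performs an ASCII decode implicitly when a byte string is mixed with a unicode string; the sketch spells that decode out explicitly so that it behaves the same on Python 2 and 3. The variable names are illustrative and not taken from the plugin.

```python
# UTF-8 bytes for u'\xfc' (the letter u with diaeresis), standing in for
# input that arrives from outside the program.
raw = b'\xc3\xbc'

try:
    # Python 2 does this decode implicitly when str and unicode are mixed.
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the bytes are actually in succeeds.
text = raw.decode('utf-8')
```

After this runs, error_message holds the same "'ascii' codec can't decode byte 0xc3" text as in the question, while text is a proper one-character unicode string.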

Awesome. This worked. I had to decode, then re-encode to UTF-8 to send it to hashlib. Thanks a lot. It looks like it's working now.

Andrey Vlasovskikh · Accepted Answer · 2010-02-10 17:53:32Z

I guess you're using Python 2.x.

The file encoding declaration specifies how string literals are read by the interpreter.

You should handle all strings as unicode values, not str ones. If you read a str from the outside world, you should decode it to unicode explicitly. The same applies to outputting strings: encode them explicitly.

```
# -*- coding: utf-8 -*-

u_dia_str = '\xc3\xbc'     # str
lambda_unicode = u'λ'      # unicode

# input value
u_dia = u_dia_str.decode('utf-8')

sig_unicode = u'%s%s' % (u_dia, lambda_unicode)  # => u'üλ'

# output value
sig_str = sig_unicode.encode('utf-8')  # => '\xc3\xbc\xce\xbb'
```

OK, I decode the input as utf-8, and now I can get past that part. But immediately after that, I hash the string with md5 like this: hashed_sig = hashlib.md5(the_sig).hexdigest(). And now I get the same ascii codec error as before. Is this a limitation of hashlib? Or am I still doing something wrong?
Never mind. Got it. I didn't realize that I had to re-encode. Thanks for the help.
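The re-encoding step mentioned in the comment above might look roughly like the sketch below. The values assigned to api_secret and new_task are hypothetical placeholders, the format string is shortened from the one in the question, and UTF-8 as the output encoding is an assumption about what the receiving service expects.

```python
import hashlib

# Hypothetical placeholder values; in the question these come from the
# plugin's configuration and from GQSB.
api_secret = u'secret'
new_task = b'\xc3\xbcber task'.decode('utf-8')  # decode input on the way in

# Build the signature entirely in unicode (shortened format string).
the_sig = u'%sname%s' % (api_secret, new_task)

# hashlib operates on bytes, so encode on the way out (UTF-8 assumed).
hashed_sig = hashlib.md5(the_sig.encode('utf-8')).hexdigest()
```

Keeping the encode call right at the hashlib boundary means the rest of the code only ever handles unicode values.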

Paul D. Waite · Accepted Answer · 2010-02-10 17:49:18Z

This is a bit beyond my expertise, but I think # -*- coding: iso-8859-15 -*- at the top declares the text encoding that your Python source file is saved in.

Is it really saved in iso-8859-15?
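As a rough illustration of why the actual encoding matters: the byte 0xc3 in the error message suggests that the ü is arriving as UTF-8 rather than ISO-8859-15. The sketch below compares the two encodings of the same character; the variable names are illustrative only.

```python
u_umlaut = u'\xfc'  # the letter u with diaeresis

as_utf8 = u_umlaut.encode('utf-8')          # two bytes, starting with 0xc3
as_latin9 = u_umlaut.encode('iso-8859-15')  # a single byte, 0xfc

# Reading UTF-8 bytes as ISO-8859-15 raises no error, but it silently
# produces two wrong characters instead of one correct one.
misread = as_utf8.decode('iso-8859-15')
```

If the file or the input is really UTF-8, declaring or decoding it as ISO-8859-15 will not raise an error by itself, but the resulting text will be wrong.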
