I want to write some simple Python scripts that can be used unmodified on different Python versions, but I'm having trouble with strings...

    text = get_data()
    phrases = [
        "Soggarth Eogham O'Growney ,克尔・德怀尔",
        "capitis #3 病态上升涨大的繁殖性勃现",
        "IsoldeIsult、第一任威尔士亲王"
    ]
    for item in phrases:
        if item not in text:  # 3.3 ok. 2.7 UnicodeDecodeError
            print ("Expected phrase '" + item + "' not found")

The code above works in 3.3. When I try to run it under 2.7 I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 27: ordinal not in range(128) 

This is easily fixed by changing the first line to

text = get_data().encode('utf-8') 

But then, this does not work on 3.3. Any way to make this work with one version of the source code? Python noob.

  • You can always check sys.version_info.major and only call encode() when it's less than 3. Commented Jul 30, 2013 at 0:25
  • See docs.python.org/dev/howto/pyporting.html and pypi.python.org/pypi/six Commented Jul 30, 2013 at 0:37
  • Also change second line to phrases = [ u"Soggarth Eogham O'Growney ,克尔・德怀尔", u"capitis #3 病态上升涨大的繁殖性勃现", u"IsoldeIsult、第一任威尔士亲王" ]. Commented Jul 30, 2013 at 0:44

1 Answer

It seems that get_data() returns Unicode strings. You get the error because you mix a Unicode string with an 8-bit string (here in the "item not in text" test), which forces a conversion. By default that conversion uses the ASCII codec, and since the data contains non-ASCII characters, it fails.

The best way to get the above code to work is to make sure that all your strings are Unicode, by prefixing them with u"":

phrases = [ u"Soggarth Eogham O'Growney ,克尔・德怀尔", u"capitis #3 病态上升涨大的繁殖性勃现", u"IsoldeIsult、第一任威尔士亲王" ] 
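
Putting the pieces together, a minimal sketch of the whole check might look like the following. The get_data() here is a hypothetical stand-in that returns a Unicode string, since the real function isn't shown in the question. The coding line is there because Python 2 needs a declared source encoding for non-ASCII literals.

```python
# -*- coding: utf-8 -*-
# Sketch only: intended to run unmodified on Python 2.x and Python 3.3 or later.

def get_data():
    # Hypothetical stand-in for the real get_data(); assumed to return Unicode text.
    return u"Soggarth Eogham O'Growney ,克尔・德怀尔 / IsoldeIsult、第一任威尔士亲王"

phrases = [
    u"Soggarth Eogham O'Growney ,克尔・德怀尔",
    u"capitis #3 病态上升涨大的繁殖性勃现",
    u"IsoldeIsult、第一任威尔士亲王",
]

text = get_data()

# Both operands are Unicode, so no implicit ASCII decoding takes place.
missing = [item for item in phrases if item not in text]

# ASCII-only summary; how the missing phrases themselves are reported is up to you.
print("Missing %d phrase(s)" % len(missing))
```

With this stand-in text, only the second phrase ends up in missing.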

However, this will only work in Python 2.x and in Python 3.3 or later. If you need to support Python 3.2 or 3.1, you need a function that converts its argument to Unicode under Python 2 but does nothing under Python 3 (where it already is Unicode).

Such a function is typically called u(), and you can define it like this:

    import sys
    if sys.version < '3':
        import codecs
        def u(x):
            return codecs.unicode_escape_decode(x)[0]
    else:
        def u(x):
            return x
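
As a usage sketch: on Python 2 this u() helper decodes with the unicode_escape codec, so as far as I can tell it is safest to pass it ASCII literals with \uXXXX escapes rather than raw non-ASCII characters. The fragment below is the start of the third phrase from the question, with the "、" written as \u3001. The definition of u() is repeated only so the snippet runs on its own.

```python
import sys

if sys.version < '3':
    import codecs

    def u(x):
        # Python 2: decode a byte string containing \uXXXX escapes into unicode.
        return codecs.unicode_escape_decode(x)[0]
else:
    def u(x):
        # Python 3: string literals are already Unicode, so return them unchanged.
        return x

# \u3001 is the ideographic comma; the literal itself stays pure ASCII.
fragment = u("IsoldeIsult\u3001")

# Illustrative check against a sample string (not the real get_data() output).
found = fragment in u("IsoldeIsult\u3001 and more text")
```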

2 Comments

The 'u' in front of the strings worked for me. I have found that "from __future__ import unicode_literals" at the top of the file had the same effect.
@bluedog: Indeed it does, but it also makes it hard to have non-unicode strings, making it less useful.
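
For comparison, a minimal sketch of the unicode_literals approach mentioned in these comments. The text value is a hypothetical stand-in for whatever get_data() returns, and the coding line is assumed to be needed on Python 2 for the non-ASCII literals.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # must precede all other statements

# Hypothetical stand-in for get_data(), assumed to return Unicode text.
text = "capitis #3 病态上升涨大的繁殖性勃现"

phrases = [
    "capitis #3 病态上升涨大的繁殖性勃现",
    "IsoldeIsult、第一任威尔士亲王",
]

# Every unprefixed literal above is Unicode on both Python 2 and Python 3.
missing = [item for item in phrases if item not in text]

# The trade-off raised in the reply: byte strings now need an explicit b prefix.
raw = b"ascii only"
```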