How can I use Python strings so that the same code works in 2.6, 2.7, 3.x

Question

I want to write some simple Python scripts that can be used unmodified on different Python versions, but I'm having trouble with strings...

    text = get_data()
    phrases = [
        "Soggarth Eogham O'Growney ,克尔・德怀尔",
        "capitis #3 病态上升涨大的繁殖性勃现",
        "IsoldeIsult、第一任威尔士亲王"
    ]
    for item in phrases:
        if item not in text:  # 3.3 ok. 2.7 UnicodeDecodeError
            print ("Expected phrase '" + item + "' not found")

The code above works in 3.3. When I try to run it under 2.7 I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 27: ordinal not in range(128)

This is easily fixed by changing the first line to

text = get_data().encode('utf-8')

But then, this does not work on 3.3. Any way to make this work with one version of the source code? Python noob.

You can always check sys.version_info.major and only call encode() when it's less than 3.
– martineau, Commented Jul 30, 2013 at 0:25
See docs.python.org/dev/howto/pyporting.html and pypi.python.org/pypi/six
– agf, Commented Jul 30, 2013 at 0:37
Also change second line to phrases = [ u"Soggarth Eogham O'Growney ,克尔・德怀尔", u"capitis #3 病态上升涨大的繁殖性勃现", u"IsoldeIsult、第一任威尔士亲王" ].
– martineau, Commented Jul 30, 2013 at 0:44

Lennart Regebro · Accepted Answer · 2013-07-30 14:02:43Z

It seems that get_data() returns Unicode strings. You get the error because you mix the Unicode string with an 8-bit string (here in the item not in text test), forcing a conversion, which by default is done with the ASCII codec. Since the 8-bit string contains non-ASCII characters, this fails.
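The implicit conversion described above can be made explicit in a few lines. This is only a sketch: it takes the start of the first phrase (with the first CJK character written as an escape), encodes it to UTF-8 bytes, and then decodes those bytes with the ASCII codec by hand, which is roughly what Python 2 attempts behind the scenes. The names raw and failure are made up for the illustration.

```python
# Start of the first phrase; \u514b is the first non-ASCII character in it.
raw = u"Soggarth Eogham O'Growney ,\u514b".encode("utf-8")  # UTF-8 bytes

try:
    # Python 2 does this step implicitly when a byte string meets a Unicode string.
    raw.decode("ascii")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc  # keep a reference; the "as" name is cleared in Python 3

# failure.start is the index of the first byte the ASCII codec rejected.
print(failure)
```

The offending byte and its position match the ones reported in the traceback from the question.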

The best way to get the above code to work is to make sure that all your strings are Unicode, by prefixing the literals with u"":

    phrases = [
        u"Soggarth Eogham O'Growney ,克尔・德怀尔",
        u"capitis #3 病态上升涨大的繁殖性勃现",
        u"IsoldeIsult、第一任威尔士亲王"
    ]

However, this will only work in Python 2.x and in Python 3.3 or later. If you need to support Python 3.2 or 3.1, you need a function that converts a literal to Unicode under Python 2 but does nothing under Python 3 (where it already is Unicode).

Such a function is typically called u(), and you can define it like this:

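    import sys
    if sys.version < '3':
        import codecs
        # Python 2: turn the byte-string literal into a unicode object
        def u(x):
            return codecs.unicode_escape_decode(x)[0]
    else:
        # Python 3: string literals are already Unicode
        def u(x):
            return x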
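A usage sketch for such a u() helper follows. One caveat worth stating: under Python 2, unicode_escape_decode treats any raw non-ASCII byte in the literal as Latin-1, so non-ASCII characters should be written as \u escapes inside the plain literal rather than typed directly. The version test here uses sys.version_info[0] instead of the string comparison above; that is a substitution for the sketch, not part of the answer, and the variable word is hypothetical.

```python
import sys

if sys.version_info[0] < 3:
    import codecs

    def u(x):
        # Python 2: interpret \uXXXX escapes in a byte-string literal
        return codecs.unicode_escape_decode(x)[0]
else:
    def u(x):
        # Python 3: the literal is already text and escapes are already resolved
        return x

# Two CJK characters from the first phrase, written as escapes so the
# literal itself stays pure ASCII on every Python version.
word = u("\u514b\u5c14")

print(len(word))
```

Because the literal passed to u() carries no prefix, the same source line parses on Python 2.6, 2.7 and every 3.x release.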

The 'u' in front of the strings worked for me. I have found that "from __future__ import unicode_literals" at the top of the file had the same effect.
@bluedog: Indeed it does, but it also makes it hard to have non-unicode strings, making it less useful.
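For completeness, here is a sketch of the unicode_literals approach mentioned in the comments. It assumes the file is saved as UTF-8 and that get_data() returns text; since get_data() is not shown in the question, a hypothetical stand-in string is used instead. Byte strings can still be written with the b"" prefix on 2.6 and later, although strings that must be the native str type on both versions remain the awkward case noted above.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # must come before any other statement

phrases = [
    "Soggarth Eogham O'Growney ,克尔・德怀尔",
    "capitis #3 病态上升涨大的繁殖性勃现",
    "IsoldeIsult、第一任威尔士亲王",
]

# Hypothetical stand-in for get_data(); it deliberately omits the second phrase.
text = "Soggarth Eogham O'Growney ,克尔・德怀尔 ... IsoldeIsult、第一任威尔士亲王"

# Every literal above is Unicode on 2.6+ and 3.x, so no implicit decoding occurs.
missing = [item for item in phrases if item not in text]

for item in missing:
    print("Expected phrase '" + item + "' not found")
```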
