Python Unicode error message

Question

I don't understand this error code. Could anyone help me?

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

This is the code:

import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
    buff = []
    for line in data:
        if separator(line):
            if buff:
                yield ''.join(buff)
                buff[:] = []
        buff.append(line)
    yield ''.join(buff)

def first(seq,default=None):
    """Return the first item from sequence, seq or the default(None) value"""
    for item in seq:
        return item
    return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
    with open(filename,'wb') as file_write:
        r = urllib2.urlopen(datasrc)
        file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 10: break
    doc = etree.XML(item)
    docID = first(doc.xpath('//publication-reference/document-id/doc-number/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    lastName = first(doc.xpath('//addressbook/last-name/text()'))
    firstName = first(doc.xpath('//addressbook/first-name/text()'))
    street = first(doc.xpath('//addressbook/address/street/text()'))
    city = first(doc.xpath('//addressbook/address/city/text()'))
    state = first(doc.xpath('//addressbook/address/state/text()'))
    postcode = first(doc.xpath('//addressbook/address/postcode/text()'))
    country = first(doc.xpath('//addressbook/address/country/text()'))
    print "DocID: {0}\nTitle: {1}\nLast Name: {2}\nFirst Name: {3}\nStreet: {4}\ncity: {5}\nstate: {6}\npostcode: {7}\ncountry: {8}\n".format(docID,title,lastName,firstName,street,city,state,postcode,country)

I got the code from somewhere on the internet and changed only a small part of it: I added the street, city, state, postcode, and country fields.

The XML file contains approximately 2 million lines. Do you think that is the reason?

It means that ASCII can only handle character values below 128, and u'\xE4' is 228, which is larger. Given your tags, are you parsing an XML document? Then you could get away with putting ä in the source.
– Mr Lister, Commented Apr 7, 2013 at 9:51
You'll need to show the code that throws this error. Are you saving a file, concatenating strings, making string comparisons, printing to the console, etc.?
– Martijn Pieters, Commented Apr 7, 2013 at 10:01
Downvoting someone who is clueless is not a good way of helping him. He would post more if he just understood the basics of what's going on, but he doesn't.
– Stefano Borini, Commented Apr 7, 2013 at 10:07

Martijn Pieters · Accepted Answer · 2013-04-07 10:39:09Z

You are parsing XML, and the library already knows how to handle decoding for you. The API returns unicode objects, but you are trying to treat them as byte strings instead.

Where you call ''.format(), the template is a Python byte string rather than a unicode object, so Python has to encode the Unicode values to fit into a byte string. To do so it can only use the default codec, which is ASCII.
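The implicit step can be made visible with a small sketch. The value below is made up for illustration (it is not taken from the patent data), and the explicit .encode('ascii') call stands in for the conversion Python 2 performs behind the scenes when a byte-string template receives a unicode argument:

```python
# Hypothetical value standing in for what lxml's xpath() returns.
last_name = u'M\xe4kinen'

# Python 2 does the equivalent of this implicitly when the format
# template is a byte string; here the encode is spelled out.
try:
    last_name.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# The exception records which codec failed, where, and why.
print(failure.encoding, failure.start, failure.reason)

# A unicode template needs no conversion, so formatting succeeds.
formatted = u"Last Name: {0}".format(last_name)
print(len(formatted))
```

The attributes on the exception (encoding, start, reason) correspond to the codec name, the position, and the trailing explanation in the error message from the question.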

The simple solution is to use a unicode string for the format template instead. Note the u'' string literal:

print u"DocID: {0}\nTitle: {1}\nLast Name: {2}\nFirst Name: {3}\nStreet: {4}\ncity: {5}\nstate: {6}\npostcode: {7}\ncountry: {8}\n".format(docID,title,lastName,firstName,street,city,state,postcode,country)

Python will still have to encode this when printing, but at least now Python can do some auto-detection of your terminal and determine what encoding it needs to use.
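If the auto-detection is not enough, for example when output is piped to a file (Python 2 then leaves sys.stdout.encoding as None and falls back to ASCII), the text can be encoded explicitly. This is only a sketch: output_encoding and encode_for_output are hypothetical helper names, and the UTF-8 fallback is an assumption, not something the original code prescribes.

```python
import sys

def output_encoding(stream=None, fallback='utf-8'):
    """Return the encoding reported by `stream`, or `fallback` if it has none."""
    if stream is None:
        stream = sys.stdout
    return getattr(stream, 'encoding', None) or fallback

def encode_for_output(text, encoding=None):
    """Encode unicode text explicitly; unencodable characters become '?'."""
    return text.encode(encoding or output_encoding(), 'replace')

# Show which encoding the current stdout reports (or the fallback).
print(output_encoding())
```

With 'replace' as the error handler the call never raises, at the cost of losing characters the target encoding cannot represent.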

You may want to read up on Python and Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

Stefano Borini · Accepted Answer · 2013-04-07 10:03:13Z

There's no such thing as plain text. Text always has an encoding, which is the way you represent a given symbol (a letter, a comma, a Japanese kanji) as a series of bytes. The mapping from the symbol's "code" to those bytes is called the encoding.

In Python 2.7 the distinction between encoded text (the str type) and generic, unencoded text (the unicode type) is confusing at best. Python 3 ditched the whole thing: str is always Unicode text, and encoded data lives in a separate bytes type.

In any case, what is happening there is that you are trying to read some text and put it into a string, but this text contains something that cannot be coerced to the ASCII encoding. ASCII only understands characters in the range 0-127, which is the standard set of characters (letters, numbers, symbols you use for programming). One possible extension of ASCII is latin-1 (also known as iso-8859-1), where the range 128-255 maps to Latin characters such as accented a. This encoding has the advantage that you still get one byte == one character. UTF-8 is another extension of ASCII, where you release the constraint one byte == one character and allow some characters to be represented with one byte, some with two, and so on.
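A short sketch of that difference, using the character from the error message; the variable names are only for illustration:

```python
ch = u'\xe4'                         # the character from the error message

code_point = ord(ch)                 # numeric code of the symbol
latin1_bytes = ch.encode('latin-1')  # latin-1: one byte per character
utf8_bytes = ch.encode('utf-8')      # UTF-8: this character needs two bytes

# Plain ASCII characters encode to the same single byte in all three.
same_for_ascii = (u'a'.encode('ascii')
                  == u'a'.encode('latin-1')
                  == u'a'.encode('utf-8'))

print(code_point, len(latin1_bytes), len(utf8_bytes))
```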

How to solve your problem depends on where the error arises. I guess you are parsing a text file whose encoding you don't know, which could be either latin-1 or UTF-8. If so, you have to specify the encoding when opening the file, for example encoding='utf-8'. Note that the built-in open() only accepts an encoding argument in Python 3; in Python 2 use io.open() or codecs.open(). It's hard to say more from what you provide.
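For a plain text file, decoding at open time might look like the sketch below. The file content is invented for the example, and io.open is used because it accepts encoding in both Python 2 and Python 3. An XML parser such as lxml reads the declared encoding itself, so this applies to files you read as text yourself.

```python
import io
import os
import tempfile

# Made-up sample content containing one non-ASCII character.
sample = u'<last-name>M\xe4kinen</last-name>'

fd, path = tempfile.mkstemp(suffix='.xml')
os.close(fd)
try:
    # Write the text as UTF-8 bytes.
    with io.open(path, 'w', encoding='utf-8') as fh:
        fh.write(sample)

    # Reading with the matching encoding returns decoded unicode text.
    with io.open(path, 'r', encoding='utf-8') as fh:
        decoded = fh.read()

    # Reading in binary mode returns the raw, still-encoded bytes.
    with io.open(path, 'rb') as fh:
        raw = fh.read()
finally:
    os.remove(path)

print(len(decoded), len(raw))
```

The byte count is one larger than the character count because the single non-ASCII character occupies two bytes in UTF-8.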

I'm parsing an XML file. At the top of the file it says <?xml version="1.0" encoding="UTF-8"?>, so I assume the XML file is already in UTF-7 encoding? If I need to change anything in my code, where would be the best place to do it?
No, not UTF-7. UTF-8, which is different! Anyway, yes, the XML is encoded that way, and it does contain non-ASCII characters, so you will need the proper codec.

pascalh · Accepted Answer · 2013-04-07 09:51:18Z

The ASCII characters range from 0 (\x00) to 127 (\x7F). Your character (\xE4=228) is bigger than the highest possible value. Therefore you have to change the codec (for example to UTF-8) to be able to encode this value.
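To find out which character is out of range, a check along these lines may help. first_non_ascii is a hypothetical helper and the sample string is invented; it was chosen so the result lines up with "position 2" in the error message.

```python
def first_non_ascii(text):
    """Return (position, character) of the first non-ASCII character, or None."""
    for position, ch in enumerate(text):
        if ord(ch) > 127:  # outside the ASCII range 0..127
            return position, ch
    return None

# Invented sample with the offending character at index 2.
hit = first_non_ascii(u'ab\xe4cd')
print(hit[0], ord(hit[1]))
```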

pascalh Over a year ago

@EdwardOctavianusPakpahan that depends on your current code. If you have u'\xe4'.encode('ascii'), simply change ascii to utf-8.

Gold Skull with Pattern Over a year ago

I'm parsing an XML file that starts with <?xml version="1.0" encoding="UTF-8"?>, so I think it's already in UTF-8?
