0

I don't understand this error message. Could anyone help me?

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128) 

This is the code:

    import urllib2, os, zipfile
    from lxml import etree

    def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
        buff = []
        for line in data:
            if separator(line):
                if buff:
                    yield ''.join(buff)
                    buff[:] = []
            buff.append(line)
        yield ''.join(buff)

    def first(seq,default=None):
        """Return the first item from sequence, seq or the default(None) value"""
        for item in seq:
            return item
        return default

    datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
    filename = datasrc.split('/')[-1]

    if not os.path.exists(filename):
        with open(filename,'wb') as file_write:
            r = urllib2.urlopen(datasrc)
            file_write.write(r.read())

    zf = zipfile.ZipFile(filename)
    xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
    assert xml_file is not None

    count = 0
    for item in xmlSplitter(zf.open(xml_file)):
        count += 1
        if count > 10: break
        doc = etree.XML(item)
        docID = first(doc.xpath('//publication-reference/document-id/doc-number/text()'))
        title = first(doc.xpath('//invention-title/text()'))
        lastName = first(doc.xpath('//addressbook/last-name/text()'))
        firstName = first(doc.xpath('//addressbook/first-name/text()'))
        street = first(doc.xpath('//addressbook/address/street/text()'))
        city = first(doc.xpath('//addressbook/address/city/text()'))
        state = first(doc.xpath('//addressbook/address/state/text()'))
        postcode = first(doc.xpath('//addressbook/address/postcode/text()'))
        country = first(doc.xpath('//addressbook/address/country/text()'))
        print "DocID: {0}\nTitle: {1}\nLast Name: {2}\nFirst Name: {3}\nStreet: {4}\ncity: {5}\nstate: {6}\npostcode: {7}\ncountry: {8}\n".format(docID,title,lastName,firstName,street,city,state,postcode,country)

I got the code from somewhere on the internet and changed only a small part of it: I added the street, city, state, postcode, and country fields.

The XML file contains approximately 2 million lines. Do you think that is the reason?

5
  • 1
    It means that ASCII can only handle character values below 128, and u'\xE4' is 228, which is larger. Given your tags, are you parsing an XML document? Then you could get away with putting &#xE4; in the source. Commented Apr 7, 2013 at 9:51
  • Did you mean the source of my XML? Commented Apr 7, 2013 at 9:58
  • You'll need to show the code that throws this error. Are you saving a file, concatenating strings, making string comparisons, printing to the console, etc.? Commented Apr 7, 2013 at 10:01
  • How Do I Stop The Pain? Commented Apr 7, 2013 at 10:04
  • 2
    Downvoting someone who is clueless is not a good way of helping him. He would post more if he just understood the basics of what's going on, but he doesn't. Commented Apr 7, 2013 at 10:07

3 Answers

3

You are parsing XML, and the library already knows how to handle decoding for you. The API returns unicode objects, but you are trying to treat them as byte strings instead.
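As a rough illustration of that point, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml (a swap made only so the snippet has no third-party dependency). The tiny XML document and the name in it are made up for the example, and the snippet assumes Python 3, where parsed text is always Unicode.

```python
import xml.etree.ElementTree as ET

# Hypothetical UTF-8 encoded input (not taken from the patent file).
# \xc3\xa4 is the two-byte UTF-8 form of U+00E4 (a with diaeresis).
raw = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<addressbook><last-name>Kr\xc3\xa4mer</last-name></addressbook>')

# The parser reads the encoding declaration and decodes the bytes itself.
root = ET.fromstring(raw)
last_name = root.findtext("last-name")

# last_name is now decoded text containing the single character U+00E4,
# not the two raw bytes that were in the input.
```

The parser hands back decoded text, so no manual decoding step is needed after parsing.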

Where you call ''.format(), you are using a Python byte string instead of a unicode object, so Python has to encode the Unicode values to fit into a byte string. To do so it can only use a default codec, which is ASCII.

The simple solution is to use a unicode string there instead. Note the u'' string literal:

print u"DocID: {0}\nTitle: {1}\nLast Name: {2}\nFirst Name: {3}\nStreet: {4}\ncity: {5}\nstate: {6}\npostcode: {7}\ncountry: {8}\n".format(docID,title,lastName,firstName,street,city,state,postcode,country) 

Python will still have to encode this when printing, but at least now Python can do some auto-detection of your terminal and determine what encoding it needs to use.
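If you would rather not depend on that auto-detection (for example when output is redirected to a file), one option is to build the text as Unicode and encode it yourself. This is only a sketch: format_record is a hypothetical helper, the field values are invented, and UTF-8 is an assumed output encoding.

```python
def format_record(doc_id, title, last_name):
    """Hypothetical helper: build the report text as Unicode."""
    template = u"DocID: {0}\nTitle: {1}\nLast Name: {2}\n"
    return template.format(doc_id, title, last_name)

# Invented sample values; the last name contains U+00E4.
record = format_record(u"D0000001", u"Sample title", u"Kr\xe4mer")

# Encode explicitly at the output boundary (UTF-8 is an assumption here).
encoded = record.encode("utf-8")
```

The encoded bytes can then be written to a file opened in binary mode, whatever the terminal settings are.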

You may want to read up on Python and Unicode:


3

There's no such thing as plain text. Text always has an encoding, which is the way you represent a given symbol (a letter, a comma, a Japanese kanji) with a series of bytes. The mapping from a symbol's code to those bytes is called the encoding.

In Python 2.7 the distinction between encoded text (str) and generic, unencoded text (unicode) is confusing at best. Python 3 cleaned this up: str is always Unicode text, and encoded data lives in a separate bytes type.

In any case, what is happening there is that you are trying to read some text and put it into a string, but this text contains something that cannot be encoded as ASCII. ASCII only understands characters in the range 0-127, which is the standard set of characters (letters, digits, and the symbols you use for programming). One possible extension of ASCII is Latin-1 (also known as ISO-8859-1), where the range 128-255 maps to Latin characters such as accented a. This encoding has the advantage that you still get one byte == one character. UTF-8 is another extension of ASCII, where you drop the constraint one byte == one character and allow some characters to be represented with one byte, some with two, and so on.

How to solve your problem depends on where it comes in. I guess you are parsing a text file that is encoded in some encoding you don't know, which could be either Latin-1 or UTF-8. If so, you have to open the file with an explicit encoding, for example io.open(path, encoding='utf-8') in Python 2 (the built-in open() only accepts encoding= in Python 3). It's hard to say more from what you provide.
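To make the difference between the three encodings concrete, here is a small sketch that encodes the character from the error message, U+00E4, with each of them. The variable names are only illustrative.

```python
ch = u"\xe4"  # the character from the error message, code point 228

latin1_bytes = ch.encode("latin-1")  # one byte: the code point itself
utf8_bytes = ch.encode("utf-8")      # two bytes in UTF-8

try:
    ch.encode("ascii")               # 228 is outside 0-127
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False                 # this is the error from the question
```

Latin-1 stores the character in a single byte, UTF-8 needs two, and ASCII cannot represent it at all.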
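For the file-reading case, a minimal sketch could look like the following. It assumes the file really is UTF-8, uses io.open (which takes an encoding argument on both Python 2.6+ and Python 3), and writes a throwaway sample file first so that the snippet is self-contained; the path and contents are made up.

```python
import io
import os
import tempfile

# Create a throwaway sample file so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as fh:
    fh.write(u"Kr\xe4mer\n")

# Read it back, telling Python which encoding the bytes are in.
with io.open(path, "r", encoding="utf-8") as fh:
    text = fh.read()

# text is decoded Unicode; on disk the a-umlaut took two bytes.
size_on_disk = os.path.getsize(path)
```

If the guess about the encoding is wrong, decoding typically fails with a UnicodeDecodeError or produces garbled characters, which is itself a useful hint.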

3 Comments

I'm parsing an XML file. At the top of the XML file it says <?xml version="1.0" encoding="UTF-8"?>, so I assume the XML file is already in UTF-7 encoding? If I need to change anything in my code, where would be the best place to do it?
No, not UTF-7. UTF-8, which is different! Anyway, yes, the XML is encoded that way, and it does contain non-ASCII characters, so you will need the proper codec.
I'm sorry, apologies for my typo, it is UTF-8. @MrLister
1

The ASCII characters range from 0 (\x00) to 127 (\x7F). Your character (\xE4=228) is bigger than the highest possible value. Therefore you have to change the codec (for example to UTF-8) to be able to encode this value.
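To see what "position" and "ordinal" in the error message refer to, a small sketch like this lists every character in a string whose code point is above 127. The helper name non_ascii_positions and the sample string are invented for illustration.

```python
def non_ascii_positions(text):
    """Return (index, character, code point) for each non-ASCII character."""
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 127]

sample = u"Kr\xe4mer"          # made-up value with U+00E4 at index 2
hits = non_ascii_positions(sample)

# Encoding with a codec that covers the character succeeds.
utf8_bytes = sample.encode("utf-8")
```

For this sample the only hit is at index 2 with code point 228, which matches the shape of the message in the question.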

2 Comments

@EdwardOctavianusPakpahan that depends on your current code. If you have u'\xe4'.encode('ascii'), simply change ascii to utf-8.
I'm parsing an XML file that starts with <?xml version="1.0" encoding="UTF-8"?>, so I think it's already in UTF-8?
