I don't understand this error code. Could anyone help me?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

This is the code:
    import urllib2, os, zipfile
    from lxml import etree

    def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
        buff = []
        for line in data:
            if separator(line):
                if buff:
                    yield ''.join(buff)
                    buff[:] = []
            buff.append(line)
        yield ''.join(buff)

    def first(seq,default=None):
        """Return the first item from sequence, seq or the default(None) value"""
        for item in seq:
            return item
        return default

    datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
    filename = datasrc.split('/')[-1]

    if not os.path.exists(filename):
        with open(filename,'wb') as file_write:
            r = urllib2.urlopen(datasrc)
            file_write.write(r.read())

    zf = zipfile.ZipFile(filename)
    xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
    assert xml_file is not None

    count = 0
    for item in xmlSplitter(zf.open(xml_file)):
        count += 1
        if count > 10: break
        doc = etree.XML(item)
        docID = first(doc.xpath('//publication-reference/document-id/doc-number/text()'))
        title = first(doc.xpath('//invention-title/text()'))
        lastName = first(doc.xpath('//addressbook/last-name/text()'))
        firstName = first(doc.xpath('//addressbook/first-name/text()'))
        street = first(doc.xpath('//addressbook/address/street/text()'))
        city = first(doc.xpath('//addressbook/address/city/text()'))
        state = first(doc.xpath('//addressbook/address/state/text()'))
        postcode = first(doc.xpath('//addressbook/address/postcode/text()'))
        country = first(doc.xpath('//addressbook/address/country/text()'))
        print "DocID: {0}\nTitle: {1}\nLast Name: {2}\nFirst Name: {3}\nStreet: {4}\ncity: {5}\nstate: {6}\npostcode: {7}\ncountry: {8}\n".format(docID,title,lastName,firstName,street,city,state,postcode,country)

I got the code from somewhere on the internet and changed only a small part of it: I added the street, city, state, postcode, and country fields.
The XML file contains approximately 2 million lines. Do you think that is the reason?
u'\xE4' is 228, which is larger than 127, the highest value the ASCII codec can encode. Given your tags, are you parsing an XML document? Then you could get away with putting ä in the source.
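The size of the file should not matter here; a single non-ASCII character is enough. Here is a minimal sketch of what is most likely happening, assuming the failure comes from a unicode value being implicitly encoded as ASCII (under Python 2, formatting a byte-string template with unicode arguments, as in the print line of the question, does exactly that). The name `last_name` and its value are hypothetical stand-ins for one of the XPath results; the sketch is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical sample value standing in for one of the XPath results.
last_name = u"Kr\xe4mer"

# The offending character has code point 228, outside ASCII's 0..127.
code_point = ord(u"\xe4")
print(code_point)  # 228

# Encoding to ASCII fails in the same way as the error in the question.
error_message = None
try:
    last_name.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)

# One possible fix: keep the template unicode (note the u prefix) and
# encode explicitly to UTF-8 only when the text leaves the program.
template = u"Last Name: {0}"
line = template.format(last_name)   # still a unicode string
utf8_bytes = line.encode("utf-8")   # explicit, lossless encoding
print(repr(utf8_bytes))
```

Applied to the script in the question, that would mean putting a `u` in front of the format string and calling `.encode('utf-8')` on the result before printing. Whether UTF-8 is the right output encoding depends on your terminal, so treat that part as an assumption.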