Struggling with unicode in Python

Question

I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)

How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.

    i = 0
    filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
    for i in range (len(filenames)):
        pathname = filenames[i]
        fin = open(pathname, 'r')
        with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f:
            f.write(u'File Path|Brand\n')
            lines = fin.read()
            brand_start = lines.find("Brand Title")
            brand_end = lines.find("/>",brand_start)
            brand = lines [brand_start+47:brand_end-2]
            f.write(u'{}|{}\n'.format(pathname[4:35],brand))
    flog.close()

I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.

You should show the full error including the traceback. Apart from anything else, that says which line the error occurred in.
– Daniel Roseman, Commented Apr 20, 2015 at 17:42

Martijn Pieters · Accepted Answer · 2015-04-20 18:23:31Z

You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing it with Unicode here:

    f.write(u'{}|{}\n'.format(pathname[4:35],brand))

brand is a bytestring, interpolated into a Unicode format string. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files:

    with io.open('Assets.log', 'w', encoding='utf-8') as f,\
            io.open(pathname, encoding='utf-8') as fin:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find(u"Brand Title")
        brand_end = lines.find(u"/>", brand_start)
        brand = lines[brand_start + 47:brand_end - 2]
        f.write(u'{}|{}\n'.format(pathname[4:35], brand))
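For comparison, the first option mentioned above (decoding the bytestring explicitly) might look roughly like the sketch below. The sample bytes, the `extract_brand` helper and the path `some/path.xml` are made up for illustration and are not taken from the question's files. The point is only that the bytes are decoded before they are sliced or interpolated into a Unicode format string.

```python
# Hypothetical sample: the UTF-8 bytes a binary-mode read might return.
# The first character encodes to the bytes 0xc5 0x81, the same lead byte
# that the 'ascii' codec complained about in the question.
raw = u'Brand Title="\u0141\xf3d\u017a"/>'.encode('utf-8')


def extract_brand(raw_bytes, encoding='utf-8'):
    """Decode the bytes first, then slice, so the result is Unicode text."""
    text = raw_bytes.decode(encoding)
    start = text.find(u'="') + 2      # first character after the opening quote
    end = text.find(u'"/>', start)    # position of the closing quote
    return text[start:end]


brand = extract_brand(raw)

# Both values are now Unicode, so no implicit ASCII decoding takes place.
line = u'{}|{}\n'.format(u'some/path.xml', brand)
```

Opening the file with io.open(..., encoding='utf-8') does the same decoding at read time, which is why it is the less error-prone choice.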

You also appear to be parsing an XML file by hand; consider using the ElementTree API to extract those values instead. In that case, you'd open the file without io.open(), so that it yields byte strings and the XML parser can correctly decode the data to Unicode values for you.
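As a rough illustration of the ElementTree route, here is a minimal sketch using the standard library's xml.etree.ElementTree. The document structure (an `Assets` root containing `Brand` elements with a `Title` attribute) is an assumption made for the example, since the real files are not shown; the `brand_titles` helper is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are assumed.
# The parser is handed raw bytes and picks the encoding up from the
# XML declaration itself.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<Assets><Brand Title="\u0141\xf3d\u017a"/></Assets>'
).encode('utf-8')


def brand_titles(data):
    """Return the Title attribute of every Brand element as Unicode text."""
    root = ET.fromstring(data)
    return [element.get('Title') for element in root.iter('Brand')]


titles = brand_titles(xml_bytes)
```

For files on disk, ET.parse(pathname) accepts the filename directly and reads the bytes itself, so no separate open() call is needed.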

Thanks, fixed the fundamental problem. One final issue is that it keeps overwriting the file contents, so I just get two lines of "File Path|Brand" and "SYNT0000000000001045-20150331T095311Z|Something Here|". I changed 'w' to 'a' but then the File Path|Brand is repeated on every other line. Suggestions?
@Nick: why not create the file outside of whatever loop you have then?
Also, you are correct. I am already using lxml to parse parts of the XML. This was supposed to be a quick and dirty solution, as I wasn't sure how to resolve this particular scenario with it (lots of similar children in the structure). I'll open a separate thread to get that working properly, once I've resolved my immediate need to get the information out of the files.

Nick · Accepted Answer · 2015-04-21 12:51:53Z

This is my final code, using the guidance from above. It's not pretty, but it solves the problem. I'll look at getting it all working using lxml at a later date (as this is something I've encountered before when working with different, larger xml files):

    import io
    import os
    from glob import glob

    from lxml import etree

    nsmap = {'xmlns': 'thisnamespace'}

    filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

    # Open the log once, outside the loop, so the header is written a single time.
    with io.open('Assets.log', 'w', encoding='utf-8') as f:
        f.write(u'File Path|Series|Brand\n')
        for pathname in filenames:
            # lxml reads the raw bytes and decodes them itself.
            tree = etree.parse(pathname, etree.XMLParser())
            root = tree.getroot()
            # Read the file once as Unicode text for the string-based brand lookup.
            with io.open(pathname, encoding='utf-8') as fin:
                lines = fin.read()
            for info in root.xpath('//somepath'):
                series_x = info.find('./somemorepath')
                series = series_x.get('Asset_Name') if series_x is not None else 'Missing'
                brand_start = lines.find(u"sometext")
                brand_end = lines.find(u"/>", brand_start)
                brand = lines[brand_start:brand_end - 2]
                brand = brand[brand.rfind(u"/") + 1:]
                f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))

Someone will now come along and do it all in one line!
