I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)

How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.

i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
for i in range (len(filenames)):
    pathname = filenames[i]
    fin = open(pathname, 'r')
    with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find("Brand Title")
        brand_end = lines.find("/>",brand_start)
        brand = lines [brand_start+47:brand_end-2]
        f.write(u'{}|{}\n'.format(pathname[4:35],brand))
flog.close()

I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.

  • You should show the full error including the traceback. Apart from anything else, that says which line the error occurred in. Commented Apr 20, 2015 at 17:42
  • nedbatchelder.com/text/unipain.html Commented Apr 20, 2015 at 17:51

2 Answers

You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing it with Unicode here:

f.write(u'{}|{}\n'.format(pathname[4:35],brand)) 

brand is a bytestring, interpolated into a Unicode format string. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files:

with io.open('Assets.log', 'w', encoding='utf-8') as f,\
     io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
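For the first option (decoding brand yourself), a minimal sketch might look like the following. It assumes the input files are UTF-8 encoded; the sample bytes and the format_row helper are made up for illustration and are not part of the original code. The explicit .decode('ascii') call reproduces what Python 2 does implicitly when a bytestring is interpolated into a Unicode format string.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def format_row(path, brand_bytes, encoding='utf-8'):
    """Hypothetical helper: decode the raw bytes, then build the log row."""
    brand = brand_bytes.decode(encoding)  # bytes -> Unicode text
    return '{}|{}\n'.format(path, brand)


# Illustrative sample: u'Łódź' encoded as UTF-8 starts with byte 0xc5.
raw_brand = b'\xc5\x81\xc3\xb3d\xc5\xba'

try:
    # Python 2 performs this decode implicitly when mixing str and unicode.
    raw_brand.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the file's real encoding succeeds.
row = format_row('some/path.xml', raw_brand)
```

Decoding once, right where the bytes enter the program, keeps everything after that point in Unicode, which is also what io.open() with an encoding does for you.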

You also appear to be parsing an XML file by hand; you may want to use the ElementTree API to extract those values instead. In that case, don't decode the file yourself with io.open(); hand the parser the raw byte strings (or just the filename) so that it can decode the data to Unicode values for you.
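As a rough sketch of the ElementTree route, using the standard library's xml.etree.ElementTree rather than lxml: the sample document, the Brand element, its Title attribute, and the brand_titles helper are all assumptions made for illustration, since the real structure of the files is not shown in the question.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real element and attribute names
# in the asker's files are not shown in the question.
SAMPLE = (
    u'<?xml version="1.0" encoding="UTF-8"?>\n'
    u'<Assets><Brand Title="\u0141\xf3d\u017a"/></Assets>\n'
).encode('utf-8')


def brand_titles(xml_path):
    """Return the Title attribute of every Brand element."""
    # Passing the path lets the parser read bytes and honour the
    # encoding named in the XML declaration.
    tree = ET.parse(xml_path)
    return [el.get('Title') for el in tree.iter('Brand')]


tmp_dir = tempfile.mkdtemp()
xml_path = os.path.join(tmp_dir, 'sample.xml')
log_path = os.path.join(tmp_dir, 'Assets.log')

# Write the sample as raw bytes, the way it would sit on disk.
with open(xml_path, 'wb') as fh:
    fh.write(SAMPLE)

titles = brand_titles(xml_path)

# Only the output file needs an explicit encoding.
with io.open(log_path, 'w', encoding='utf-8') as f:
    f.write(u'File Path|Brand\n')
    for title in titles:
        f.write(u'{}|{}\n'.format(u'sample.xml', title))

# Read the log back to check what was written.
with io.open(log_path, encoding='utf-8') as f:
    log_text = f.read()
```

The parser handles the input encoding, so the fixed slice offsets and find() calls are no longer needed.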

3 Comments

Thanks, fixed the fundamental problem. One final issue is that it keeps overwriting the file contents, so I just get two lines of "File Path|Brand" and "SYNT0000000000001045-20150331T095311Z|Something Here|". I changed 'w' to 'a' but then the File Path|Brand is repeated on every other line. Suggestions?
@Nick: why not create the file outside of whatever loop you have then?
Also, you are correct. I am already using lxml to parse parts of the XML. This was supposed to be a quick and dirty solution, as I wasn't sure how to handle this particular scenario with lxml (lots of similar children in the structure). I'll open a separate thread to get that working properly once I've resolved my immediate need to get the information out of the files.
0

This is my final code, using the guidance from above. It's not pretty, but it solves the problem. I'll look at getting it all working using lxml at a later date (as this is something I've encountered before when working with different, larger xml files):

import io
import os
from glob import glob
from lxml import etree

nsmap = {'xmlns': 'thisnamespace'}
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

# Open the log once, outside the loop, so the header is written a single time.
with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')
    for pathname in filenames:
        # lxml reads the file itself and decodes it for us.
        parser = etree.XMLParser()
        tree = etree.parse(pathname, parser)
        root = tree.getroot()
        # Read the text once per file; a second fin.read() would return u''.
        with io.open(pathname, encoding='utf-8') as fin:
            lines = fin.read()
        for info in root.xpath('//somepath'):
            series_x = info.find('./somemorepath')
            series = series_x.get('Asset_Name') if series_x is not None else u'Missing'
            brand_start = lines.find(u"sometext")
            brand_end = lines.find(u"/>", brand_start)
            brand = lines[brand_start:brand_end - 2]
            brand = brand[brand.rfind(u"/") + 1:]
            f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))

Someone will now come along and do it all in one line!
