I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)
How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The code below is cut down to just the relevant parts; I've removed the rest.
    import codecs
    import os
    from glob import glob

    i = 0
    filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
    for i in range(len(filenames)):
        pathname = filenames[i]
        fin = open(pathname, 'r')  # opened without an explicit encoding
        with codecs.open(('Assets'+'.log'), mode='w', encoding='utf-8') as f:
            f.write(u'File Path|Brand\n')
            lines = fin.read()
            brand_start = lines.find("Brand Title")
            brand_end = lines.find("/>", brand_start)
            brand = lines[brand_start+47:brand_end-2]
            f.write(u'{}|{}\n'.format(pathname[4:35], brand))
        fin.close()

I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.
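For reference, here is a minimal sketch of what reading with an explicit encoding might look like. It assumes the XML files really are UTF-8 (byte 0xc5 is at least consistent with that, since it is the lead byte of characters such as "Ł"). The helper names `read_text` and `write_log` are hypothetical and not part of the original script. The idea is that `io.open` with `encoding='utf-8'` returns decoded text from `read()`, so formatting it into a unicode string should no longer fall back to the ASCII codec.

```python
import io


def read_text(pathname, encoding="utf-8"):
    """Return the file's contents decoded to text rather than raw bytes."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(pathname, mode="r", encoding=encoding) as fin:
        return fin.read()


def write_log(log_path, rows):
    """Write a header plus one 'path|brand' line per (path, brand) pair."""
    # The log is opened once, so earlier rows are not overwritten.
    with io.open(log_path, mode="w", encoding="utf-8") as f:
        f.write(u"File Path|Brand\n")
        for pathname, brand in rows:
            f.write(u"{}|{}\n".format(pathname, brand))
```

In the loop, `lines = read_text(pathname)` would stand in for the `open(pathname, 'r')` / `fin.read()` pair, and the rows could be collected in a list and passed to `write_log` after the loop. If some files are not valid UTF-8, `read()` will still raise a `UnicodeDecodeError`, which would point to a different source encoding.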