
I'm trying to remove all accents from all coding files in a folder. I already have success in building the list of files; the problem is that when I try to use unicodedata to normalize, I get this error:

Traceback (most recent call last):
  File "/usr/lib/gedit-2/plugins/pythonconsole/console.py", line 336, in __run
    exec command in self.namespace
  File "", line 2, in
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 25: invalid continuation byte

if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py')
    files = set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath, filename))
    for i in range(len(files)):
        f = files.pop()
        os.rename(f, f + '.BACK')
        with open(f, 'w') as File:
            for line in open(f + '.BACK').readlines():
                try:
                    newLine = unicodedata.normalize('NFKD', unicode(line)).encode('ascii', 'ignore')
                    File.write(newLine)
                except UnicodeDecodeError:
                    nERROR += 1
                    print "ERROR n %i - Could not remove from Line: %i" % (nERROR, i)
                    newLine = line
                    File.write(newLine)

2 Answers


It looks like the file might be encoded with the cp1252 codec:

In [18]: print('\xf3'.decode('cp1252'))
ó

unicode(line) is failing because unicode is trying to decode line with the utf-8 codec instead, hence the error UnicodeDecodeError: 'utf8' codec can't decode....
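If you want to see the difference yourself, here is a small self-contained check (not from the question's files; the sample bytes 'c\xf3digo' are made up for illustration). It shows that a lone 0xf3 byte trips the utf-8 decoder with the same kind of error as the traceback, while cp1252 decodes it fine:

# Python 2 sketch -- 'c\xf3digo' is a made-up example string
raw = 'c\xf3digo'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print 'utf-8 failed:', e        # same kind of error as in the question

print repr(raw.decode('cp1252'))    # -> u'c\xf3digo', i.e. "codigo" with an accented o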

You might try decoding line with cp1252 first, then if that fails, try utf-8:

if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py')
    files = set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath, filename))
    for i, f in enumerate(files):
        os.rename(f, f + '.BACK')
        with open(f, 'w') as fout:
            with open(f + '.BACK', 'r') as fin:
                for line in fin:
                    try:
                        try:
                            line = line.decode('cp1252')
                        except UnicodeDecodeError:
                            line = line.decode('utf-8')
                            # If this still raises a UnicodeDecodeError, let the outer
                            # except block handle it
                        newLine = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore')
                        fout.write(newLine)
                    except UnicodeDecodeError:
                        nERROR += 1
                        print "ERROR n %i - Could not remove from Line: %i" % (nERROR, i)
                        newLine = line
                        fout.write(newLine)
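If you'd rather test the fallback logic on its own before wiring it into the loop, a minimal standalone sketch could look like this (my own helper, not part of the code above; the sample string is made up):

import unicodedata

def to_ascii(line, encodings=('cp1252', 'utf-8')):
    """Try each candidate encoding in turn, then strip accents (Python 2)."""
    for enc in encodings:
        try:
            uline = line.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        return line                     # nothing worked; keep the line untouched
    return unicodedata.normalize('NFKD', uline).encode('ascii', 'ignore')

print to_ascii('c\xf3digo')             # -> 'codigo'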

By the way,

unicodedata.normalize('NFKD',line).encode('ascii','ignore') 

is a bit dangerous. For example, it removes u'ß' and some quotation marks entirely:

In [23]: unicodedata.normalize('NFKD',u'ß').encode('ascii','ignore')
Out[23]: ''

In [24]: unicodedata.normalize('NFKD',u'‘’“”').encode('ascii','ignore')
Out[24]: ''

If this is a problem, then use the unidecode module:

In [25]: import unidecode

In [28]: print(unidecode.unidecode(u'‘’“”ß'))
''""ss

You might want to specify the encoding when using unicode(line), such as unicode(line, 'utf-8')

If you don't know it, sys.getfilesystemencoding() might be your friend.
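For example (just a sketch of the idea; the sample line is made up, and falling back to cp1252 is borrowed from the other answer rather than something this one spells out):

import sys

encoding = sys.getfilesystemencoding() or 'utf-8'

line = 'c\xf3digo\n'                    # hypothetical raw line read from one of the files
try:
    uline = unicode(line, encoding)
except UnicodeDecodeError:
    uline = unicode(line, 'cp1252')     # fall back to the codec from the other answer
print repr(uline)                       # -> u'c\xf3digo\n'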

