
I'm trying to remove all accents from all coding files in a folder. I already have success in building the list of files; the problem is that when I try to use unicodedata to normalize, I get this error:

Traceback (most recent call last):
  File "/usr/lib/gedit-2/plugins/pythonconsole/console.py", line 336, in __run
    exec command in self.namespace
  File "", line 2, in
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 25: invalid continuation byte

if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py')
    files = set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath, filename))
    for i in range(len(files)):
        f = files.pop()
        os.rename(f, f + '.BACK')
        with open(f, 'w') as File:
            for line in open(f + '.BACK').readlines():
                try:
                    newLine = unicodedata.normalize('NFKD', unicode(line)).encode('ascii', 'ignore')
                    File.write(newLine)
                except UnicodeDecodeError:
                    nERROR += 1
                    print "ERROR n %i - Could not remove from Line: %i" % (nERROR, i)
                    newLine = line
                    File.write(newLine)

2 Answers


It looks like the file might be encoded with the cp1252 codec:

In [18]: print('\xf3'.decode('cp1252'))
ó

unicode(line) is failing because unicode is trying to decode line with the utf-8 codec instead, hence the error UnicodeDecodeError: 'utf8' codec can't decode....
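If you want to see the difference yourself, here is a small self-contained check (not from the question's files; the sample bytes 'c\xf3digo' are made up for illustration). It shows that a lone 0xf3 byte trips the utf-8 decoder with the same kind of error as the traceback, while cp1252 decodes it fine:

# Python 2 sketch -- 'c\xf3digo' is a made-up example string
raw = 'c\xf3digo'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print 'utf-8 failed:', e        # same kind of error as in the question

print repr(raw.decode('cp1252'))    # -> u'c\xf3digo', i.e. "codigo" with an accented o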

You might try decoding line with cp1252 first, then if that fails, try utf-8:

if options.remove_nonascii:
    nERROR = 0
    print _("# Removing all acentuation from coding files in %s") % (options.folder)
    exts = ('.f90', '.f', '.cpp', '.c', '.hpp', '.h', '.py')
    files = set()
    for dirpath, dirnames, filenames in os.walk(options.folder):
        for filename in (f for f in filenames if f.endswith(exts)):
            files.add(os.path.join(dirpath, filename))
    for i, f in enumerate(files):
        os.rename(f, f + '.BACK')
        with open(f, 'w') as fout:
            with open(f + '.BACK', 'r') as fin:
                for line in fin:
                    try:
                        try:
                            line = line.decode('cp1252')
                        except UnicodeDecodeError:
                            line = line.decode('utf-8')
                            # If this still raises a UnicodeDecodeError, let the outer
                            # except block handle it
                        newLine = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore')
                        fout.write(newLine)
                    except UnicodeDecodeError:
                        nERROR += 1
                        print "ERROR n %i - Could not remove from Line: %i" % (nERROR, i)
                        newLine = line
                        fout.write(newLine)
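If you'd rather test the fallback logic on its own before wiring it into the loop, a minimal standalone sketch could look like this (my own helper, not part of the code above; the sample string is made up):

import unicodedata

def to_ascii(line, encodings=('cp1252', 'utf-8')):
    """Try each candidate encoding in turn, then strip accents (Python 2)."""
    for enc in encodings:
        try:
            uline = line.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    else:
        return line                     # nothing worked; keep the line untouched
    return unicodedata.normalize('NFKD', uline).encode('ascii', 'ignore')

print to_ascii('c\xf3digo')             # -> 'codigo'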

By the way,

unicodedata.normalize('NFKD',line).encode('ascii','ignore') 

is a bit dangerous. For example, it removes u'ß' and some quotation marks entirely:

In [23]: unicodedata.normalize('NFKD',u'ß').encode('ascii','ignore')
Out[23]: ''

In [24]: unicodedata.normalize('NFKD',u'‘’“”').encode('ascii','ignore')
Out[24]: ''

If this is a problem, then use the unidecode module:

In [25]: import unidecode

In [28]: print(unidecode.unidecode(u'‘’“”ß'))
''""ss

You might want to specify the encoding when using unicode(line), such as unicode(line, 'utf-8')

If you don't know it, sys.getfilesystemencoding() might be your friend.
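For example (just a sketch of the idea; the sample line is made up, and falling back to cp1252 is borrowed from the other answer rather than something this one spells out):

import sys

encoding = sys.getfilesystemencoding() or 'utf-8'

line = 'c\xf3digo\n'                    # hypothetical raw line read from one of the files
try:
    uline = unicode(line, encoding)
except UnicodeDecodeError:
    uline = unicode(line, 'cp1252')     # fall back to the codec from the other answer
print repr(uline)                       # -> u'c\xf3digo\n'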

