I would like to compare two strings and get True if they are identical when accents are ignored.
Example: I would like the following code to print 'Bonjour':
    if 'séquoia' in 'Mon sequoia est vert':
        print('Bonjour')

You should use the unidecode function from the Unidecode package:
    from unidecode import unidecode

    if unidecode(u'séquoia') in 'Mon sequoia est vert':
        print('Bonjour')

You should take a look at Unidecode. With the module and this method, you can get a string without accents and then make your comparison:
    import string
    import unicodedata

    def remove_accents(data):
        # Decompose accented letters, keep only ASCII letters, then lowercase.
        return ''.join(x for x in unicodedata.normalize('NFKD', data)
                       if x in string.ascii_letters).lower()

    if remove_accents(u'séquoia') in 'Mon sequoia est vert':
        # Do something
        pass

The remove_accents method also makes all of the characters lowercase.

(Sorry, late to the party!)
How about instead, doing this:
    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', 'î ï í ī į ì').encode('ASCII', 'ignore').decode('ascii')
    'i i i i i i'

No need to loop over anything.

@Maxime Lorant's answer is very inefficient:
    >>> import timeit
    >>> code = """
    ... import string, unicodedata
    ... def remove_accents(data):
    ...     return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()
    ... """
    >>> timeit.timeit("remove_accents('séquoia')", setup=code)
    3.6028339862823486
    >>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup='import unicodedata')
    0.7447490692138672

Hint: less is better
Also, I'm sure the package unidecode @Seur suggested has other advantages, but it is still very slow compared to the native option that requires no 3rd party libraries.
    >>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup="import unicodedata")
    0.7662729263305664
    >>> timeit.timeit("unidecode.unidecode('séquoia')", setup="import unidecode")
    7.489392042160034

Hint: less is better
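To repeat this kind of measurement as a script rather than in the REPL, something like the following sketch could be used. It compares only the two standard-library approaches from this thread; the function names and the reduced `number` of runs are my own choices, and the timings will differ from machine to machine.

```python
import string
import timeit
import unicodedata


def remove_accents_loop(data):
    # Character-by-character filter, as in the earlier answer.
    return ''.join(
        x for x in unicodedata.normalize('NFKD', data)
        if x in string.ascii_letters
    ).lower()


def remove_accents_encode(data):
    # Decompose, then let the ASCII codec drop everything non-ASCII.
    return unicodedata.normalize('NFKD', data).encode('ascii', 'ignore').decode('ascii')


# Assumption: 100,000 runs instead of timeit's default of 1,000,000,
# so the script finishes quickly.
number = 100_000
loop_seconds = timeit.timeit(lambda: remove_accents_loop('séquoia'), number=number)
encode_seconds = timeit.timeit(lambda: remove_accents_encode('séquoia'), number=number)

print('loop:   %.3f s' % loop_seconds)
print('encode: %.3f s' % encode_seconds)
```

Passing a callable to `timeit.timeit` avoids building the setup code as a string; the call overhead of the lambda is included in both measurements.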
Putting it all together:
    clean_text = unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore').decode('ascii')
    if clean_text in 'Mon sequoia est vert':
        ...

The .decode('ascii') is needed for the in operator: bytes can't be on the left side of in when the right side is a str.

    unicodedata.normalize('NFKD', 'î ï í ī į ì Í').encode('ASCII', 'ignore').decode('ASCII')

I think the answer above is for Python 2.x: u in front of the strings.
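For Python 3, the pieces from this thread can be wrapped into a small helper. This is only a sketch: `strip_accents` and `contains_ignoring_accents` are names I made up for illustration, and it normalizes both sides of the comparison, which the snippets above do not do.

```python
import unicodedata


def strip_accents(text):
    """Return text with accents removed, keeping ASCII characters only."""
    # NFKD splits an accented letter into its base letter plus combining
    # marks, e.g. 'é' becomes 'e' followed by U+0301.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks (and any
    # other non-ASCII character); decoding turns the bytes back into str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def contains_ignoring_accents(needle, haystack):
    """Hypothetical helper: accent-insensitive `needle in haystack`."""
    return strip_accents(needle) in strip_accents(haystack)


if contains_ignoring_accents('séquoia', 'Mon sequoia est vert'):
    print('Bonjour')
```

One caveat with this approach: characters that have no decomposition, such as 'ø', are not converted to a base letter but dropped entirely by the 'ignore' error handler, so `strip_accents('Søren')` gives `'Sren'`.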