Compare 2 strings without considering accents in Python [duplicate]

Question

I would like to compare 2 strings and have True if the strings are identical, without considering the accents.

Example : I would like the following code to print 'Bonjour'

if 'séquoia' in 'Mon sequoia est vert':
    print 'Bonjour'

Convert to fully decomposed normal form, remove accents, compare.
– tripleee, Commented Dec 22, 2013 at 13:17
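
The approach described in this comment can be sketched in Python 3 with only the standard library. The helper name strip_accents is made up for illustration. It decomposes the text with NFD and drops combining marks (characters for which unicodedata.combining returns a nonzero value), so spaces, digits, and case are left untouched.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep every character that is not a combining mark.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Normalize both sides so accents in either string are ignored.
if strip_accents('séquoia') in strip_accents('Mon sequoia est vert'):
    print('Bonjour')
```

Applying the helper to both operands is a design choice for this sketch: it makes the test symmetric, so an accented haystack matches an unaccented needle as well.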

vikingosegundo · Accepted Answer · 2013-12-22 13:34:25Z

15

You should use the unidecode function from the Unidecode package:

from unidecode import unidecode

if unidecode(u'séquoia') in 'Mon sequoia est vert':
    print 'Bonjour'

edited Dec 22, 2013 at 13:34

vikingosegundo

answered Dec 22, 2013 at 13:24

Suor

Community · Accepted Answer · 2017-05-23 12:07:58Z

You should take a look at Unidecode. Alternatively, with the standard unicodedata module and this function, you can get a string without accents and then make your comparison:

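import string
import unicodedata

def remove_accents(data):
    # Decompose accented characters, keep only ASCII letters, then lowercase.
    return ''.join(x for x in unicodedata.normalize('NFKD', data)
                   if x in string.ascii_letters).lower()

if remove_accents('séquoia') in 'Mon sequoia est vert':
    # Do something
    pass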

Reference from stackoverflow

This would not work if the word was "séQuoIa" since the remove_accents method makes all of the characters lowercase.
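
One way around the case issue raised in this comment is to fold both strings the same way before comparing. A minimal Python 3 sketch, assuming two hypothetical helpers named fold and contains_ignoring_accents (neither comes from the answers here):

```python
import unicodedata


def fold(text):
    """Strip combining marks, then casefold (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def contains_ignoring_accents(needle, haystack):
    # Apply the same folding to both strings before the substring test.
    return fold(needle) in fold(haystack)


print(contains_ignoring_accents('séQuoIa', 'Mon Sequoia est vert'))  # True
```

Because both operands go through fold, mixed case on either side no longer matters.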

Javier Buzzi · Accepted Answer · 2024-10-28 18:30:42Z

(sorry, late to the party!!)

How about instead, doing this:

>>> unicodedata.normalize('NFKD', 'î ï í ī į ì').encode('ASCII', 'ignore').decode('ascii')
'i i i i i i'

No need to loop over anything. @Maxime Lorant's answer is considerably slower:

>>> import timeit
>>> code = """
import string, unicodedata
def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()
"""
>>> timeit.timeit("remove_accents('séquoia')", setup=code)
3.6028339862823486
>>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup='import unicodedata')
0.7447490692138672

Hint: less is better

Also, I'm sure the unidecode package that @Suor suggested has other advantages, but it is still very slow compared to the native option, which requires no third-party libraries.

>>> timeit.timeit("unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore')", setup="import unicodedata")
0.7662729263305664
>>> timeit.timeit("unidecode.unidecode('séquoia')", setup="import unidecode")
7.489392042160034

Hint: less is better

Putting it all together:

clean_text = unicodedata.normalize('NFKD', 'séquoia').encode('ASCII', 'ignore').decode('ascii')
if clean_text in 'Mon sequoia est vert':
    ...
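
One caveat of the encode-and-ignore approach is that any character without a Unicode decomposition is dropped rather than replaced by a base letter. The Python 3 sketch below illustrates this; the helper name ascii_only is hypothetical and simply wraps the expression used above.

```python
import unicodedata


def ascii_only(text):
    """NFKD-decompose, then drop every non-ASCII code point (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(ascii_only('séquoia'))  # sequoia
print(ascii_only('Søren'))    # Sren   ('ø' has no decomposition, so it is dropped)
print(ascii_only('Straße'))   # Strae  ('ß' is dropped as well)
```

For French text such as the example in the question this is unlikely to matter, but it may be a reason to prefer a transliteration library for other languages.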

I tried this with Python 3.12. If we encode to ASCII at all, we also need to decode the result back to str before using the in operator, because bytes can't be on the left side of in when the right side is a str:
unicodedata.normalize('NFKD', 'î ï í ī į ì Í').encode('ASCII', 'ignore').decode('ASCII')
I think the answer above is for Python 2.x.
@ArpadHorvath--СлаваУкраїні It is for py 2.x, this is over 4 years old :P notice the u in front of the strings
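
The Python 3 behaviour discussed in these comments can be checked with a short sketch: encode returns bytes, and testing bytes against a str with in raises TypeError, so the result has to be decoded first. The variable names are illustrative only.

```python
import unicodedata

haystack = 'Mon sequoia est vert'
as_bytes = unicodedata.normalize('NFKD', 'séquoia').encode('ascii', 'ignore')

try:
    as_bytes in haystack  # bytes on the left, str on the right
    bytes_in_str_ok = True
except TypeError:
    bytes_in_str_ok = False

# Decoding turns the bytes back into a str that works with `in`.
as_text = as_bytes.decode('ascii')

print(bytes_in_str_ok)      # False on Python 3
print(as_text in haystack)  # True
```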
