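Here is a short function that strips diacritics but keeps non-Latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata from the standard library. Letters that have no Unicode decomposition (e.g., "æ" -> "ae") and the German-style transliterations ("ä" -> "ae", "ö" -> "oe", "ü" -> "ue") rely on the parallel strings LATIN and ASCII. Note that the function also lowercases its input.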
Code
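from unicodedata import combining, normalize

# Parallel strings: each character in LATIN maps to the entry at the same position in ASCII.
LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    # Lowercase, replace the outliers, decompose (NFD), then drop the combining marks.
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))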
N.B. The default argument outliers is evaluated only once, when the function is defined, and is not meant to be provided by the caller.
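To see why both mechanisms are needed, here is a small illustrative sketch (not part of the function itself) that inspects how NFD treats two characters. The variable names are arbitrary.

```python
from unicodedata import combining, decomposition, normalize

# "à" (U+00E0) decomposes into a base letter plus a combining grave accent.
decomposed = normalize("NFD", "\u00e0")
code_points = [hex(ord(c)) for c in decomposed]
classes = [combining(c) for c in decomposed]
print(code_points)  # ['0x61', '0x300']
print(classes)      # [0, 230] -> the nonzero class marks the accent to drop

# "æ" (U+00E6) has no decomposition, so NFD leaves it untouched
# and only an explicit translation table can turn it into "ae".
ae_decomposition = decomposition("\u00e6")
ae_unchanged = normalize("NFD", "\u00e6") == "\u00e6"
print(repr(ae_decomposition), ae_unchanged)  # '' True
```

Characters such as "ä" would decompose on their own (to "a"); they appear in the table only because a different replacement ("ae") is wanted.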
Intended usage
As a key to sort a list of strings in a more “natural” order:
sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)
Output:
['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']
If your strings mix text and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.
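Since string_to_pairs() is not shown here, the following is only a rough sketch of what such a composition might look like. The helper natural_key() is a hypothetical stand-in invented for this illustration, not the function referred to above, and the sample strings are made up.

```python
import re
from unicodedata import combining, normalize

LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü"
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"
OUTLIERS = str.maketrans(dict(zip(LATIN.split(), ASCII.split())))

def remove_diacritics(s):
    # Same logic as above, with the table kept in a module-level constant.
    return "".join(c for c in normalize("NFD", s.lower().translate(OUTLIERS)) if not combining(c))

def natural_key(s):
    # Hypothetical helper (an assumption, NOT string_to_pairs()).
    # re.split with a capturing group alternates text and digit runs:
    # even indices are text, odd indices are digit runs.
    parts = re.split(r"(\d+)", s)
    return [int(p) if i % 2 else remove_diacritics(p) for i, p in enumerate(parts)]

names = ["Café 10", "cafe 9", "café 2"]
result = sorted(names, key=natural_key)
print(result)  # ['café 2', 'cafe 9', 'Café 10']
```

Because text and numbers always sit at alternating positions, two keys never compare a string with an integer.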
Tests
To make sure the behavior meets your needs, take a look at the pangrams below:
examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好,世界", "你好,世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    ),
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected
Case-preserving variant
from unicodedata import combining, normalize

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    # Same as above, but without the call to lower().
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))
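As a quick check of how this variant treats capitals, here is a small sketch. The name remove_diacritics_keep_case and the sample words are chosen only for this illustration, and the table is kept in a module-level constant rather than a default argument.

```python
from unicodedata import combining, normalize

LATIN = "ä æ ǽ đ ð ƒ ħ ı ł ø ǿ ö œ ß ŧ ü Ä Æ Ǽ Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö Œ ẞ Ŧ Ü"
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"
OUTLIERS = str.maketrans(dict(zip(LATIN.split(), ASCII.split())))

def remove_diacritics_keep_case(s):
    # Illustrative name; same logic as the case-preserving variant above.
    return "".join(c for c in normalize("NFD", s.translate(OUTLIERS)) if not combining(c))

title = remove_diacritics_keep_case("Łódź")   # 'Lodz'
mixed = remove_diacritics_keep_case("Übung")  # 'UEbung'
upper = remove_diacritics_keep_case("ÜBUNG")  # 'UEBUNG'
print(title, mixed, upper)
```

Note that a capitalized word such as "Übung" comes out as "UEbung", because each uppercase outlier maps to an all-uppercase replacement. That is harmless for all-caps text but may matter if the result is shown to users.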