Removing all non-letter chars from a string with accents in Python

Question

I'm trying to delete all non-letter chars (except white-space) from a string containing accents using Python 3.7. I tried the following:

    import re

    text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
    clean_text = re.sub('[\W_\d]+', ' ', text)
    print(clean_text)

The output is

Андре й Серге евич Арша вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

Why do I get a whitespace after the accented char in my result string? This seems to violate the principle of least surprise. So I tried a different solution:

    text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
    clean_text2 = "".join(c for c in text if c.isalpha() or c == " ")
    print(clean_text2)

The output is

Андрей Сергеевич Аршавин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

This is nearly what I wanted, except that it removes the accents from the chars. I would like to have the following result:

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

Is there a way to remove all non-letter chars from a string, but keep the accents on the chars?

"Why do I get a whitespace after the accented char in my result string?" - because the accent is a separate character: text[5] == '\u0301'. Russian letters don't have accents, unlike French ones, for example.
– ForceBru, commented May 4, 2020 at 21:12
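
The comment's point can be checked with the standard unicodedata module. The sketch below uses a shortened sample word (an assumption for brevity, not the full text from the question) and shows that the stress mark is its own code point, is not a letter according to str.isalpha(), and is matched by \W:

```python
import re
import unicodedata

# Shortened sample word with the stress mark written as an explicit escape.
word = "Андре\u0301й"

# Seven code points: six letters plus the combining accent at index 5.
for index, ch in enumerate(word):
    print(index, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

accent = word[5]
# The accent has category Mn (nonspacing mark), so isalpha() rejects it ...
accent_is_letter = accent.isalpha()
# ... and \W matches it, which is why re.sub replaced it with a space.
accent_matches_W = re.fullmatch(r"\W", accent) is not None
print(accent_is_letter, accent_matches_W)
```

This explains both results in the question: the regex turns the mark into a space, and the isalpha() filter drops it.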

Wiktor Stribiżew · Accepted Answer · 2020-05-04 22:18:56Z

Basic solution for Russian word stress symbols

Russian letters do not have accents. The accent you have in the string shows the word stress, and it is only used in specific kinds of writing, such as textbooks for foreigners or books for children.

The е́ is the letter е followed by the \u0301 char, U+0301 COMBINING ACUTE ACCENT. This one accent diacritic can be excluded from your pattern to get the result you want:

clean_text = re.sub(r'(?:(?!\u0301)[\W\d_])+', ' ', text)

See the Python demo yielding

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

See the regex demo online.
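
A self-contained sketch of the same lookahead pattern, using a shortened sample text (an assumption for brevity) and an extra .strip() that is not in the answer, added here only because trailing punctuation otherwise leaves a trailing space:

```python
import re

# Shortened sample text; \u0301 is the stress mark, \u2014 is the long dash.
text = "Андре\u0301й (род. 29 мая 1981[4]) \u2014 футболист."

# (?!\u0301) refuses to consume the combining acute accent; any other run of
# non-word chars, digits, or underscores is collapsed into a single space.
pattern = re.compile(r"(?:(?!\u0301)[\W\d_])+")
clean_text = pattern.sub(" ", text).strip()
print(clean_text)
```

Only U+0301 is protected here, so any other combining mark in the input would still be replaced by a space.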

Solution supporting all diacritics - PyPi regex module

To keep all diacritic marks, the easiest way is to install the PyPI regex module (with pip install regex) and then use the \p{L} and \p{M} Unicode property classes:

    import regex

    text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."

    print(regex.sub(r'[^\p{L}\p{M}]+', ' ', text))
    # => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

    print(" ".join(regex.findall(r'(?>\p{L}\p{M}*+)+', text)))
    # => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

Here, the [^\p{L}\p{M}]+ regex matches one or more chars other than Unicode letters (\p{L}) and diacritic marks (\p{M}). The other solution, (?>\p{L}\p{M}*+)+ with regex.findall, extracts all letter + diacritic chunks from the text, and " ".join(...) then joins them with a space.

Diacritics support with Python re

Since re does not support \p{...} property classes, you will need to "spell out" the \p{M} class, and you may match any Unicode letter using the [^\W\d_] construct. It makes sense to use the find-all-words-and-then-concatenate approach here rather than re.sub:

    import re

    # text is the sample string defined in the snippets above.

    # Combining marks in the Basic Multilingual Plane, as character-class content.
    combining_marks_bmp = (
        '\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7'
        '\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED'
        '\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D'
        '\u0859-\u085B\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963'
        '\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3'
        '\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75'
        '\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3'
        '\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63'
        '\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7'
        '\u0C00-\u0C03\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63'
        '\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3'
        '\u0D01-\u0D03\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63'
        '\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3'
        '\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EB9\u0EBB\u0EBC\u0EC8-\u0ECD'
        '\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6'
        '\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D'
        '\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD'
        '\u180B-\u180D\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F'
        '\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3'
        '\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF2-\u1CF4\u1CF8\u1CF9\u1DC0-\u1DF5\u1DFC-\u1DFF'
        '\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A'
        '\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881'
        '\uA8B4-\uA8C4\uA8E0-\uA8F1\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5'
        '\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1'
        '\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F'
    )

    # Combining marks outside the BMP, written as UTF-16 surrogate pairs.
    # Note: a Python 3 str holds code points, not UTF-16 units, so these
    # alternatives compile but do not match real astral characters.
    combining_marks_astral = (
        '\uD805[\uDCB0-\uDCC3\uDDAF-\uDDB5\uDDB8-\uDDC0\uDDDC\uDDDD\uDE30-\uDE40\uDEAB-\uDEB7\uDF1D-\uDF2B]|'
        '\uD834[\uDD65-\uDD69\uDD6D-\uDD72\uDD7B-\uDD82\uDD85-\uDD8B\uDDAA-\uDDAD\uDE42-\uDE44]|'
        '\uD804[\uDC00-\uDC02\uDC38-\uDC46\uDC7F-\uDC82\uDCB0-\uDCBA\uDD00-\uDD02\uDD27-\uDD34\uDD73\uDD80-\uDD82\uDDB3-\uDDC0\uDDCA-\uDDCC\uDE2C-\uDE37\uDEDF-\uDEEA\uDF00-\uDF03\uDF3C\uDF3E-\uDF44\uDF47\uDF48\uDF4B-\uDF4D\uDF57\uDF62\uDF63\uDF66-\uDF6C\uDF70-\uDF74]|'
        '\uD81B[\uDF51-\uDF7E\uDF8F-\uDF92]|'
        '\uD81A[\uDEF0-\uDEF4\uDF30-\uDF36]|'
        '\uD82F[\uDC9D\uDC9E]|'
        '\uD800[\uDDFD\uDEE0\uDF76-\uDF7A]|'
        '\uD836[\uDE00-\uDE36\uDE3B-\uDE6C\uDE75\uDE84\uDE9B-\uDE9F\uDEA1-\uDEAF]|'
        '\uD802[\uDE01-\uDE03\uDE05\uDE06\uDE0C-\uDE0F\uDE38-\uDE3A\uDE3F\uDEE5\uDEE6]|'
        '\uD83A[\uDCD0-\uDCD6]|'
        '\uDB40[\uDD00-\uDDEF]'
    )

    # Any Unicode letter: a word char that is neither a digit nor an underscore.
    letter = r'[^\W\d_]'
    pat = re.compile(r'(?:{}|[{}]|{})+'.format(letter, combining_marks_bmp, combining_marks_astral))
    print(" ".join(pat.findall(text)))
    # => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

See the online Python demo
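
For a general solution that needs neither a third-party module nor a hand-written range list, one possible alternative is to classify each character with unicodedata.category and keep only letters (categories L*) and marks (categories M*). This is a swapped-in technique rather than a regex and is not part of the answer above. The function name keep_letters_and_marks and the sample string are hypothetical:

```python
import unicodedata


def keep_letters_and_marks(text):
    """Replace everything except Unicode letters and marks with spaces."""
    pieces = []
    for ch in text:
        # General categories start with "L" for letters and "M" for marks.
        if unicodedata.category(ch)[0] in "LM":
            pieces.append(ch)
        else:
            pieces.append(" ")
    # split() + join() collapses runs of spaces and trims both ends.
    return " ".join("".join(pieces).split())


sample = "Андре\u0301й (род. 29 мая 1981[4]) \u2014 cafe\u0301!"
cleaned = keep_letters_and_marks(sample)
print(cleaned)
```

Which characters count as marks depends on the Unicode database shipped with the running Python version, and a mark that does not follow a letter is kept as well, so this is a sketch and not a drop-in replacement for the patterns above.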

I feel like OP would benefit from checking out the "Combining Diacritical Marks" block, where the accent \u0301 is housed, and evaluating whether other accent characters might be required.
@r.ook Regarding whether other accent characters might be required: there are no accents in the Russian language. The accent mark used here is a rare case of marking word stress in writing. Such marks do not change word meaning; they just show which vowel should be pronounced with greater force. They are usually not written at all, since we know where to stress words. No need for \p{M} here, it would be redundant overhead, UNLESS the user needs to handle other languages where diacritics change meaning.
It is true that my example (from a textbook for foreigners learning Russian) has only a few accent marks to show word stress. However, I would also be interested in a general solution that preserves all "combining diacritical marks" in any kind of text.
@asmaier Do you think the PyPI regex module solution is enough? Or do you want a re solution as well?

user13469682 · Accepted Answer · 2020-05-04 21:27:29Z

Try (?:[^\w\x{301}\s]|[\d_])+
Use \u0301 instead of \x{301} if your regex engine uses that notation (Python's re does).

Or use properties if supported

[^\p{L}\x{0301}\s]+
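
A sketch of how the first pattern might be applied with Python's re, which accepts \u0301 but supports neither \x{...} nor \p{L}. The shortened sample text and the whitespace normalization step are assumptions added for illustration. Because the pattern does not match whitespace, the matches are deleted and the remaining runs of spaces are collapsed afterwards:

```python
import re

# Shortened sample text; \u0301 is the stress mark.
text = "Андре\u0301й (род. 29 мая 1981[4])"

# Match punctuation and symbols (but not the accent or whitespace),
# plus digits and underscores, which \w would otherwise protect.
pattern = re.compile(r"(?:[^\w\u0301\s]|[\d_])+")
without_junk = pattern.sub("", text)

# Collapse the leftover runs of spaces and trim the ends.
clean_text = " ".join(without_junk.split())
print(clean_text)
```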

dmmfll · Accepted Answer · 2020-05-04 22:11:02Z

I want to offer a Pythonic solution that does not involve regular expressions.

It uses the translate method on strings.

1. Python documentation on str.maketrans

2. Python documentation on str.translate

    from string import digits
    import itertools as it
    import unicodedata

    # This creates a special dictionary to pass to the translation method.
    # This will replace all digits and punctuation with an empty string
    translation = str.maketrans(
        dict(
            zip(
                (
                    *digits,
                    *(  # punctuation
                        item
                        for item in set(text)
                        if unicodedata.category(item).startswith("P")
                    ),
                ),
                it.cycle(("",)),
            )
        )
    )
    print(" ".join(text.translate(translation).split()))

OUTPUT:

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

You can choose any character for the substitution. I chose an empty string "" for deletion.
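
The same idea can be written a little more compactly. The sketch below is a variation, not the answer's exact code: it wraps the logic in a hypothetical function strip_digits_and_punctuation and maps each unwanted character to None, which str.translate treats as deletion:

```python
import unicodedata
from string import digits


def strip_digits_and_punctuation(text):
    """Delete ASCII digits and Unicode punctuation, then normalize spaces."""
    # Punctuation categories all start with "P" (Pd, Ps, Pe, Po, ...).
    punctuation = {ch for ch in text if unicodedata.category(ch).startswith("P")}
    # dict.fromkeys maps every key to None, i.e. "delete this character".
    table = str.maketrans(dict.fromkeys(set(digits) | punctuation))
    return " ".join(text.translate(table).split())


sample = "Андре\u0301й (род. 29 мая 1981[4]) \u2014 футболист."
result = strip_digits_and_punctuation(sample)
print(result)
```

Like the original, this only removes ASCII digits and punctuation, so symbols such as + or $ (category S*) and non-ASCII digits would remain in the output.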

Close, but I would like to also remove the — from the text.
I just noticed that, too. I am going to see if there is some version of punctuation that contains extra punctuation characters.
I found a possible hack for identifying unicode punctuation. I found it here: groups.google.com/forum/#!topic/comp.lang.python/5tnXt-o534Y See the edit. I added a translation for all unicode punctuation.
