I'm trying to delete all non-letter chars (except white-space) from a string containing accents using Python 3.7. I tried the following:

    import re

    text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
    clean_text = re.sub('[\W_\d]+', ' ', text)
    print(clean_text)

The output is

Андре й Серге евич Арша вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России 

Why do I get a whitespace after the accented char in my result string? This seems to violate the principle of least surprise. So I tried a different solution

    text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
    clean_text2 = "".join(c for c in text if c.isalpha() or c == " ")
    print(clean_text2)

The output is

Андрей Сергеевич Аршавин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России 

This is nearly what I wanted, except that it removes the accents from the chars. I would like to have the following result:

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России 

Is there a way to remove all non-letter chars from a string, but keep the accents on the chars?

  • "Why do I get a whitespace after the accented char in my result string?" Because the accent is a separate character: text[5] == '\u0301'. Russian, unlike French for example, doesn't have accented letters. Commented May 4, 2020 at 21:12
  • try: re.sub(r'[^\pLе́]+', ' ', text) Commented May 4, 2020 at 21:20

3 Answers

Basic solution for Russian word stress symbols

Russian letters do not have accents. The accent in your string marks word stress and appears only in special kinds of text, such as textbooks for foreigners or books for children.

The е́ is the letter е followed by the \u0301 char (U+0301 COMBINING ACUTE ACCENT). Since this is the only diacritic involved, it can be excluded from what your pattern matches to get the result you want:

clean_text = re.sub(r'(?:(?!\u0301)[\W\d_])+', ' ', text) 

This yields

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России 
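
To see why the stress mark has to be excluded explicitly, it may help to inspect the characters directly. The following is a minimal sketch using only the standard library; the names word and stress are illustrative and not part of the original post.

```python
import re
import unicodedata

# "Андре́й" with the stress mark written as an explicit escape
word = "Андре\u0301й"

# List every code point with its general category and Unicode name.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

stress = "\u0301"
print(stress.isalpha())                   # False: a combining mark is not a letter
print(bool(re.fullmatch(r"\W", stress)))  # True: so [\W_\d]+ replaces it with a space
```

The mark has the general category Mn (nonspacing mark), which is why both \W and the isalpha() filter treat it as a non-letter.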

Solution supporting all diacritics - PyPI regex module

To keep all diacritic marks, the easiest way is to install the PyPI regex module (with pip install regex) and then use the \p{L} and \p{M} Unicode property classes:

    import regex

    text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."

    print(regex.sub(r'[^\p{L}\p{M}]+', ' ', text))
    # => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

    print(" ".join(regex.findall(r'(?>\p{L}\p{M}*+)+', text)))
    # => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

Here, the [^\p{L}\p{M}]+ regex matches one or more chars other than Unicode letters (\p{L}) and diacritic marks (\p{M}). The other solution, (?>\p{L}\p{M}*+)+ with regex.findall, extracts all letter + diacritic chunks from the text, and " ".join(...) then concatenates them with a space.

Diacritics support with Python re

You will need to "spell out" the \p{M} class and you may match any Unicode letter using [^\W\d_] construct. It makes sense to use the find-all-words-and-then-concatenate approach here rather than re.sub:

    import re

    # Combining marks in the Basic Multilingual Plane, written as character-class ranges.
    combining_marks_bmp = '\u0300-\u036F\u0483-\u0489\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06DC\u06DF-\u06E4\u06E7\u06E8\u06EA-\u06ED\u0711\u0730-\u074A\u07A6-\u07B0\u07EB-\u07F3\u0816-\u0819\u081B-\u0823\u0825-\u0827\u0829-\u082D\u0859-\u085B\u08E3-\u0903\u093A-\u093C\u093E-\u094F\u0951-\u0957\u0962\u0963\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A70\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0C00-\u0C03\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C81-\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0D01-\u0D03\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2\u0DF3\u0E31\u0E34-\u0E3A\u0E47-\u0E4E\u0EB1\u0EB4-\u0EB9\u0EBB\u0EBC\u0EC8-\u0ECD\u0F18\u0F19\u0F35\u0F37\u0F39\u0F3E\u0F3F\u0F71-\u0F84\u0F86\u0F87\u0F8D-\u0F97\u0F99-\u0FBC\u0FC6\u102B-\u103E\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F\u109A-\u109D\u135D-\u135F\u1712-\u1714\u1732-\u1734\u1752\u1753\u1772\u1773\u17B4-\u17D3\u17DD\u180B-\u180D\u18A9\u1920-\u192B\u1930-\u193B\u1A17-\u1A1B\u1A55-\u1A5E\u1A60-\u1A7C\u1A7F\u1AB0-\u1ABE\u1B00-\u1B04\u1B34-\u1B44\u1B6B-\u1B73\u1B80-\u1B82\u1BA1-\u1BAD\u1BE6-\u1BF3\u1C24-\u1C37\u1CD0-\u1CD2\u1CD4-\u1CE8\u1CED\u1CF2-\u1CF4\u1CF8\u1CF9\u1DC0-\u1DF5\u1DFC-\u1DFF\u20D0-\u20F0\u2CEF-\u2CF1\u2D7F\u2DE0-\u2DFF\u302A-\u302F\u3099\u309A\uA66F-\uA672\uA674-\uA67D\uA69E\uA69F\uA6F0\uA6F1\uA802\uA806\uA80B\uA823-\uA827\uA880\uA881\uA8B4-\uA8C4\uA8E0-\uA8F1\uA926-\uA92D\uA947-\uA953\uA980-\uA983\uA9B3-\uA9C0\uA9E5\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA7B-\uAA7D\uAAB0\uAAB2-\uAAB4\uAAB7\uAAB8\uAABE\uAABF\uAAC1\uAAEB-\uAAEF\uAAF5\uAAF6\uABE3-\uABEA\uABEC\uABED\uFB1E\uFE00-\uFE0F\uFE20-\uFE2F'

    # Combining marks outside the BMP, written as UTF-16 surrogate pairs.
    # Note: Python 3 strings store astral characters as single code points,
    # so these surrogate-pair alternatives will not match them.
    combining_marks_astral = '\uD805[\uDCB0-\uDCC3\uDDAF-\uDDB5\uDDB8-\uDDC0\uDDDC\uDDDD\uDE30-\uDE40\uDEAB-\uDEB7\uDF1D-\uDF2B]|\uD834[\uDD65-\uDD69\uDD6D-\uDD72\uDD7B-\uDD82\uDD85-\uDD8B\uDDAA-\uDDAD\uDE42-\uDE44]|\uD804[\uDC00-\uDC02\uDC38-\uDC46\uDC7F-\uDC82\uDCB0-\uDCBA\uDD00-\uDD02\uDD27-\uDD34\uDD73\uDD80-\uDD82\uDDB3-\uDDC0\uDDCA-\uDDCC\uDE2C-\uDE37\uDEDF-\uDEEA\uDF00-\uDF03\uDF3C\uDF3E-\uDF44\uDF47\uDF48\uDF4B-\uDF4D\uDF57\uDF62\uDF63\uDF66-\uDF6C\uDF70-\uDF74]|\uD81B[\uDF51-\uDF7E\uDF8F-\uDF92]|\uD81A[\uDEF0-\uDEF4\uDF30-\uDF36]|\uD82F[\uDC9D\uDC9E]|\uD800[\uDDFD\uDEE0\uDF76-\uDF7A]|\uD836[\uDE00-\uDE36\uDE3B-\uDE6C\uDE75\uDE84\uDE9B-\uDE9F\uDEA1-\uDEAF]|\uD802[\uDE01-\uDE03\uDE05\uDE06\uDE0C-\uDE0F\uDE38-\uDE3A\uDE3F\uDEE5\uDEE6]|\uD83A[\uDCD0-\uDCD6]|\uDB40[\uDD00-\uDDEF]'

    # Any Unicode letter: a word char that is neither a digit nor an underscore.
    letter = r'[^\W\d_]'
    pat = re.compile(r'(?:{}|[{}]|{})+'.format(letter, combining_marks_bmp, combining_marks_astral))

    # text is the string defined in the earlier snippets
    print(" ".join(pat.findall(text)))
    # => Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России
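
If maintaining the ranges by hand feels fragile, another option is to classify each character by its Unicode general category with the standard unicodedata module. This is not part of the answer above, just a sketch of the same "letters plus marks" idea without regular expressions; the function name keep_letters_and_marks is hypothetical.

```python
import unicodedata


def keep_letters_and_marks(text: str) -> str:
    """Keep letters (L*) and combining marks (M*); turn everything else into spaces."""
    kept = (
        ch if unicodedata.category(ch)[0] in "LM" else " "
        for ch in text
    )
    # split() + join() collapses the runs of spaces and trims both ends.
    return " ".join("".join(kept).split())


print(keep_letters_and_marks("Андре\u0301й (род. 1981) — футболист"))
```

Because it relies on the Unicode database shipped with the interpreter, this covers astral-plane marks as well, at the cost of a per-character Python loop.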

7 Comments

I feel like OP would benefit from checking out the "Combining Diacritical Marks" Block where the accent \u0301 is housed, and evaluate whether other accent characters might be required.
@r.ook Regarding whether other accent characters might be required: there are no accents in the Russian language. The accent mark used here is a rare case of word stress being shown in writing. Such marks do not change word meaning; they only show which vowel should be pronounced with greater force. They are usually not used at all, since we know where to stress words. No need for \p{M} here, it would be redundant overhead, UNLESS the user needs to handle other languages where diacritics change the meaning.
It is true that my example (from a textbook for foreigners learning Russian) has only a few accent marks to show word stress. However, I would also be interested in a general solution that preserves all "combining diacritical marks" in any kind of text.
@asmaier Do you think the PyPI regex module solution is enough? Or do you want a re solution as well?
@asmaier Added another one, for re.

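Try (?:[^\w\x{301}\s]|[\d_])+
Use \u0301 instead of \x{301} if your engine uses that notation. Python's re does: it accepts \u0301 but not \x{...}.

Or use Unicode properties if they are supported (Python's re does not support them; the PyPI regex module does):

[^\p{L}\x{0301}\s]+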
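
As an illustration, the first pattern could be used from Python's re roughly as follows. This is a sketch rather than the answerer's own code: the names PATTERN and clean are made up here, and \x{301} is written as \u0301.

```python
import re

# \u0301 (COMBINING ACUTE ACCENT) is excluded from the negated class, so it survives;
# digits and underscores are word chars, so they are matched by the second alternative.
PATTERN = re.compile(r"(?:[^\w\u0301\s]|[\d_])+")


def clean(text: str) -> str:
    replaced = PATTERN.sub(" ", text)
    # The pattern never consumes whitespace, so collapse the doubled spaces afterwards.
    return " ".join(replaced.split())


print(clean("Арша\u0301вин (род. 29 мая 1981[4], Ленинград)"))
```

The extra split/join step is needed because each removed run is replaced by a space next to the spaces that were already in the text.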


I want to offer a Pythonic solution that does not involve regular expressions.

It uses the translate method on strings.

1. Python documentation on str.maketrans

2. Python documentation on str.translate

    from string import digits
    import itertools as it
    import unicodedata

    # This creates a special dictionary to pass to the translation method.
    # This will replace all digits and punctuation with an empty string.
    # text is the string from the question.
    translation = str.maketrans(
        dict(
            zip(
                (
                    *digits,
                    *(  # punctuation
                        item
                        for item in set(text)
                        if unicodedata.category(item).startswith("P")
                    ),
                ),
                it.cycle(("",)),
            )
        )
    )

    print(" ".join(text.translate(translation).split()))

OUTPUT:

Андре́й Серге́евич Арша́вин род мая Ленинград российский футболист бывший капитан сборной России заслуженный мастер спорта России

You can choose any character for the substitution. I chose an empty string "" for deletion.
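
For instance, a more compact variant with a configurable replacement might look like the sketch below. The function name strip_digits_and_punct and its replacement parameter are hypothetical, and unlike the code above it uses str.isdigit(), which also covers non-ASCII digits.

```python
import unicodedata


def strip_digits_and_punct(text: str, replacement: str = " ") -> str:
    # str.translate accepts a dict that maps code points to replacement strings.
    table = {
        ord(ch): replacement
        for ch in set(text)
        if ch.isdigit() or unicodedata.category(ch).startswith("P")
    }
    # Normalize whitespace after the substitution.
    return " ".join(text.translate(table).split())


sample = "Арша\u0301вин (род. 29 мая 1981[4], Ленинград) — футболист"
print(strip_digits_and_punct(sample))
```

Replacing with a space instead of an empty string keeps words apart when punctuation is the only separator, for example in "foo,bar".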

3 Comments

Close, but I would like to also remove the "—" from the text.
I just noticed that, too. I am going to see if there is some version of punctuation that contains extra punctuation characters.
I found a possible hack for identifying Unicode punctuation here: groups.google.com/forum/#!topic/comp.lang.python/5tnXt-o534Y See the edit. I added a translation for all Unicode punctuation.
