Python regex: pattern with re.ASCII can still match unicode characters?

Question

I am new to Python regex and am trying to match non-whitespace ASCII characters.

The following is my code:

    import re

    p = re.compile(r"[\S]{2,3}", re.ASCII)
    p.search('1234')    # returns a match
    p.search('你好吗')  # also returns a match, but why?

I have specified ASCII mode in re.compile, but p.search('你好吗') still returns a match. What am I doing wrong here?

You have to use \u; refer to this: stackoverflow.com/questions/393843/…
– Dickens A S, Commented Apr 14, 2020 at 4:08
@NikosM. So what does it mean? The docs say it will enforce ASCII mode, if I understand correctly.
– jdhao, Commented Apr 14, 2020 at 5:27
@jdhao Non-space is still non-space, even in enforced ASCII mode. It does not matter whether the character is Unicode: it is still a non-space character.
– Nikos M., Commented Apr 14, 2020 at 5:28
@NikosM. I kind of get the point. In ASCII mode, \S is equivalent to [^ \t\n\r\f\v], so Unicode characters should match. Thanks!
– jdhao, Commented Apr 14, 2020 at 5:32

Wiktor Stribiżew · Accepted Answer · 2020-04-14 08:02:18Z

The re.A flag only affects what shorthand character classes match.

In Python 3.x, shorthand character classes are Unicode-aware: the behavior of the Python 2.x re.UNICODE/re.U flag is ON by default. That means:

\d - Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]).
\D - Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category.)
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in the string My name is Виктор.)
\W - Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
\s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.).
\S - Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
\b - Word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
\B - Non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
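The default behavior listed above can be checked with a short sketch. The sample strings (Devanagari digits, a hard space, NEL) are illustrative choices, not taken from the question:

```python
import re

# Default (Unicode) mode in Python 3: shorthand classes are Unicode-aware.
text = "My name is Виктор"
words = re.findall(r"\w+", text)  # the Cyrillic word is matched too

# Devanagari digits one, two, three (Unicode category Nd).
devanagari_digits = "\u0967\u0968\u0969"
digit_match = re.fullmatch(r"\d+", devanagari_digits)

# Hard (no-break) space and NEL both count as whitespace for \s.
nbsp = "\xa0"
nel = "\x85"
space_matches = [bool(re.fullmatch(r"\s", ch)) for ch in (nbsp, nel)]

print(words)
print(digit_match is not None)
print(space_matches)
```

Without any flag, all three checks succeed: \w+ finds every word including Виктор, \d+ accepts the non-ASCII digits, and \s accepts both non-ASCII whitespace characters.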

If you want to disable this behavior, you use re.A or re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

That means that:

\d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
\D = [^0-9] - and matches any characters other than ASCII digits, so it now also matches Hindi, Bengali, etc. digits (i.e. it acts as (?u)(?:\D|(?![0-9])\d) now)
\w = [A-Za-z0-9_] - and it only matches ASCII words now: Wiktor is matched with \w+, but Виктор is not
\W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
\S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation, as well as Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{\xa0} ': the \S now matches hard spaces.
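The ASCII-mode equivalences above can be verified with a small sketch, which also shows one possible way to get what the question asked for. The last pattern is an assumption: it treats "non-whitespace ASCII" as the printable range U+0021 to U+007E, which is not something the answer itself prescribes.

```python
import re

# With re.ASCII, \w shrinks to [A-Za-z0-9_]: only the ASCII word is found.
ascii_words = re.findall(r"\w+", "Wiktor Виктор", flags=re.ASCII)

# \d no longer matches non-ASCII digits, so the negated \D now does.
deva = "\u0967\u0968\u0969"  # Devanagari digits, illustrative sample
deva_is_digit = re.fullmatch(r"\d+", deva, flags=re.ASCII) is not None
deva_is_nondigit = re.fullmatch(r"\D+", deva, flags=re.ASCII) is not None

# The pattern from the question: \S is a negated class, so every
# non-ASCII character still satisfies it.
question_pattern = re.compile(r"[\S]{2,3}", re.ASCII)
cjk_match = question_pattern.search("你好吗")

# Assumption: "non-whitespace ASCII" means printable ASCII, U+0021..U+007E.
# An explicit range does not depend on the re.ASCII flag at all.
ascii_nonspace = re.compile(r"[\x21-\x7E]{2,3}")
ascii_hit = ascii_nonspace.search("1234")
cjk_miss = ascii_nonspace.search("你好吗")

print(ascii_words, deva_is_digit, deva_is_nondigit)
print(cjk_match is not None, ascii_hit is not None, cjk_miss is None)
```

An explicit positive range is used here because the flag only narrows the positive classes (\w, \d, \s); their negated counterparts grow correspondingly, so \S cannot be used on its own to restrict a match to ASCII.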
