
I am new to Python regex and am trying to match non-whitespace ASCII characters.

The following is my code:

    import re
    p = re.compile(r"[\S]{2,3}", re.ASCII)
    p.search('1234')     # returns a match
    p.search('你好吗')    # also returns a match, but why?

I have specified ASCII mode in re.compile, but p.search('你好吗') still returns a match. What am I doing wrong here?

  • You have to use \u; refer to stackoverflow.com/questions/393843/… Commented Apr 14, 2020 at 4:08
  • re.ASCII is not what you think, in this case Commented Apr 14, 2020 at 4:21
  • @NikosM. so what does it mean? The doc says it will enforce ASCII mode if I understand correctly. Commented Apr 14, 2020 at 5:27
  • @jdhao Non-space is still non-space, even in enforced ASCII mode. It does not matter whether the character is Unicode: it is non-space either way. Commented Apr 14, 2020 at 5:28
  • @NikosM. I think I get the point. In ASCII mode, \S is equivalent to [^ \t\n\r\f\v], so non-ASCII Unicode characters still match. Thanks! Commented Apr 14, 2020 at 5:32

1 Answer


The re.A flag only affects what shorthand character classes match.

In Python 3.x, shorthand character classes are Unicode-aware for str patterns: the behavior that required re.UNICODE/re.U in Python 2.x is on by default. That means:

  • \d - Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]).
  • \D - Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category.)
  • \w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in the string My name is Виктор.)
  • \W - Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
  • \s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.).
  • \S - Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
  • \b - Word boundaries match locations between Unicode letters/digits and non-letters/digits, or at the start/end of the string.
  • \B - Non-word boundaries match locations between two Unicode letters/digits, between two non-letters/digits, or between a Unicode non-letter/digit and the start/end of the string.
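
As a rough illustration of the default behavior listed above, the following sketch runs a few shorthand classes without any flags. The sample strings and variable names are made up for the example; \u0664\u0662 are Arabic-Indic digits, \xa0 is a hard space and \x85 is NEL.

```python
import re

# No flags: str patterns use Unicode-aware shorthand classes in Python 3.
text = "My name is Виктор"

# \w+ picks up the Cyrillic word as well as the Latin ones.
words = re.findall(r"\w+", text)

# \d+ matches ASCII digits and digits from other scripts (category Nd).
digits = re.findall(r"\d+", "42 \u0664\u0662")

# \s treats the hard space (\xa0) and NEL (\x85) as whitespace.
spaces = re.findall(r"\s", "a\xa0b\x85c")

print(words)
print(digits)
print(spaces)
```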

To disable this behavior, use re.A or re.ASCII, which the documentation describes as follows:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

That means that:

  • \d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
  • \D = [^0-9] - matches any character other than an ASCII digit, so it now also matches non-ASCII Unicode digits (the characters that (?u)(?![0-9])\d matches)
  • \w = [A-Za-z0-9_] - and it only matches ASCII words now: Wiktor is matched by \w+, but Виктор is not
  • \W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
  • \s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
  • \S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed or vertical tab, so it matches all Unicode letters, digits and punctuation, as well as Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) returns '{\xa0} ': \S now matches the hard space. This is why [\S]{2,3} with re.ASCII still matches 你好吗.
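
To tie this back to the question, here is a minimal sketch of one possible way to match only non-whitespace ASCII characters. It assumes "non-whitespace ASCII" means the printable range from ! (0x21) to ~ (0x7E), which also excludes control characters; the explicit class [!-~] and the variable names are choices made for this example, not something the re module prescribes.

```python
import re

# With re.ASCII, \S means "anything except [ \t\n\r\f\v]",
# so non-ASCII characters such as CJK text still match.
ascii_non_space = re.compile(r"\S{2,3}", re.ASCII)
cjk_match = ascii_non_space.search("你好吗")

# Assumed alternative: spell out the printable, non-space ASCII range.
printable_ascii = re.compile(r"[!-~]{2,3}")
digits_match = printable_ascii.search("1234")
no_match = printable_ascii.search("你好吗")

print(cjk_match)     # a match object
print(digits_match)  # a match object covering '123'
print(no_match)      # None
```

If control characters should count as well, the range would need to be widened accordingly.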