Unexpected behavior of regex word boundaries with Unicode strings

Question

Can someone please explain this behavior of regex:

When I replace the last two characters of a Unicode string with another Unicode character, it works fine with the end-of-line anchor ($) at the end of the pattern, but gives unexpected results if I put the $ in square brackets ([$]).

Also, the word boundary \b gives unexpected results and, surprisingly, \B matches what \b is supposed to match.

>>> line = u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670$', ur'\u0627', line) #works fine
u'\u0627\u062f\u0646\u0627'
>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line) #unexpected result
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line, re.U) #still not working
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670\b', ur'\u0627', line, re.U) #unexpected
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670\B', ur'\u0627', line, re.U) #unexpected
u'\u0627\u062f\u0646\u0627'

Community · Accepted Answer · 2017-05-23 11:52:04Z

The signature of re.sub is:
```
sub(pattern, repl, string, count=0, flags=0) 
```
In your calls, re.U is being passed as count, so it has no effect as a flag. Pass it as a keyword argument instead:
```
re.sub(ur'\u06cc\u0670\b', u'\u0627', line, flags=re.U)
#                                           ^~~~~~
```
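To make the positional mistake visible, here is a minimal sketch. It assumes Python 3 (the session above is Python 2) and uses re.I rather than re.U, because Python 3 str patterns are already Unicode-aware and re.U would show no difference. The sample text is made up for illustration.

```python
import re
import warnings

text = "aAaAaA"  # made-up sample, not from the question

with warnings.catch_warnings():
    # Newer Python 3 releases warn when count is passed positionally;
    # silence that so the sketch runs cleanly everywhere.
    warnings.simplefilter("ignore")
    # re.I lands in the `count` slot: matching stays case-sensitive,
    # and at most int(re.I) == 2 replacements are made.
    positional = re.sub("a", "-", text, re.I)

# Passed by keyword, re.I really is a flag: every "a" and "A" is replaced.
keyword = re.sub("a", "-", text, flags=re.I)

print(repr(positional))
print(repr(keyword))
```

The same thing happens with re.U in the question: it is silently interpreted as a replacement limit, not as a flag.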
[…] defines a character class, and $ is not special inside the brackets. So [$] will just match a literal dollar sign.
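A short sketch of the difference between the anchor and the character class, assuming Python 3 and made-up ASCII sample strings:

```python
import re

# `$` outside brackets is an anchor: it matches at the end of the string.
anchored = re.sub(r"ab$", "X", "cab")

# `[$]` is a character class holding one literal dollar sign,
# so it needs an actual "$" character in the input.
class_without_dollar = re.sub(r"ab[$]", "X", "cab")
class_with_dollar = re.sub(r"ab[$]", "X", "cab$")

print(anchored, class_without_dollar, class_with_dollar)
```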
\b matches the boundary between a word character ("\w") and a non-word character ("\W", or the start/end of string), and \B matches anywhere that is not \b. Now, \u0670 is a non-word character in Unicode:
```
>>> re.findall(ur'\w', line, flags=re.U)
[u'\u0627', u'\u062f', u'\u0646', u'\u06cc']
>>> re.findall(ur'\W', line, flags=re.U)
[u'\u0670']
```
This means the end of the string after \u0670 is not a word boundary, because \u0670 is not a word character: both sides of that position are non-word. So \b cannot match there, and therefore \B does.
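The same reasoning can be checked in Python 3 (an assumption; there str patterns are Unicode-aware without re.U). This sketch lists where \b actually matches in the question's string and then repeats the two substitutions:

```python
import re

line = "\u0627\u062f\u0646\u06cc\u0670"  # the string from the question

word_chars = re.findall(r"\w", line)
non_word_chars = re.findall(r"\W", line)

# Positions where \b matches: the start of the string, and the gap
# between \u06cc (word) and \u0670 (non-word). The end is not one.
boundaries = [m.start() for m in re.finditer(r"\b", line)]

# \b after \u0670 fails at the end of the string; \B succeeds there.
with_b = re.sub(r"\u06cc\u0670\b", "\u0627", line)
with_B = re.sub(r"\u06cc\u0670\B", "\u0627", line)

print(boundaries)
print(with_b == line, with_B == "\u0627\u062f\u0646\u0627")
```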

The meaning of \w in Unicode is "[0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database".

A character like U+06CC (Arabic Letter Farsi Yeh) is categorized as Letter, Other (Lo), so it is a word character, but U+0670 (Arabic Letter Superscript Alef) is categorized as Mark, Nonspacing (Mn), so it is not considered a word character.
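The categories can be looked up with the standard unicodedata module. A minimal Python 3 sketch follows; the str.isalnum() check is only an approximation of what \w accepts (the underscore is handled separately), included to show how the category lines up with word-ness:

```python
import unicodedata

line = "\u0627\u062f\u0646\u06cc\u0670"  # the string from the question

# Map each code point to its Unicode general category.
categories = {"U+%04X" % ord(ch): unicodedata.category(ch) for ch in line}

# Approximation of \w membership: alphanumeric according to Unicode.
alnum = {"U+%04X" % ord(ch): ch.isalnum() for ch in line}

for code_point, category in categories.items():
    print(code_point, category, alnum[code_point])
```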

(See the Python regex syntax documentation at https://docs.python.org/2/library/re.html for details.)

As for the comment below, you can use a negative look-ahead instead of a group:

re.sub(ur'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U)

Here,

[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2] is the same as your \u06cc\u0670|\u06d2\u0670|\u0670\u06cc|\u0670\u06d2, but with similar cases grouped together
(?:…) defines a non-capturing group, so that the "\b" you want can be factored out of the alternation
(?!\w) means we match only if the next character is not a word character (or there is no next character).

The result is like:

>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U)
u'\u0627\u062f\u0646\u0627'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u0646', flags=re.U)
u'\u0627\u062f\u0646\u06cc\u0670\u0646'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u061f', flags=re.U)
u'\u0627\u062f\u0646\u0627\u061f'
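For readers on Python 3, the look-ahead version could be wrapped up roughly as follows. This is a sketch under assumptions: the names FINAL_PAIR and replace_final_pair are hypothetical, and only the pattern itself comes from the answer. No flags are needed because Python 3 str patterns are Unicode-aware by default.

```python
import re

# Pattern from the answer; the constant and function names are hypothetical.
FINAL_PAIR = re.compile(r"(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)")


def replace_final_pair(text, replacement="\u0627"):
    """Replace a yeh/superscript-alef pair that is not followed by a word character."""
    return FINAL_PAIR.sub(replacement, text)


line = "\u0627\u062f\u0646\u06cc\u0670"

at_end = replace_final_pair(line)                      # pair ends the string
before_letter = replace_final_pair(line + "\u0646")    # pair followed by a letter
before_punct = replace_final_pair(line + "\u061f")     # pair followed by punctuation
two_words = replace_final_pair(line + " " + line)      # pair before a space and at the end

print(ascii(at_end), ascii(before_letter), ascii(before_punct), ascii(two_words))
```

Because the look-ahead consumes nothing, the character after the pair never has to be captured and re-inserted, which avoids the group-numbering problem entirely.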

Actually, I want something like re.sub(u'\u06cc\u0670\b|\u06d2\u0670\b|\u0670\u06cc\b|\u0670\u06d2\b', u'\u0627', line, re.U), but since \b won't work in this case, I tried re.sub(u'\u06cc\u0670(\s)|\u06d2\u0670(\s)|\u0670\u06cc(\s)|\u0670\u06d2(\s)', ur'\u0627\1', line, re.U). Now I don't know which group number to use in the replacement: \1, \2, or ... . Can you please help me with this?
