Can someone please explain this behavior of regex:

When I replace the last two characters of a Unicode string with another Unicode character, it works fine with the end-of-line anchor ($) at the end of the pattern, but gives unexpected results if I put the $ in square brackets ([$]).

The word boundary \b also gives unexpected results and, surprisingly, \B matches what \b is supposed to match.

>>> line = u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670$', ur'\u0627', line)  # works fine
u'\u0627\u062f\u0646\u0627'
>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line)  # unexpected result
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670[$]', ur'\u0627', line, re.U)  # still not working
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670\b', ur'\u0627', line, re.U)  # unexpected
u'\u0627\u062f\u0646\u06cc\u0670'
>>> re.sub(ur'\u06cc\u0670\B', ur'\u0627', line, re.U)  # unexpected
u'\u0627\u062f\u0646\u0627'

1 Answer

  1. The signature of re.sub is:

    sub(pattern, repl, string, count=0, flags=0) 

    In your calls re.U is passed positionally, so it lands in count: it is treated as a maximum number of replacements and no flag is actually set. Pass it as a keyword argument instead:

    re.sub(ur'\u06cc\u0670\b', u'\u0627', line, flags=re.U)
    #                                           ^~~~~~
  2. […] defines a character class, and $ is not special inside the brackets. So [$] will just match a literal dollar sign.

  3. \b matches the boundary between a word character ("\w") and a non-word character ("\W", or the start/end of the string), and \B matches anywhere that \b does not. Now, \u0670 is a non-word character in Unicode:

    >>> re.findall(ur'\w', line, flags=re.U)
    [u'\u0627', u'\u062f', u'\u0646', u'\u06cc']
    >>> re.findall(ur'\W', line, flags=re.U)
    [u'\u0670']

    This means the end of the string after \u0670 is not a word boundary, because \u0670 is not a word character and there is nothing after it. So \b cannot match there, which means \B does.

    The meaning of \w in Unicode is "[0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database".

    A character like U+06CC (Arabic Letter Farsi Yeh) is categorized as Letter, Other (Lo), so it is a word character, but U+0670 (Arabic Letter Superscript Alef) is categorized as Mark, Nonspacing (Mn), so it is not considered a word character.
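
The classification described in point 3 can be checked directly. This is a minimal sketch for Python 3, where str patterns are Unicode-aware by default, so the u/ur prefixes and flags=re.U from the snippets above are not needed. The variable names are only illustrative.

```python
import re
import unicodedata

line = '\u0627\u062f\u0646\u06cc\u0670'

# General category of each character: 'Lo' is a letter, 'Mn' a nonspacing mark.
categories = [unicodedata.category(ch) for ch in line]

# Which characters the re module treats as word / non-word characters.
word_chars = re.findall(r'\w', line)
non_word_chars = re.findall(r'\W', line)

# \b needs a word character on exactly one side. After U+0670 at the end of
# the string there is none on either side, so only \B can match there.
with_b = re.sub(r'\u06cc\u0670\b', '\u0627', line)
with_B = re.sub(r'\u06cc\u0670\B', '\u0627', line)

print(categories)
print(word_chars, non_word_chars)
print(with_b == line, with_B == '\u0627\u062f\u0646\u0627')
```

Only the last character of line has category Mn, and it is the only one that \W picks up.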

(For details of Python's regex syntax, see https://docs.python.org/2/library/re.html)
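
Points 1 and 2 can be verified in the same way. The following is a hedged Python 3 sketch, not part of the original session: it passes count by keyword to show what a flag value does when it ends up in that parameter, and it shows that [$] only matches when the text really contains a dollar sign. The sample strings are made up for illustration.

```python
import re

line = '\u0627\u062f\u0646\u06cc\u0670'

# re.U is an integer-valued flag. If it ends up in the count parameter it is
# simply read as "replace at most this many times".
count_value = int(re.U)
sample = 'a' * 40
limited = re.sub(r'a', 'b', sample, count=count_value)

# [$] is a character class containing a literal dollar sign, not an anchor.
no_dollar = re.sub(r'\u06cc\u0670[$]', '\u0627', line)
with_dollar = re.sub(r'\u06cc\u0670[$]', '\u0627', line + '$')

print(limited.count('b') == count_value)
print(no_dollar == line)
print(with_dollar == '\u0627\u062f\u0646\u0627')
```

The second call leaves line unchanged because there is no literal $ to match, while the third consumes the appended $ together with the two characters before it.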


As for the comment below, you can use a negative look-ahead instead of a group:

re.sub(ur'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U) 

Here,

  • [\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2] is the same as your \u06cc\u0670|\u06d2\u0670|\u0670\u06cc|\u0670\u06d2, but with similar cases grouped together
  • (?:…) defines a non-capturing group, so that the "\b" you want can be extracted out from the alternations
  • (?!\w) is a negative look-ahead: we match only if the next character is not a word character (the end of the string also qualifies).

The results look like this:

>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line, flags=re.U)
u'\u0627\u062f\u0646\u0627'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u0646', flags=re.U)
u'\u0627\u062f\u0646\u06cc\u0670\u0646'
>>> re.sub(u'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)', u'\u0627', line + u'\u061f', flags=re.U)
u'\u0627\u062f\u0646\u0627\u061f'
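
For anyone on Python 3, the same look-ahead approach might be written roughly as follows. This is a sketch under the assumption that the pattern is reused many times, so it is compiled once. PATTERN and replace_trailing_pair are hypothetical names chosen for this example, not anything from the re module.

```python
import re

# Same pattern as above; a raw str pattern is Unicode-aware in Python 3,
# so neither the u prefix nor flags=re.U is required.
PATTERN = re.compile(r'(?:[\u06cc\u06d2]\u0670|\u0670[\u06cc\u06d2])(?!\w)')


def replace_trailing_pair(text, replacement='\u0627'):
    """Replace each matched pair that is not followed by a word character."""
    return PATTERN.sub(replacement, text)


line = '\u0627\u062f\u0646\u06cc\u0670'

at_end = replace_trailing_pair(line)                     # pair at end of string
before_letter = replace_trailing_pair(line + '\u0646')   # followed by a letter
before_punct = replace_trailing_pair(line + '\u061f')    # followed by punctuation

print(at_end == '\u0627\u062f\u0646\u0627')
print(before_letter == line + '\u0646')
print(before_punct == '\u0627\u062f\u0646\u0627\u061f')
```

Because the look-ahead consumes nothing, there is no captured group to put back in the replacement, which avoids the question of whether to write \1, \2 and so on.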

1 Comment

Actually I want something like re.sub(u'\u06cc\u0670\b|\u06d2\u0670\b|\u0670\u06cc\b|\u0670\u06d2\b', u'\u0627', line, re.U), but since \b won't work in this case, I tried re.sub(u'\u06cc\u0670(\s)|\u06d2\u0670(\s)|\u0670\u06cc(\s)|\u0670\u06d2(\s)', ur'\u0627\1', line, re.U). Now I don't know which group number to use in the replacement: \1, \2, or something else. Can you please help me with this?
