
My understanding of the Python 3.4 re module, based on the language reference, does not seem to match the results of my experiments with the module.


My current understanding

The regular expression engine can be thought of as a separate entity with its own programming language (regex). It just happens to live inside Python, alongside a variety of other mini-languages. As such, Python must pass the (regex) pattern/code to this independent interpreter, if you will.

For clarity, the following text uses the notion of logical length, which represents how long a given string logically is. For example, the special character carriage return \r has len=1 since it is a single character. However, the two distinct characters \r (a backslash followed by an r) have len=2.

Step 1) Let's say we want to match a carriage return \r len=1 in some text.

Step 2) We need to feed the pattern \r len=2 (2 distinct characters) to the regular expression engine.

Step 3) The regular expression engine receives \r len=2 and interprets the pattern as: match the special character carriage return \r len=1.

Step 4) It goes ahead and does the magic.

The problem is that the backslash character \ is itself treated specially by the Python interpreter: in string literals it is used to escape other characters (like quotes).

So when we are coding in Python and want to send the pattern \r len=2 to the internal regular expression interpreter, we must type pattern = '\\r' or alternatively pattern = r'\r' to express \r len=2.
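A minimal sketch to check the lengths described above, using only the built-in len. The variable names are made up for illustration.

```python
# Three ways of writing a string literal and what each one actually contains.
cr = '\r'          # Python turns \r into a single carriage return
escaped = '\\r'    # Python turns \\ into one backslash, followed by r
raw = r'\r'        # raw literal: backslash and r are kept as typed

print(len(cr))          # logical length of the carriage return
print(len(escaped))     # logical length of backslash + r
print(escaped == raw)   # the two spellings give the same string
print(cr == raw)        # a carriage return is not backslash + r
```

Only the last two spellings hand the two-character pattern to the regex engine; the first one hands it a bare carriage return.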


And everything is well... until

I try a couple of experiments involving re.escape

[Three screenshots of interpreter sessions were posted here; the images are not available.]


Summary of questions

Point 1) Please confirm/modify my current understanding of the regex engine.

Point 2) Why do these supposedly non-textbook patterns match?

Point 3) What on earth is going on with \\\r from re.escape, and the whole "we have the same string lengths, but we compared unequal, but we ALSO all worked the same in matching a carriage return in the previous re.search test".

  • In the future please post formatted text, not screenshots. This makes it easier for others to copy and paste to replicate your issue locally, and also is more accessible to those using screenreaders, etc. Commented Mar 4, 2016 at 14:19

1 Answer

You need to understand that each time you write a pattern, it is first interpreted as a Python string literal, and only then read and interpreted a second time by the regex engine. Let's describe what happens:

>>> import re
>>> s = '\r'

s contains the character CR.

>>> re.match('\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

Here the Python string '\r' contains a single CR, so a literal CR is given to the regex engine.

>>> re.match('\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

The string is now a literal backslash and a literal r. The regex engine receives these two characters, and since \r is also a regex escape sequence that means a CR character, you obtain a match here too.

>>> re.match('\\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>

The string contains a literal backslash and a literal CR, so the regex engine receives \ followed by CR. Since \CR isn't a known regex escape sequence, the backslash is ignored, the CR is matched literally, and you obtain a match.
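The three patterns can be compared side by side with a short sketch. As far as I can tell, re.escape('\r') returns a backslash followed by a CR, which is the third case above. The dictionary keys are made-up labels for illustration.

```python
import re

s = '\r'
patterns = {
    'literal CR': '\r',          # 1 character: CR
    'regex escape': '\\r',       # 2 characters: backslash, r
    're.escape': re.escape(s),   # 2 characters: backslash, CR
}

# Show each pattern's length, contents, and whether it matches a CR.
for name, pat in patterns.items():
    print(name, len(pat), repr(pat), bool(re.match(pat, s)))

# Same length, different second character, so the strings compare unequal
# even though both patterns match a carriage return.
print(patterns['regex escape'] == patterns['re.escape'])
```

This would explain how two patterns can have the same length, compare unequal, and still behave the same in re.search.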

Note that for the regex engine, a literal backslash is written as the escape sequence \\, so the pattern string in Python is r'\\' or '\\\\'.
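A small sketch of that last point, matching a literal backslash with both spellings. The sample text is invented for the example.

```python
import re

text = 'a\\b'   # three characters: a, backslash, b

# Both literals give the regex engine two backslashes,
# which it reads as "match one literal backslash".
m_raw = re.search(r'\\', text)
m_plain = re.search('\\\\', text)

print(m_raw.span(), m_plain.span())
print(r'\\' == '\\\\')   # the two Python literals are the same string
```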


2 Comments

When you say the backslash is ignored in your last example \CR, does it mean that the regex engine silently converts \CR to CR? So any unknown sequence simply becomes the character itself (stripped of the backslash)? Let's say \X \Y \Z are all unknown, then pat\Xte\Yrn\Z will also silently become pattern?
@Alan: Exactly, test it yourself: re.match(r'\l', 'l') or re.match('\\l', 'l'). Only the backslash is ignored, not the following character: pat\Xte\Yrn\K becomes patXteYrnK for the regex engine. (\Z has a special meaning.) Note that this applies to Python 3.4; from Python 3.6 on, an unknown escape made of a backslash and an ASCII letter in a pattern is an error.
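A hedged way to test this on whatever interpreter is at hand. The helper name compiles is hypothetical, and the result for r'\l' depends on the Python version: it is accepted on 3.4, while newer releases reject unknown escapes made of a backslash and an ASCII letter.

```python
import re

def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Backslash before a non-alphanumeric character:
# the character is matched literally.
print(bool(re.match('\\\r', '\r')))   # backslash + CR against CR
print(bool(re.match(r'\!', '!')))     # backslash + ! against !

# Backslash before an ASCII letter with no special meaning:
# the outcome is version-dependent, so it is only printed here.
print(compiles(r'\l'))

# \Z is a known escape (end of string), so it always compiles.
print(compiles(r'\Z'))
```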
