I have the following regex in a C# program, and have difficulties understanding it:

(?<=#)[^#]+(?=#) 

I'll break it down to what I think I understood:

(?<=#)  a group, matching a hash. what's `?<=`?
[^#]+   one or more non-hashes (used to achieve non-greediness)
(?=#)   another group, matching a hash. what's the `?=`?

So the problem I have is with the ?<= and ?= parts. From reading MSDN, ?<name> is used for naming groups, but in this case the angle bracket is never closed.

I couldn't find ?= in the docs, and searching for it is really difficult, because search engines will mostly ignore those special chars.

2 Answers

They are called lookarounds; they allow you to assert if a pattern matches or not, without actually making the match. There are 4 basic lookarounds:

  • Positive lookarounds: see if we CAN match the pattern...
    • (?=pattern) - ... to the right of current position (look ahead)
    • (?<=pattern) - ... to the left of current position (look behind)
  • Negative lookarounds - see if we can NOT match the pattern
    • (?!pattern) - ... to the right
    • (?<!pattern) - ... to the left

As an easy reminder, for a lookaround:

  • = is positive, ! is negative
  • < is look behind, otherwise it's look ahead
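
A minimal sketch of the four forms, written in Python purely for illustration (the question is about C#, but these four constructs are spelled the same way in .NET). The sample string and the `starts` helper are made up for this example:

```python
import re

# Made-up sample text: "ab", "ac" and "cb" separated by spaces.
text = "ab ac cb"

def starts(pattern, s=text):
    """Return the start index of every match of pattern in s."""
    return [m.start() for m in re.finditer(pattern, s)]

ahead_pos = starts(r"a(?=b)")     # 'a' that IS followed by 'b'      -> [0]
behind_pos = starts(r"(?<=a)b")   # 'b' that IS preceded by 'a'      -> [1]
ahead_neg = starts(r"a(?!b)")     # 'a' that is NOT followed by 'b'  -> [3]
behind_neg = starts(r"(?<!a)b")   # 'b' that is NOT preceded by 'a'  -> [7]

# The lookaround itself contributes nothing to the match:
# a(?=b) matches only the single character 'a'.
widths = [len(m.group()) for m in re.finditer(r"a(?=b)", text)]  # -> [1]
```

In each case the character inside the lookaround is checked but never becomes part of the matched text.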

But why use lookarounds?

One might argue that lookarounds in the pattern above aren't necessary, and #([^#]+)# will do the job just fine (extracting the string captured by \1 to get the non-#).

Not quite. The difference is that since a lookaround doesn't match the #, it can be "used" again by the next attempt to find a match. Simplistically speaking, lookarounds allow "matches" to overlap.

Consider the following input string:

and #one# and #two# and #three#four# 

Now, #([a-z]+)# will give the following matches (as seen on rubular.com):

and #one# and #two# and #three#four#
    \___/     \___/     \_____/

Compare this with (?<=#)[a-z]+(?=#), which matches:

and #one# and #two# and #three#four#
     \_/       \_/       \___/ \__/

Unfortunately this can't be demonstrated on rubular.com, since it doesn't support lookbehind. However, it does support lookahead, so we can do something similar with #([a-z]+)(?=#), which matches (as seen on rubular.com):

and #one# and #two# and #three#four#
    \__/      \__/      \____/\___/
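
The three patterns can also be compared side by side in an engine that supports lookbehind. This is a sketch in Python rather than C#, chosen only for convenience; the variable names are illustrative:

```python
import re

text = "and #one# and #two# and #three#four#"

# Both hashes are consumed, so the '#' after "three" is used up
# and cannot also open "four". findall returns the captured group.
consuming = re.findall(r"#([a-z]+)#", text)
# -> ['one', 'two', 'three']

# Neither hash is consumed, so the '#' between "three" and "four"
# can close one match and open the next.
lookaround = re.findall(r"(?<=#)[a-z]+(?=#)", text)
# -> ['one', 'two', 'three', 'four']

# Only the leading hash is consumed; the trailing one is just asserted.
lookahead_only = [m.group(0) for m in re.finditer(r"#([a-z]+)(?=#)", text)]
# -> ['#one', '#two', '#three', '#four']
```

The first pattern misses "four" because the hash it needs was already consumed by the previous match.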

As another poster mentioned, these are lookarounds, special constructs for changing what gets matched and when. This says:

(?<=#)  check for, but don't include in the match, a `#` immediately before the next expression
[^#]+   one or more characters that are not `#`, and
(?=#)   check for, but don't include in the match, a `#` immediately after the previous expression

So this will match all the characters in between two #s.
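
A quick way to see what "in between two #s" means in practice is to run the pattern from the question on a short string. The sketch below uses Python instead of C# for brevity, and the sample text is made up:

```python
import re

# Made-up sample: two hash-delimited words with plain text around them.
text = "a #one# b #two# c"

between = re.findall(r"(?<=#)[^#]+(?=#)", text)
# -> ['one', ' b ', 'two']
```

Note that " b " is returned as well: since the hashes are never consumed, the text between a closing hash and the next opening hash also sits between two #s.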

Lookaheads and lookbehinds are very useful in many cases. Consider, for example, the rule "match all bs not followed by an a." Your first attempt might be something like b[^a], but that's not right: this will also match the bu in bus or the bo in boy, but you only wanted the b. And it won't match the b in cab, even though that's not followed by an a, because there are no more characters to match.

To do that correctly, you need a lookahead: b(?!a). This says "match a b but don't match an a afterwards, and don't make that part of the match". Thus it'll match just the b in bolo, which is what you want; likewise it'll match the b in cab.
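
The difference between the two attempts can be checked on a few words. This is an illustrative Python sketch (the same patterns work in C#); `first_match` is a hypothetical helper and "bat" is added as a word where the b is followed by an a:

```python
import re

def first_match(pattern, word):
    """Return the text of the first match of pattern in word, or None."""
    m = re.search(pattern, word)
    return m.group() if m else None

words = ["bus", "boy", "cab", "bolo", "bat"]

# b[^a] needs a second character and makes it part of the match.
char_class = {w: first_match(r"b[^a]", w) for w in words}
# -> {'bus': 'bu', 'boy': 'bo', 'cab': None, 'bolo': 'bo', 'bat': None}

# b(?!a) only checks what follows; the match is just the 'b'.
lookahead = {w: first_match(r"b(?!a)", w) for w in words}
# -> {'bus': 'b', 'boy': 'b', 'cab': 'b', 'bolo': 'b', 'bat': None}
```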

2 Comments

You said: b(?!a) - "This says 'match a b followed by something that is not an a'" -- I think that's misleading, actually. It says "match a b, after which you can't match an a." In particular, b doesn't really have to be followed by anything; it most definitely doesn't have to be followed by [^a]. It can be at the end of the string. That's where b(?!a) and b(?=[^a]) differ.
You're right, that wasn't the best wording. I'll edit to clarify.
