How does the regular expression ‘(?<=#)[^#]+(?=#)’ work?

Question

I have the following regex in a C# program, and have difficulties understanding it:

(?<=#)[^#]+(?=#)

I'll break it down to what I think I understood:

(?<=#)   a group, matching a hash. what's `?<=`?
[^#]+    one or more non-hashes (used to achieve non-greediness)
(?=#)    another group, matching a hash. what's the `?=`?

So the problem I have is with the ?<= and ?= parts. From reading MSDN, ?<name> is used for naming groups, but in this case the angle bracket is never closed.

I couldn't find ?= in the docs, and searching for it is really difficult, because search engines will mostly ignore those special chars.

Check this for an explanation of lookarounds: stackoverflow.com/questions/2973436/… – Amarghosh, commented Jun 22, 2010 at 12:01

polygenelubricants · Accepted Answer · 2010-06-22 12:25:42Z

They are called lookarounds; they let you assert whether a pattern matches at the current position, without consuming any characters as part of the match. There are 4 basic lookarounds:

Positive lookarounds: see if we CAN match the pattern...
- (?=pattern) - ... to the right of current position (look ahead)
- (?<=pattern) - ... to the left of current position (look behind)
Negative lookarounds: see if we can NOT match the pattern...
- (?!pattern) - ... to the right
- (?<!pattern) - ... to the left

As an easy reminder, for a lookaround:

= is positive, ! is negative
< is look behind, otherwise it's look ahead
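The four forms can be tried out side by side. The following is a minimal sketch in Python's re module rather than C# (an assumption made for illustration; the lookaround syntax shown is the same in both), and the sample string "a1 b2 c3" is made up:

```python
import re

text = "a1 b2 c3"  # hypothetical sample input

# (?=pattern): positive lookahead, a letter that IS followed by "2"
pos_ahead = re.findall(r"[a-z](?=2)", text)
# (?!pattern): negative lookahead, a letter that is NOT followed by "2"
neg_ahead = re.findall(r"[a-z](?!2)", text)
# (?<=pattern): positive lookbehind, a digit that IS preceded by "b"
pos_behind = re.findall(r"(?<=b)\d", text)
# (?<!pattern): negative lookbehind, a digit that is NOT preceded by "b"
neg_behind = re.findall(r"(?<!b)\d", text)

print(pos_ahead)   # ['b']
print(neg_ahead)   # ['a', 'c']
print(pos_behind)  # ['2']
print(neg_behind)  # ['1', '3']
```

In every case the lookaround only tests its neighbor; the character it inspects never becomes part of the returned match.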

References

regular-expressions.info/Lookarounds

But why use lookarounds?

One might argue that lookarounds in the pattern above aren't necessary, and #([^#]+)# will do the job just fine (extracting the string captured by \1 to get the non-#).

Not quite. The difference is that since a lookaround doesn't match the #, it can be "used" again by the next attempt to find a match. Simplistically speaking, lookarounds allow "matches" to overlap.

Consider the following input string:

and #one# and #two# and #three#four#

Now, #([a-z]+)# will give the following matches (as seen on rubular.com):

and #one# and #two# and #three#four#
    \___/     \___/     \_____/

Compare this with (?<=#)[a-z]+(?=#), which matches:

and #one# and #two# and #three#four#
     \_/       \_/       \___/ \__/

Unfortunately this can't be demonstrated on rubular.com, since it doesn't support lookbehind. However, it does support lookahead, so we can do something similar with #([a-z]+)(?=#), which matches (as seen on rubular.com):

and #one# and #two# and #three#four#
    \__/      \__/      \____/\___/
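The three patterns can also be compared in an engine that supports lookbehind. This is a small sketch using Python's re module (chosen here for illustration, not the asker's C#), run against the same input string:

```python
import re

s = "and #one# and #two# and #three#four#"

# Both hashes are consumed, so the "#" after "three" is used up
# and "four" can never be matched.
consuming = [m.group(0) for m in re.finditer(r"#([a-z]+)#", s)]

# Neither hash is consumed, so the "#" between "three" and "four"
# serves as the lookahead of one match and the lookbehind of the next.
lookaround = re.findall(r"(?<=#)[a-z]+(?=#)", s)

# Only the leading hash is consumed; the trailing one is merely asserted.
lookahead_only = [m.group(0) for m in re.finditer(r"#([a-z]+)(?=#)", s)]

print(consuming)       # ['#one#', '#two#', '#three#']
print(lookaround)      # ['one', 'two', 'three', 'four']
print(lookahead_only)  # ['#one', '#two', '#three', '#four']
```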

References

regular-expressions.info/Flavor Comparison

John Feminella · Accepted Answer · 2010-06-22 13:17:57Z

As another poster mentioned, these are lookarounds, special constructs for changing what gets matched and when. This says:

(?<=#)   assert, without consuming it, that a `#` immediately precedes the next expression
[^#]+    one or more characters that are not `#`, and
(?=#)    assert, without consuming it, that a `#` immediately follows the previous expression

So this will match all the characters in between two #s.
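To see what "in between two #s" means in practice, here is a short sketch that runs the question's exact pattern in Python's re module (an assumption for illustration; the question itself uses C#) on the sample string from the other answer:

```python
import re

s = "and #one# and #two# and #three#four#"

# The question's original pattern: any run of non-# characters
# that has a "#" directly before it and directly after it.
between = re.findall(r"(?<=#)[^#]+(?=#)", s)

print(between)  # ['one', ' and ', 'two', ' and ', 'three', 'four']
```

Note that the " and " runs are matched too: because the hashes are not consumed, a closing # can also act as the opening # of the next match. That is presumably why the other answer narrows the middle part to [a-z]+ for its demonstration.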

Lookaheads and lookbehinds are very useful in many cases. Consider, for example, the rule "match all bs not followed by an a." Your first attempt might be something like b[^a], but that's not right: this will also match the bu in bus or the bo in boy, but you only wanted the b. And it won't match the b in cab, even though that's not followed by an a, because there are no more characters to match.

To do that correctly, you need a lookahead: b(?!a). This says "match a b but don't match an a afterwards, and don't make that part of the match". Thus it'll match just the b in bolo, which is what you want; likewise it'll match the b in cab.
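The difference between the two attempts can be checked directly. A minimal sketch in Python's re module (used here for illustration only), with "bat" added as a made-up word where nothing should match:

```python
import re

words = ["bus", "boy", "cab", "bat", "bolo"]

# b[^a] consumes the character after the b, and needs one to exist.
naive = {w: re.findall(r"b[^a]", w) for w in words}
# b(?!a) only checks that no "a" follows; it matches the b alone.
ahead = {w: re.findall(r"b(?!a)", w) for w in words}

for w in words:
    print(w, naive[w], ahead[w])
# bus ['bu'] ['b']
# boy ['bo'] ['b']
# cab [] ['b']
# bat [] []
# bolo ['bo'] ['b']
```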

You said: b(?!a) - "This says 'match a b followed by something that is not an a'" -- I think that's misleading, actually. It says "match a b, after which you can't match an a." In particular, b doesn't really have to be followed by anything; it most definitely doesn't have to be followed by [^a]. It can be at the end of the string. That's where b(?!a) and b(?=[^a]) differ.
You're right, that wasn't the best wording. I'll edit to clarify.
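The distinction raised in the comment above shows up at the end of a string. A small sketch in Python's re module (an assumption for illustration; the words are made up):

```python
import re

words = ["cab", "cabs", "cabal"]

# Negative lookahead: succeeds when no "a" follows, including at end of string.
negative = {w: re.findall(r"b(?!a)", w) for w in words}
# Positive lookahead on [^a]: requires some character to follow the b.
positive = {w: re.findall(r"b(?=[^a])", w) for w in words}

for w in words:
    print(w, negative[w], positive[w])
# cab ['b'] []
# cabs ['b'] ['b']
# cabal [] []
```

The two patterns agree except for "cab", where the b is the last character and only b(?!a) matches.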
