-1

How can I rewrite my anchor to be more general and correct in all situations? I have understood that using \b as an anchor is not optimal because it is implementation-dependent.

My goal is to match some type of word in a text file. For my question, the word to match is not of importance.

Assume \b is the word boundary anchor and a word character is [a-zA-Z0-9_]. I constructed two anchors, one for the left and one for the right side of the regex. Notice how I handle the underscore, as I don't want it to be a word character when I read my text file.

  • (?<=\b|_) positive lookbehind
  • (?=\b|_) positive lookahead

What would be the equivalent anchor constructs using the more general caret ^ and dollar sign $ to get the same effect?

2
  • 1
    I'll bet that any regexp engine that has lookarounds also has \b. Lookarounds are newer and not available as widely. Commented Jan 17, 2024 at 16:51
  • 1
    All regex symbols are implementation-dependent, including *, |, \ ... Commented Jan 17, 2024 at 18:44

2 Answers

2

[The OP did not specify which regex language they are using. This answer uses Perl's regex language, but the final solution should be easy to translate into other languages. Also, I use whitespace as if the x flag were provided, but that is also easily adjusted.]


With the help of a comment made by the OP, the following is my understanding of the question:

I have something like \b\w+\b, but I want to exclude _ from the definition of a word.

You can use the following:

(?<! [^\W_] ) [^\W_]+ (?! [^\W_] ) 

An explanation follows.


\b is equivalent to (?: (?<!\w)(?=\w) | (?<=\w)(?!\w) ).

\b \w+ \b is therefore equivalent to (?<!\w) \w+ (?!\w) (after simplification).
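
As a quick sanity check of the two equivalences above, here is a minimal sketch in Python's re module (an assumption on my part, since the answer itself is written for Perl). The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "foo_bar, baz42 (qux)"

# \b written out as "a word character on exactly one side".
expanded_boundary = r"(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))"

# Positions where each zero-width pattern matches.
boundary_positions = [m.start() for m in re.finditer(r"\b", text)]
expanded_positions = [m.start() for m in re.finditer(expanded_boundary, text)]

# Whole words found with \b versus with the simplified lookarounds.
with_boundaries = re.findall(r"\b\w+\b", text)
with_lookarounds = re.findall(r"(?<!\w)\w+(?!\w)", text)

print(boundary_positions == expanded_positions)
print(with_boundaries, with_lookarounds)
```

On this input both word lists should be identical, and _ still counts as a word character at this stage.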

So now we just need a pattern that matches everything \w matches except _. There are a few approaches that can be taken.

  • Set difference: (?[ \w - [_] ])
  • Look-ahead: (?!_)\w
  • Look-behind: \w(?<!_)
  • Double negation: [^\W_]

Even though it's the least readable, I'm going to use the last one since it's the best supported.
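
To illustrate that these approaches agree, here is a rough sketch in Python (again an assumption; the answer targets Perl). As far as I know, the set-difference form (?[ ... ]) is specific to Perl and is not accepted by Python's re module, so only the other three are compared. The sample string is made up.

```python
import re

# Three ways to write "a word character other than underscore".
candidates = {
    "look-ahead": r"(?!_)\w",
    "look-behind": r"\w(?<!_)",
    "double negation": r"[^\W_]",
}

# Hypothetical sample mixing letters, digits, underscores and punctuation.
sample = "a_B9 -_x"

results = {name: re.findall(pattern, sample) for name, pattern in candidates.items()}

for name, found in results.items():
    print(name, found)
```

Each pattern should pick out the same characters, skipping every underscore.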

We now have

(?<! [^\W_] ) [^\W_]+ (?! [^\W_] ) 
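
A minimal sketch of the final pattern in use, written for Python's re module rather than Perl (my assumption), with the whitespace removed so that no x flag is needed. The sample text and the name WORD are hypothetical.

```python
import re

# The final pattern, compacted: a run of non-underscore word characters
# that is neither preceded nor followed by another such character.
WORD = re.compile(r"(?<![^\W_])[^\W_]+(?![^\W_])")

# Hypothetical sample text.
text = "foo_bar __init__ sUs_42"

without_underscore = WORD.findall(text)
with_underscore = re.findall(r"\b\w+\b", text)

print(without_underscore)  # ['foo', 'bar', 'init', 'sUs', '42']
print(with_underscore)     # ['foo_bar', '__init__', 'sUs_42']
```

The underscore now acts as a separator, whereas \b\w+\b keeps it inside the word.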

4 Comments

As I said in the post, the word is not important for my question. For example, I have this regex, (?<=\b|_)sUs[^\W_]*(?=\b|_), which matches a word with the prefix sUs. The question was: can I replace the anchor expressions (?<=\b|_) and (?=\b|_) with some other expression that uses the ^ and $ symbols and does not use \b, but achieves exactly the same result?
Not using \b for no reason is stupid. (Implementation-dependent is not a reason, since \w is just as implementation-dependent.) However, (?<=\b|_) is a variable-length lookbehind, and that's not well supported. You could use (?<! [^\W_] ) sUs [^\W_]*. Answer updated.
I tested your updated answer against my text file, and it appears to be equivalent to my initial expression, and it indeed refrains from using \b. The idea about ^ and $ was a misunderstanding on the OP's (my) part. I guess my whole text file is treated as one large string, so ^ matches its beginning and $ its end.
If you think ^ and $ aren't implementation-specific, you're wrong. They vary more than \b.
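
The equivalence reported in the comments above can be spot-checked with a small Python sketch. Two assumptions are mine: Python's re module is used instead of Perl, and because re only accepts fixed-width look-behind, the left anchor (?<=\b|_) is rewritten as (?:\b|(?<=_)). The sample text is made up.

```python
import re

# OP's pattern, with the variable-length look-behind (?<=\b|_) rewritten
# as an alternation of \b and a fixed-width look-behind (an assumption
# made here to suit Python's re module).
original = re.compile(r"(?:\b|(?<=_))sUs[^\W_]*(?=\b|_)")

# Replacement suggested in the comments, without \b.
replacement = re.compile(r"(?<![^\W_])sUs[^\W_]*")

# Hypothetical sample: prefix at the start, inside a word, after _, at the end.
text = "sUsfoo_bar xsUs _sUs9 sUs"

original_matches = original.findall(text)
replacement_matches = replacement.findall(text)

print(original_matches)
print(replacement_matches)
```

Both patterns should skip the sUs inside xsUs and accept the one that follows an underscore.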
1

You can match a non-word character or the beginning/end anchor:

(?:^|\W)(\w+)(?:\W|$) 

If you want to select something other than a single word, replace \w+ with the pattern you're looking for. Capture group 1 will contain what you're looking for.
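
A minimal sketch of this pattern in Python's re module (an assumption; the answer does not name an engine), with made-up input strings. It shows capture group 1 holding the word, and also what happens when the pattern is used to collect every word, because the surrounding \W characters are consumed by the match.

```python
import re

EDGE = re.compile(r"(?:^|\W)(\w+)(?:\W|$)")

# Capture group 1 holds the word itself, without the surrounding edges.
match = EDGE.search("  hello, world")
first_word = match.group(1)
print(first_word)  # hello

# The space after "a" is consumed by the first match, so "b" has no
# leading \W or ^ left to match against and is skipped.
all_words = EDGE.findall("a b c")
print(all_words)  # ['a', 'c']
```

The skipped "b" is the behaviour discussed in the comments below; zero-width lookarounds avoid it.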

4 Comments

Looks OK on the surface, but it's impractical because it doesn't allow you to grab all the words, and when's the last time you wanted one arbitrary word from a string? This is easily fixed by replacing (^|\W)(\w+)(\W|$) with (\w+). There's also the issue that this answer doesn't identify any of a slew of other problems \w+ has at identifying words.
I've updated the answer to say that the capture group can be practically any pattern.
I meant you couldn't use the pattern to find all words (or whatever's in the middle) because you're consuming both edges instead of using (zero-width) anchors like in the OP.
Of course. Avoiding that requires using lookarounds. But as I mentioned above, I doubt there are any engines without word boundary that have lookarounds.
