C++11 has 6 different regular expression grammars you can use. In my case, I am interacting with a component that is using modified ECMAScript regular expressions.

I need to create a regular expression "match a string starting with X", where X is a string literal I have.

So the regular expression I want is roughly ^X.*. However, X could itself contain regular expression special characters, and I want those to be matched literally.

Which means I really want ^ escaped(X) .*.
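
To make the problem concrete, here is a minimal sketch of what goes wrong when X is pasted into the pattern unescaped. The function name naive_starts_with is made up for illustration, and the default ECMAScript grammar of std::regex is assumed:

```cpp
#include <regex>
#include <string>

// Naive construction: the literal is concatenated into the pattern as-is,
// so characters such as '.', '*' and '+' keep their regex meaning.
inline bool naive_starts_with(const std::string& subject, const std::string& literal)
{
    const std::regex re("^" + literal + ".*");  // ECMAScript grammar by default
    return std::regex_match(subject, re);
}
```

With the literal "a.b", this accepts "axb-rest" as well as "a.b-rest", because the dot matches any character. With the literal "a+b", it rejects "a+b-rest", because the plus sign is read as a quantifier. A literal such as "(" would not even compile as a pattern and would throw std::regex_error.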

Now, I can read over the ECMAScript documentation, find all of the characters which have a special meaning, write a function that escapes them, and be done. But this seems inelegant, inefficient, and error prone -- especially if I want to support all 6 kinds of regular expressions that C++ supports currently, let alone in the future.

Is there a simple way in the standard to escape a literal string to embed in a C++ regular expression, possibly as a function of the regular expression grammar, or do I have to roll my own?

Here is a similar question using the boost library, where the list of escapes is hard-coded, and then a regular expression is generated that backslashes them. Am I reduced to adapting that answer for use in std?

  • The answer in How to escape a string for use in Boost Regex is actually what you need. Commented Aug 27, 2015 at 14:08
  • Why do you need to escape it? If X is a string, can't you create your regex just by concatenating, like "^" + X + ".*"? Commented Aug 27, 2015 at 14:08
  • @stribizhev which means writing a custom version of it for each of the 6 regular expression formats and any new format that comes along. Commented Aug 27, 2015 at 14:08
  • @GlasG Because X could be "this.is.a.*******.problem" -- and the . and * in that string should not be interpreted as regular expression commands. Commented Aug 27, 2015 at 14:09
  • Yes, exactly. Boost regex won't do it for you. In .NET (e.g. C#, VB.NET), it is clear: use Regex.Escape. In C++, there is no such function. Commented Aug 27, 2015 at 14:09

2 Answers

If you have to write your own, there are only two kinds you need to know: BRE and the rest.

The patterns below should work. Use an ECMAScript-type regex to operate on the input string.

The regexes below are built from the special characters listed here:
What special characters must be escaped in regular expressions?
under that answer's section Legacy RegEx Flavors (BRE/ERE).

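Both use the same replacement: "\\$1" with std::regex_replace, where a backslash in the replacement string is literal under the default ECMAScript format rules. (In an engine whose replacement syntax treats backslash as an escape character, it has to be doubled: "\\\\$1".)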

For BRE input:

 # "(\\\\[+?(){}|]|[.^$*\\[\\]\\\\-])"

 (                       # (1 start)
      \\ [+?(){}|]       # not sure this is needed (it's not needed)
   |  [.^$*\[\]\\-]
 )                       # (1 end)

For ERE or ECMAScript input:

 # "([.^$*+?()\\[\\]{}\\\\|-])"

 ( [.^$*+?()\[\]{}\\|-] )     # (1)
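
As a rough sketch of how the ERE/ECMAScript pattern could be applied with std::regex, something like the following might do. The function name escape_ecmascript is hypothetical, and raw string literals are used so that the pattern reads exactly as the regex engine sees it. Note that std::regex_replace copies a backslash in the replacement string literally, so a single backslash before $1 is enough here:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escapes every character that is special
// in ERE/ECMAScript, using the character class from the answer above.
inline std::string escape_ecmascript(const std::string& literal)
{
    // Group 1 captures the single special character that was matched.
    static const std::regex special(R"(([.^$*+?()\[\]{}\\|-]))");
    // The replacement is a literal backslash followed by group 1.
    return std::regex_replace(literal, special, R"(\$1)");
}
```

The escaped text can then be embedded in a larger pattern, for example std::regex("^" + escape_ecmascript(X)).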

BRE input example:

Before -

+_)(*&^%$#@!asdfasfd hello + ? ( ) { } | \+ \? \( \) \{ \} \| \\+ \\? \\( \\) \\{ \\} \\| }{":][';/.,<>? here is 

After -

+_)(\*&\^%\$#@!asdfasfd hello + ? ( ) { } | \\+ \\? \\( \\) \\{ \\} \\| \\\\+ \\\\? \\\\( \\\\) \\\\{ \\\\} \\\\| }{":\]\[';/\.,<>? here is 

(answering quite a while later, so probably OP has worked something out, but still).

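A preliminary comment: The regular expression you'll want, in ECMAScript (and many other) syntaxes, is ^X, and you don't need the extra .* afterwards (as long as it is used for searching, e.g. with std::regex_search, rather than for a full match as with std::regex_match).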

As for the approach to this task: You're asking for a general solution for all regex syntax options. Well, YAGNI - You ain't gonna need it. Unless you're writing a general-purpose library supposed to support all C++ regexp syntaxes, don't try to solve the whole world's problems yourself and right away. This is further emphasized by the fact that the set of syntax options is not frozen: since you wrote your question, C++17 added std::regex::multiline (a flag rather than a new grammar; the number of grammars is still six). See here.

So I suggest you write something that is potentially extensible to other syntax options, but only actually works - for now - with the syntax option(s) you need. e.g.:

template <std::regex_constants::syntax_option_type SyntaxOption>
std::string escape_for_regex(const std::string_view sv);

or perhaps

template <std::regex_constants::syntax_option_type SyntaxOption>
std::string_view escape_for_regex(
    const std::string_view source,
    std::string_view destination
);

in which the returned string_view indicates how much of the destination you're actually using. One can bike-shed about the signature some more (e.g. perhaps use iterators? ranges?)

and you'll specialize this for std::regex::ECMAScript. The implementation is provided in this SO question:

Is there a RegExp.escape function in JavaScript?

with the answer being that there isn't, but you could add it like so (in JavaScript, mind you):

RegExp.escape = function(s) {
    return s.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
};

moving that to C++, and with our first option for the function signature, this becomes:

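template <>
std::string escape_for_regex<std::regex::ECMAScript>(const std::string_view sv)
{
    // Characters that are special in the ECMAScript grammar.
    const std::regex to_escape("[-/\\\\^$*+?.()|[\\]{}]");
    // "$&" is the whole match, as in the JavaScript original; the pattern
    // has no capture group, so "$1" would expand to nothing.
    const std::string escaped("\\$&");
    const std::string s{sv};
    return std::regex_replace(s, to_escape, escaped);
}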

Caveat: Haven't properly tested this. I also don't like the extra string construction, so probably another one of the regex_replace variants might be usable.
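
One way the extra string construction might be avoided is the iterator-based overload of std::regex_replace, which writes straight into an output iterator. The following is only a sketch under that assumption: the name escape_ecmascript_sv is made up for illustration, C++17 is assumed for std::string_view, and the character class is the one from the JavaScript snippet above:

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <string_view>

// Hypothetical variant: reads the string_view through a pointer range and
// appends the escaped text to the result, without an intermediate std::string.
inline std::string escape_ecmascript_sv(std::string_view sv)
{
    static const std::regex to_escape(R"([-/\\^$*+?.()|[\]{}])");
    std::string out;
    out.reserve(sv.size() * 2);  // worst case: every character gets escaped
    // "$&" expands to the whole match; the backslash before it is copied as-is.
    std::regex_replace(std::back_inserter(out),
                       sv.data(), sv.data() + sv.size(),
                       to_escape, R"(\$&)");
    return out;
}
```

The result can be used the same way as before, e.g. std::regex("^" + escape_ecmascript_sv(X)) for the prefix match from the question.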
