How can I match a quote-delimited string with a regex?

Question

If I'm trying to match a quote-delimited string with a regex, which of the following is "better" (where "better" means both more efficient and less likely to do something unexpected):

/"[^"]+"/ # match quote, then everything that's not a quote, then a quote

or

/".+?"/ # match quote, then *anything* (non-greedy), then a quote

Assume for this question that empty strings (i.e. "") are not an issue. It seems to me (no regex newbie, but certainly no expert) that these will be equivalent.

Update: Upon reflection, I think changing the + characters to * will handle empty strings correctly anyway.

Etienne Neveu · Accepted Answer · 2016-06-24 09:51:53Z

You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:

"[^"]*"!

or:

".*?"!

The difference appears when you have the subject:

"one" "two"!

The first regex matches:

"two"!

while the second regex matches:

"one" "two"!

Always be as specific as you can. Use the negated character class when you can.

Another difference is that [^"]* can span across lines, while .* doesn't unless you use single line mode. [^"\n]* excludes the line breaks too.

As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which then backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.

This counterexample is exactly what I was looking for when I wrote the question. Thanks Jan

nico · Accepted Answer · 2013-06-21 09:17:28Z

More complicated, but it handles escaped quotes and also escaped backslashes (escaped backslashes followed by a quote is not a problem)

/(["'])((\\{2})*|(.*?[^\\](\\{2})*))\1/

Examples:
"hello\"world" matches "hello\"world"
"hello\\"world" matches "hello\\"

how would you then remove / fix the \\? or any of the backslash protected characters from the resultant string?

Harold Bamford · Accepted Answer · 2008-12-17 18:37:20Z

I would suggest:

([\"'])(?:\\\1|.)*?\1

But only because it handles escaped quote chars and allows both the ' and " to be the quote char. I would also suggest looking at this article that goes into this problem in depth:

http://blog.stevenlevithan.com/archives/match-quoted-string

However, unless you have a serious performance issue or cannot be sure of embedded quotes, go with the simpler and more readable:

/".*?"/

I must admit that non-greedy patterns are not the basic Unix-style 'ed' regular expression, but they are getting pretty common. I still am not used to group operators like (?:stuff).

What's the "(:?stuff)" you mention supposed to mean? I know "(?:stuff)", but not the other.
In Perl, these are called "Extended Patterns". Check out perldoc.perl.org/perlre.html under "Extended Patterns" (about 1/3 of the way down). In this case, it is just like (stuff) except there is no capture ($1 or \1) involved.
It was. I just now fixed it. I didn't understand the original comment (I thought there was some obscure escaping bug in SO). Sorry for the confusion.
No worries. ;-) Essentially, you can delete this whole dialog as it has no substance anymore.

Leon Timmermans · Accepted Answer · 2008-12-17 16:45:48Z

6

I'd say the second one is better, because it fails faster when the terminating " is missing. The first one will backtrack over the string, a potentially expensive operation. An alternative regexp if you are using perl 5.10 would be /"[^"]++"/. It conveys the same meaning as version 1 does, but is as fast as version two.

edited Dec 17, 2008 at 16:45

answered Dec 17, 2008 at 16:38

Leon Timmermans

30.3k2 gold badges65 silver badges111 bronze badges

8 Comments

innaM Over a year ago

Why is the second one able to fail faster?

Leon Timmermans Over a year ago

I added the explanation a few seconds before you asked. It the second one doesn't backtrack.

Leon Timmermans Over a year ago

If you want to get really fancy and support escaped quotes in the regexp, you could do this: /"(?:[^"]|(?<!\)(?>\\\)*\\")++"/. I've explained that one at stackoverflow.com/questions/56554/…

Tomalak Over a year ago

Good point regarding the backtracking. For my "long running loop" example, this would surely cause more hurt than the lazy quantifier. Also good that you mentioned possessive quantifiers. +1

Jan Goyvaerts Over a year ago

Leon is wrong about the backtracking. .*? backtracks for each and every character in the string, when the closing " is present. When the closing " is missing, both regexes backtrack.

|

PEZ · Accepted Answer · 2008-12-17 16:27:23Z

4

I'd go for number two since it's much easier to read. But I'd still like to match empty strings so I would use:

/".*?"/

answered Dec 17, 2008 at 16:27

PEZ

17k7 gold badges47 silver badges66 bronze badges

1 Comment

slf Over a year ago

@Graeme Perrow: .*? is the defacto standard of non greedy matching

Tomalak · Accepted Answer · 2008-12-17 17:27:17Z

From a performance perspective (extremely heavy, long-running loop over long strings), I could imagine that

"[^"]*"

is faster than

".*?"

because the latter would do an additional check for each step: peeking at the next character. The former would be able to mindlessly roll over the string.

As I said, in real-world scenarios this would hardly be noticeable. Therefore I would go with number two (if my current regex flavor supports it, that is) because it is much more readable. Otherwise with number one, of course.

Trochee · Accepted Answer · 2008-12-29 18:29:22Z

Using the negated character class prevents matching when the boundary character (doublequotes, in your example) is present elsewhere in the input.

Your example #1:

/"[^"]+"/ # match quote, then everything that's not a quote, then a quote

matches only the smallest pair of matched quotes -- excellent, and most of the time that's all you'll need. However, if you have nested quotes, and you're interested in the largest pair of matched quotes (or in all the matched quotes), you're in a much more complicated situation.

Luckily Damian Conway is ready with the rescue: Text::Balanced is there for you, if you find that there are multiple matched quote marks. It also has the virtue of matching other paired punctuation, e.g. parentheses.

Douglas Leeder · Accepted Answer · 2008-12-17 16:32:02Z

I prefer the first regex, but it's certainly a matter of taste.

The first one might be more efficient?

Search for double-quote add double-quote to group for each char: if double-quote: break add to group add double-quote to group

Vs something a bit more complicated involving back-tracking?

Paul Tomblin · Accepted Answer · 2008-12-17 16:40:27Z

0

Considering that I didn't even know about the "*?" thing until today, and I've been using regular expressions for 20+ years, I'd vote in favour of the first. It certainly makes it clear what you're trying to do - you're trying to match a string that doesn't include quotes.

answered Dec 17, 2008 at 16:40

Paul Tomblin

184k59 gold badges324 silver badges412 bronze badges

4 Comments

Graeme Perrow Over a year ago

I've been using regular expressions for many years too, and I knew there was a way to do things in a non-greedy way, but didn't realize how easy it was until today. That's why I asked - I find it easier to read (now that I know what it means), so is there a reason I shouldn't use it?

Tomalak Over a year ago

The only limitation is in your regex engine. There's the remote possibility that you are facing one that does not support non-greedy quantifiers. Modern ones generally do.

PEZ Over a year ago

Some regex engines supports changing the default greediness (making .* be nongreedy and .*? be greedy). In PHP you can use the U regex modifier for this. I have had use of that when scraping html.

Jan Goyvaerts Over a year ago

PEZ: I strongly recomment you use /.*?/ instead of /.*/U Most people will recognize .*? as a lazy quantifier, or at least as something they don't know. A /U tucked away at the end of the regex is easily missed. It's about keeping your code human-readable.

Collectives™ on Stack Overflow

How can I match a quote-delimited string with a regex?

9 Answers 9

1 Comment

1 Comment

4 Comments

8 Comments

1 Comment

Comments

Comments

Comments

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

9 Answers 9

1 Comment

1 Comment

4 Comments

8 Comments

1 Comment

Comments

Comments

Comments

4 Comments

Linked

Related