Regex differences between PCRE and PCRE2

Question

We're considering moving from PCRE to PCRE2 as our internal regex engine. Only the regex syntax itself is exposed to our users, so the library API differences are not an issue for us. However, we will have to document any change in behaviour.

Plenty of websites discuss the API differences, but I've not found any that list the practical differences in the regex syntax. I do know that [\w-_] means the same as [\w\-_] in PCRE but is invalid in PCRE2, and I suspect other differences exist.

In what ways do the regexes of PCRE2 differ from those of PCRE?

There's also: recursion is no longer atomic, and the behaviour of \K inside lookarounds.
– Casimir et Hippolyte, Commented Dec 8, 2021 at 10:07
This needs more attention now that GNU grep 3.8 (via -P) and PHP 7.3 have moved from PCRE to PCRE2.
– Adam Katz, Commented Sep 6, 2022 at 21:51
The best resources I can find to answer this are the PCRE2 changelog (10.00, the oldest entry, marks the initial PCRE2 release) and the pcre2compat man page, which describes the differences between PCRE2 and Perl's engine.
– Adam Katz, Commented Sep 6, 2022 at 22:05
@AdamKatz OK, then someone just needs to compile the differences from those sources and make it an answer here, and we will have a truly useful resource on SO.
– Adám, Commented Sep 7, 2022 at 4:02
Yes, that was my thinking as well, though it would be ideal to also find a document listing the differences between Perl and PCRE. There's still some teasing to be done to find the PCRE→PCRE2 differences since they'll be (theoretically) smaller than the Perl→PCRE2 differences.
– Adam Katz, Commented Sep 7, 2022 at 14:37

Adám · Accepted Answer · 2024-05-03 08:39:56Z

Compiled differences between PCRE v8.36 and PCRE2 10.39

I have compiled a list of changes that could cause issues when converting from PCRE to PCRE2. I have excluded the various overflows, underflows, segmentation violations, and assorted errors that a pattern could trigger in PCRE.

PCRE2 has a version-checking condition. You can check the version in applications by matching /(?(VERSION>=10)yes|no)/ against "yesno".
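If you want to try that check from Python, here is a rough sketch using ctypes against the 8-bit PCRE2 library. Assumptions: a shared library that `ctypes.util.find_library("pcre2-8")` can locate, and a helper name (`pcre2_version_branch`) that I made up for illustration. It returns None when the library is missing or rejects the pattern, so it is a probe rather than a proper binding.

```python
import ctypes
import ctypes.util


def pcre2_version_branch(pattern=b"(?(VERSION>=10)yes|no)", subject=b"yesno"):
    """Return the matched text, or None if libpcre2-8 cannot be used.

    Hypothetical helper: the name and the None fallback are illustrative only.
    """
    path = ctypes.util.find_library("pcre2-8")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        compile_ = lib.pcre2_compile_8
        create_md = lib.pcre2_match_data_create_from_pattern_8
        match = lib.pcre2_match_8
        get_ovector = lib.pcre2_get_ovector_pointer_8
        free_md = lib.pcre2_match_data_free_8
        free_code = lib.pcre2_code_free_8
    except (OSError, AttributeError):
        return None

    size_t, void_p = ctypes.c_size_t, ctypes.c_void_p
    # Declare signatures so pointers are not truncated on 64-bit platforms.
    compile_.restype = void_p
    compile_.argtypes = [ctypes.c_char_p, size_t, ctypes.c_uint32,
                         ctypes.POINTER(ctypes.c_int),
                         ctypes.POINTER(size_t), void_p]
    create_md.restype = void_p
    create_md.argtypes = [void_p, void_p]
    match.restype = ctypes.c_int
    match.argtypes = [void_p, ctypes.c_char_p, size_t, size_t,
                      ctypes.c_uint32, void_p, void_p]
    get_ovector.restype = ctypes.POINTER(size_t)
    get_ovector.argtypes = [void_p]
    free_md.restype = None
    free_md.argtypes = [void_p]
    free_code.restype = None
    free_code.argtypes = [void_p]

    errorcode, erroroffset = ctypes.c_int(0), size_t(0)
    code = compile_(pattern, len(pattern), 0,
                    ctypes.byref(errorcode), ctypes.byref(erroroffset), None)
    if not code:
        return None  # the library rejected the pattern
    match_data = create_md(code, None)
    try:
        rc = match(code, subject, len(subject), 0, 0, match_data, None)
        if rc < 0:
            return None  # no match, or a matching error
        ovector = get_ovector(match_data)
        # ovector[0] and ovector[1] delimit the overall match.
        return subject[ovector[0]:ovector[1]].decode()
    finally:
        free_md(match_data)
        free_code(code)


if __name__ == "__main__":
    print(pcre2_version_branch())
```

On a system with PCRE2 10.x installed I would expect the "yes" branch to be taken; anything else (including None) means the check could not confirm PCRE2 10 or later.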

Possible breaking changes:

Patterns such as /()a/ failed to set the "first character must be 'a'" information. For example /(?:(?=.)|(?<!x))a/.
Patterns such as /a\K.(?0)*/ matching against "abac" found "bac" when Perl and JIT found "c". The effect of \K was not being propagated correctly, although not all uses of \K produced incorrect results.
Use of (*ACCEPT) did not unset other group captures, leaving the ovector containing incorrect information. For example /(x)|((*ACCEPT))/ matched against "abcd".
In UTF mode, a caseless class with a mixed-case range, such as /(?i)[A-`]/, could leave ranges out of the class; in this case a-j was left out.
An assertion that is optimized to (*FAIL) was mishandled when used as a condition. For example (?(?!)a|b).
\8 and \9 now match Perl: each is either a back reference or the literal character "8" or "9".
An empty sub-pattern name such as (?'') is now reported as an error.
A repeating non-capturing group with conditional groups that matched empty strings failed to be identified as matching the empty string. For example /^(?:(?(1)x|)+)+$()/.
Various breaking changes for EBCDIC environments.
PCRE2 with Unicode support enabled did not report an error when using \p and \P in a class.
Possessively repeated conditional groups that may match empty strings were incorrectly compiled. For example /(?(R))*+/.
Sequences such as [[:punct:]b] disregarded the POSIX classes if a single character followed.
In UCP mode, [:punct:] matched characters in 128-255 that should not have matched.
Negated classes such as [^[:^ascii:]\d] and non-negated classes of [:^ascii:] or [:^xdigit:] incorrectly included all code points greater than 255.
Options set with (?imsxJU) at the start of a pattern are no longer transferred to the options returned by PCRE2_INFO_ALLOPTIONS.
A \Q\E in the middle of a quantifier, as in A+\Q\E+, is now ignored.
An empty \Q\E sequence may appear after a callout that precedes an assertion condition; it is ignored.
You may now use {0} after a group in a lookbehind assertion.
PCRE2 now matches Perl in treating (?(DEFINE)...) as a "define" group, even when a group named "DEFINE" exists.
Recursion condition tests must now refer to existing sub-patterns. For example (?(R2)...).
Use of conditional recursion test misbehaved if a group name began with "R". For example (?(R)...).
A hyphen immediately after a POSIX character class is handled differently from Perl. Perl allows it as a literal, but PCRE2 now generates an error.
Patterns like (?=.*X)X$ were incorrectly optimized as if they required an initial 'X' and a following 'X'.
Assertions starting with .* were incorrectly optimized as if they had to match at the start of the subject or after a newline. This is not always true, for example (?=.*[A-Z])(?=.{8,16})(?!.*[\s]).
If the only branch in a conditional sub-pattern was anchored, the whole sub-pattern was incorrectly treated as anchored. For example /(?(1)^())b/ or /(?(?=^))b/.
A pattern starting with a subroutine call that had a quantifier minimum of zero incorrectly set "match must start with this character". For example, /(?&xxx)*ABC(?<xxx>XYZ)/ would expect 'A' to be the first character.
Upstream News.
PHP 7.3 PCRE Migration notes.
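
For the hyphen case (including the [\w-_] example from the question), a rough pre-migration check can flag candidate patterns before they reach the PCRE2 compiler. This is a heuristic sketch in Python, not a PCRE2 parser: the function name `find_class_hyphens` is hypothetical, and it ignores \Q...\E, extended mode, and \p{...}, so treat hits as things to review rather than as definite errors.

```python
import re

# Items that cannot start a range inside [...]: shorthand escapes such as \w,
# or POSIX classes such as [:punct:].
_ITEM = re.compile(r"\\[dDhHsSvVwW]|\[:\^?\w+:\]")


def find_class_hyphens(pattern):
    """Return offsets of hyphens that directly follow a \\w-style escape or a
    POSIX class inside a bracket expression and are not the last character of
    that bracket expression."""
    hits = []
    i, n = 0, len(pattern)
    in_class = False
    while i < n:
        ch = pattern[i]
        if not in_class:
            if ch == "\\":
                i += 2  # skip the escaped character
                continue
            if ch == "[":
                in_class = True
                i += 1
                if i < n and pattern[i] == "^":
                    i += 1  # negation marker
                if i < n and pattern[i] == "]":
                    i += 1  # a leading ] is a literal
                continue
            i += 1
            continue
        item = _ITEM.match(pattern, i)
        if item:
            i = item.end()
            if i + 1 < n and pattern[i] == "-" and pattern[i + 1] != "]":
                hits.append(i)
            continue
        if ch == "\\":
            i += 2  # skip other escapes, e.g. \- or \]
            continue
        if ch == "]":
            in_class = False
        i += 1
    return hits


if __name__ == "__main__":
    for p in (r"[\w-_]", r"[\w\-_]", r"[\w-]", r"[[:punct:]-z]"):
        print(p, find_class_hyphens(p))
```

Escaping the hyphen, or moving it to the end of the class, makes the pattern unambiguous in both engines, which is why only the unescaped, non-final hyphen is reported.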
