14

I have been looking for a way to match balanced parenthesis in a regex and found a way in Perl, that uses a recursive regular expression:

my $re; $re = qr{ \( (?: (?> [^()]+ ) # Non-parens without backtracking | (??{ $re }) # Group with matching parens )* \) }x; 

from the perl regular expression site .

Is there a way to do this in Ruby or a similar language?

UPDATE:

For those interested here are some interesting links:

Oniguruma manual - from Sawa's answer.

Pragmatic Programmers' Ruby 1.9 Regular Expressions Sample Chapter

2
  • Oniguruma documentation link is not available anymore. Better source is probably: github.com/kkos/oniguruma/blob/master/doc/RE Commented Apr 29, 2020 at 15:18
  • I'm constantly amazed by the fact that language designers fail to learn from Perl - an amazing language with a somewhat weird syntax. Commented Jun 28, 2022 at 18:07

2 Answers 2

21

Yes. With oniguruma regex engine, which is built in in Ruby 1.9, and is installable on Ruby 1.8, you can do that. You name a subregex with (?<name>...) or (?'name'...). Then you call a subregex with \g<name> or \g'name' within the same regex. So your regex translated to oniguruma regex will be:

re = %r{ (?<re> \( (?: (?> [^()]+ ) | \g<re> )* \) ) }x 

Also note that multi-byte string module in PHP >=5 uses oniguruma regex engine, so you will be able to do the same.

The manual for oniguruma is here.

Sign up to request clarification or add additional context in comments.

9 Comments

Then somebody better update the Wikipedia page. Still, I much prefer this simpler answer of using \((?:[^()]*+|(?0))*\).
@tchrist I just changed recursion, lookbehind, atomic groups, named capture, and unicode property support to yes for Ruby in the feature comparison tables, and removed the note that it was for Ruby 1.8.
@sawa: Good deed! But you’d best change Uɴɪᴄᴏᴅᴇ prop support to Some: Oniguruma supports only Gᴇɴᴇʀᴀʟ_Cᴀᴛᴇɢᴏʀʏ + a few Sᴄʀɪᴘᴛs. Uɴɪᴄᴏᴅᴇ prop support has 4⁺ levels: ⑴ all 11 props required by RL1.2; ⑵ compat props like on \w &ᶜ per RL1.2A; ⑶ named chars like \N{POUND SIGN} per RL2.5; and ⑷ full support of all props per RL2.7. Perl&ICU meet all ④; Ruby meets ⓪.
@sawa Gladly. Oniguruma has a many interesting features, but it also has major blunders, like dealing with physical serializations (encodings) at the regex level instead of always in virtual code points. This is a major headache, and a major violation: Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for useful Unicode support. See?
@sawa eventhough your answer was very helpful you should probably do this small correction: change (??\g<re>) to (\g<re>) as the (?? {}) is the perl dynamic recursion operator (or something like that).
|
1

I like the above solution but frequently one wishes to ignore escaped characters. Assuming that \ escapes the following character the following regex handles escaped characters as well.

ESC= /(?<![\\])(?>[\\](?:[\\][\\])*)/ UNESC= /(?:\A|(?<=[^\\]))(?:[\\][\\])*/ BALANCED_PARENS = /#{UNESC}( (?<bal>\( (?> (?> (?:#{ESC}\(|#{ESC}\)|[^()])+ ) |\g<bal> )* \)) ) /xm 

Given the limitations of negative lookbehind the part delimited by matching parens will be the first capture not the whole match (the whole match may contain leading escaped backslashes).

The reason for the complexity of ESC and UNESC is the assumption that a \\ is an escaped backslash. We only use the UNESC sequence before the initial paren match since any other escaped parenthesis will be matched inside the atomic group and never backtracked. Indeed, if we tried to use the UNESC prefix for either an interior or final paren match it would fail when [^()] inside the atomic group matched the leading \'s and refused to backtrack.

This regex will scan for the first paren that delimits a validly balanced parenthetical. Thus, given the string " ( ( stuff )" it will match "( stuff )". Frequently, the desired behavior is to locate the first (unescaped) parenthesis and either match the interior (if balanced) or fail to match. Unfortunately, atomic grouping won't stop the entire regex from being backed out of and a match attempted at a later point so we must anchor at the start of the string and only look at the 1st capture. The following regex makes this change:

BALANCED_PARENS = /\A(?:#{ESC}\(|#{ESC}\)|[^()])*+ (?<match>\( (?<bal> (?> (?> (?:#{ESC}\(|#{ESC}\)|[^()])+ ) |\(\g<bal> )* \)) ) /xm 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.