best / any way to count escapes in glibc regex(7) / GNU sed / grep / egrep Regular Expressions?

Question

Given these bash variable settings:

 $ declare -g bs=$'\\' bsbs=$'\\\\' q="'";

This regular expression will correctly match a single-quoted (') string, where the string may contain escaped single quotes:

 "[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"

 $ echo "[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"
 [\']((([^\\]?[^\'])|(\\\'))+)[\']

(the backslash in "[\']" is not strictly required but is included for clarity, and in case one is trying to encode this value in a single-quoted string).

The problem lies in how best to generalize this to any escaped quoting character, and in how to handle runs of multiple escape characters. ONLY if a run of n escape characters has ODD length ((n&1)==1) is the last escape ACTIVE, so that the character following it is INACTIVE (a literal part of the string). Otherwise, when the number of escapes is EVEN ((n&1)==0), the string contains HALF that number of escapes (n>>1) and the character following the run is ACTIVE (i.e. not escaped).
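The parity rule described above can be tested in a plain ERE without any counting, by matching backslashes two at a time. The following is a minimal sketch, not a complete quoted-string parser: it assumes backslash is the only escape character, and the function name `has_unescaped_quote` is made up for illustration.

```bash
bs='\' q="'"

# Hypothetical helper: exit status 0 if the line on stdin contains a quote
# that is preceded by an EVEN number of backslashes (zero included).
has_unescaped_quote() {
    # (^|[^\])  start of line, or one character that is not a backslash
    # (\\\\)*   zero or more PAIRS of backslashes, i.e. an even-length run
    # '         the quote character itself
    grep -Eq "(^|[^${bs}])(${bs}${bs}${bs}${bs})*${q}"
}

# Sample lines with 1, 2 and 3 backslashes in front of the quote.
for s in "a${bs}${q}b" "a${bs}${bs}${q}b" "a${bs}${bs}${bs}${q}b"; do
    if printf '%s\n' "$s" | has_unescaped_quote; then
        printf '%s\t-> quote is active\n' "$s"
    else
        printf '%s\t-> quote is escaped\n' "$s"
    fi
done
```

Because the run of backslashes has to start at the beginning of the line or right after a non-backslash character, the pair-wise group can only cover the whole run, so only even-length runs let the quote match. The pattern uses a capturing group, so it does not address the group-numbering concern on its own.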

Also, in sed and grep / egrep this has some issues :

o The nested sub-groups take up back-reference numbers of their own ('\1', '\2', ...), which shifts the number of every subsequent group, even when a sub-group does not take part in the match.

Ideally, I'd like to be able to express that regexp without any subgroups that can possibly affect subsequent sub-group numbers.

o It doesn't handle the number of escapes at all, and will fail to recognize that a quote that is preceded by an EVEN number of escapes is not escaped.

So my question is :

How best to solve these issues using only glibc-supported POSIX REs or grep / sed REs ?

ie. allow arbitrary length sequences of escapes of ODD (effective escape) or EVEN (ineffective escape) length to be recognized inside RegExps ?

I really think POSIX REs could benefit from special syntax to handle such questions, like:

 [\\]{1,}\#&1\?$A\:$B

Where '}\#&1' means the test 'x & 1' on the number of elements matched by the previous [\\]{...} group, and '\?x\:y' means "if the last test is true, substitute x, otherwise y in the RE".

Then one could actually easily express this and safely handle any number of escapes in RegExp parsed strings . How to do that without some new RE syntax like this ?

Very difficult, if not impossible / infeasible , with RegExp exprs alone.

Or am I wrong ?

Is there now an easy way to do arithmetic on run length of previous group in modern POSIX REs ?

Example 1 :

$ declare -g bs=$'\\' bsbs=$'\\\\' q="'";
$ echo "'a quot\\'d string' 42" | sed -r 's/'"[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
'a quot'd string : g

Example 2 :

$ echo "'a quot\\'d string' 42" | sed -r 's/'"[${q}]((([^${bsbs}]?[^${q}])|(${bsbs}${q}))+)[${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
a quot\'d string : g

Note how the ${bs}-es @rowboat mentioned are removed, and the result is still the same, as it would be using only $bs instead of $bsbs:

$ echo "'a quot\\'d string' 42" | sed -r 's/'"[${q}]((([^${bs}]?[^${q}])|(${bs}${q}))+)[${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
a quot\'d string : g
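In the commands above, the trailing "g" is what '\2' holds: POSIX numbers groups by the position of their opening parenthesis, so '\2' is the last repetition of the inner alternation and the number is in the fifth group. A minimal sketch of Example 2 with only the back-reference changed, assuming GNU sed (for -E and for '\t' in the replacement); the helper name `quoted_and_number` is made up for illustration.

```bash
bs='\' q="'"

# The pattern from Example 2, unchanged. Counting opening parentheses:
#   \1 = whole quoted body       \2 = last repetition of the alternation
#   \3 = first alternative       \4 = second alternative (escaped quote)
#   \5 = the trailing number
re="[${q}]((([^${bs}${bs}]?[^${q}])|(${bs}${bs}${q}))+)[${q}][[:space:]]([0-9]+)"

# Hypothetical helper; \t in the replacement is a GNU sed extension.
quoted_and_number() {
    sed -E "s/${re}/\1\t:\t\5/"
}

printf '%s\n' "${q}a quot${bs}${q}d string${q} 42" | quoted_and_number
```

With GNU sed this should print the quoted body and 42 separated by tab, colon, tab. ERE has no non-capturing groups, so the inner groups still occupy numbers 2 to 4; the sketch only shows how to count past them.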

Conclusion :

I am developing non-POSIX extensions to the "regex(7) - POSIX.2 regular expressions" library, provided by glibc , and to PCRE, and to PERL, and to cl-ppcre (SBCL's Common Lisp RE library) , and to Emacs's RE library for :

o defining a meaning for any named POSIX character class when suffixed by '-esc' or 'esc', eg. '[[:spaceesc:]]' or '[^[:space-esc:]]' or '[[:quote-esc:]]' , which means: A character that is ordinarily a member of character class 'X', is not a member of character class "${X}esc" (a synonym of "${X}-esc") IFF it is preceded by an ODD NUMBER of Escape Characters ('\':ASCII "\x5c").

 All character sequences that are subject to an :*esc: character class test will have legal '\\', '\xXX', '\0OOO', '\Uxxxxxx' or '\uXXXX' sequences replaced by, respectively: ASCII:\x5c; ASCII:\xXX (where XX are hex digits); ASCII:\OOO (where OOO are octal digits); the 24-bit Unicode value with code point xxxxxx (x: hex digit); and the 16-bit Unicode value with code point XXXX (X: hex digit).

 Also, '[[:quote:]]' and '[[:quoteesc:]]' classes must be supported that select characters (or non-escaped characters) with the Unicode 'Quotation Mark' binary attribute, and '[[:punct:]]' or '[[:punctesc:]]' would similarly apply to all (non-escaped) characters that have the Punctuation attribute. Perhaps similar '*cesc' or '*escc' character class suffixes could be provided that also support the C escapes: '\n','\r','\t','\v','\b','\l'... etc.

 If the /𝕦 (\U1D566) flag / UNICODE_NAMES flag is specified, then Unicode names can also be specified: \U1D566 == \U{MATHEMATICAL DOUBLE-STRUCK SMALL U} == \U{MATHEMATICAL_DOUBLE_STRUCK_SMALL_U}. There is no point in doing such an exercise unless UTF-8 names are also supported, IMHO. There is no point in just handling escaped spaces if full escape handling cannot also be enabled somehow or does not come along with it.

 Actually, the SBCL Pure-Common-Lisp implementation is about the speediest and nicest to use amongst ANY RE implementation I have used, and it already supports escaped classes & Unicode names. The LIBC regex and glob implementations are EXTRA-ORDINARILY SLOW! This slows down BASH, command-line tools, and all tools that use the POSIX RE library, such as Flex / Bison / Yacc, tremendously.
Perhaps either : A) Techniques used in SBCL PPCRE, libppcre, and PERL RE library can be ported to LIBC Regex library, in a new 'Fast Regex' replacement that can optionally replace old implementation on demand ; B) LIBC RE library can be made to transparently replace itself with libppcre or to a connection to a running SBCL instance with CL-PPCRE loaded, or to PERL with full PERL RE support, to support a UNIX CMSG Message API for Compiling, Match Against String, or Match against FD / stream API, and for Retrieval of Match Numbered N or with Name N, where N is in a set of Group Names or Numbers sent in advance as identifying parenthesis groups, and which can contain multiple dimensions (numbers in square brackets) to denote sub-expressions. Also I think that supporting a char-class LENGTH test, of the form: ']{x,}\#<test>\?<A>\:<B>' , meaning: " If number of characters in character-class just closed satisfies test <test> , then RE fragment <A> is parsed / takes effect, else RE <B> takes effect. ", would be very useful - for <test> in: {=X,>X,<X,<=X,>=X,&X,|X,^X,&~X,|~X} , where X is a decimal number. But first, I am working on the escaped char-classes support.

I can't understand why no-one seems to understand what I was suggesting with this question.

I hope the above makes things clearer.

I think I'm going to have to develop & submit some sort of patch to glibc Regexps to be able to have, as a GNU extension, some kind of Previous { .. } Group Number of Elements Referral ('}\#'), Test / Arithmetic, and Conditional Expressions inside REs - I could really use such a thing right now.
– JVD, Commented Dec 23, 2023 at 15:10
Please edit your question and add some example inputs and what you are expecting as output. Show us what kind of things you are trying to match. It is extremely hard to parse that regex without such examples.
– terdon ♦, Commented Dec 23, 2023 at 15:11
Also, you mention "POSIX RE" (which presumably means Basic Regular Expressions, BRE) but also mention GNU grep which supports Extended Regular Expressions (ERE) but also Perl Compatible Regular Expressions (PCRE). Please clarify which regex language you are actually interested in. I am pretty sure you know significantly more about this than I do, but I still think you could clarify a bit and it isn't just my own ignorance that is confusing me.
– terdon ♦, Commented Dec 23, 2023 at 15:13
I can't make much sense of what you're asking. Maybe you can start by clarifying what problem you're trying to solve.
– Stéphane Chazelas, Commented Dec 23, 2023 at 15:39

Stéphane Chazelas · Accepted Answer · 2023-12-23 17:14:18Z

If the point is to tokenise shell code the same way the shell language interpreter does, regexps will get you nowhere.

The zsh shell exposes its tokeniser with the z parameter expansion flag (or Z which can take options to process comments or change the treatment of newline) which you can combine with the Q parameter expansion to do quote removal.

For instance:

tokens() printf ' - « %s »\n' ${(Z[Cn])1}
tokens_dequoted() printf ' - « %s »\n' "${(@Q)${(Z[Cn])1}}"

The first reports all the shell tokens in its first argument, dropping comments; the second also removes one layer of quoting:

$ tokens ' foo "a b"; "" "$(echo "x y")" <<'"'qwe '\''qwe' #qwe"
 - « foo »
 - « "a b" »
 - « ; »
 - « "" »
 - « "$(echo "x y")" »
 - « << »
 - « 'qwe '\''qwe' »
$ tokens_dequoted ' foo "a b"; "" "$(echo "x y")" <<'"'qwe '\''qwe' #qwe"
 - « foo »
 - « a b »
 - « ; »
 - «  »
 - « $(echo "x y") »
 - « << »
 - « qwe 'qwe »

You can see that in order to do the same, you need to implement a full shell parser.

You can get somewhere with regexp if you reduce the scope: only consider '...', "...", and \ types of quotes (not $'...'), and only whitespace as delimiter and ignore expansions inside double quotes. In bash 4.4+, which contrary to zsh can't handle NUL bytes in its code anyway, and with GNU grep, you can do:

tokens() {
  local tokens
  readarray -td '' tokens < <(
    printf %s "$1" |
      grep -Ezo '(\\.|[^[:space:]\\"'\'']|'\''[^'\'']*'\''|"(\\.|[^\\"])*")+'
  )
  printf ' - « %s »\n' "${tokens[@]}"
}

Then:

$ tokens ' foo "a b"\c\\\" c\ d '" 'qwe'\''qwe'\"'\"qwe"
 - « foo »
 - « "a b"\c\\\" »
 - « c\ d »
 - « 'qwe'\''qwe'"'"qwe »
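The "(\\.|[^\\"])*" part of the grep pattern above is what copes with runs of backslashes: each backslash is consumed together with the character after it, so a quote only closes the string when an even number of backslashes precedes it. The same idea can be applied to the backslash-escaped single-quoted strings from the question. This is a minimal sketch, assuming GNU sed (for -E and for '\t' in the replacement) and input of the form quoted string, whitespace, number; the names `elem` and `split_quoted` are made up for illustration.

```bash
bs='\' q="'"

# One element of the quoted body:
#   \\.     a backslash followed by any single character (an escape pair), or
#   [^\']   a single character that is neither a backslash nor the quote
elem="(${bs}${bs}.|[^${bs}${q}])"

# Hypothetical helper: \1 = body, \2 = last element, \3 = the number.
# \t in the replacement is a GNU sed extension.
split_quoted() {
    sed -E "s/^${q}(${elem}*)${q}[[:space:]]+([0-9]+)\$/\1\t:\t\3/"
}

for line in \
    "${q}a quot${bs}${q}d string${q} 42" \
    "${q}ends in a backslash ${bs}${bs}${q} 42" \
    "${q}never closed${bs}${q} 42"
do
    printf '%s\n' "$line" | split_quoted
done
```

The first two lines are split into body and number; the third is left unchanged because its only candidate closing quote is preceded by a single backslash. The body is printed with its backslashes intact, so removing the escapes remains a separate step.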

To remove one layer of quoting from that, I'd resort to perl (or zsh which can do that out of the box as seen above).

I am NOT interested in getting programs other than sed / grep / POSIX extended REs to work. Incidentally:

$ tokens_dequoted() printf ' - « %s »\n' "${(@Q)${(Z[Cn])1}}"
$ IFS=$'\x09'; tokens_dequoted "'A \' B\'"$'\x09'42; unset IFS;
 - « A \ »
 - « B' »
 - « 42 »
$

which is not what we want. Yes, I know about readarray - it has the same problem using delim $'\x09' and quotes. I'd like to work on making an extension to glibc regexps and globs that handles these quoted-string cases, and also correctly allows the delimiter itself (e.g. $'\x09') to be escaped, as well as single quotes.
– JVD, Commented Dec 23, 2023 at 22:52

JVD · Accepted Answer · 2023-12-25 03:02:18Z

-3

A better answer: use pcre / PERL RegExps :

$ cat a.pcre
/^[']((?|(?:[^\\]?[^'\t\n\r])|(?:[\\]['\t\n\r]))*)[']\t((?|(?:[^\\]?[^\t])|(?:[\\][^\t\n\r]))+)/
'A quot\'d\ tab containing string' 42
$ pcretest < a.pcre
PCRE version 8.45 2021-06-15

re> data> 0: 'A quot'd\x09tab containing string'\x0942
 1: A quot'd\x09tab containing string
 2: 42
data>

You said "I am NOT interested in getting programs other than sed / grep / POSIX extended REs to work." and then posted this answer using a program other than sed / grep / POSIX extended REs.
– Ed Morton, Commented Dec 25, 2023 at 15:13
Yes, but this was just an illustrative answer, posted AFTER I had reached the final conclusion: "No, there is no built-in way in GLIBC POSIX or in GNU grep / sed REs to handle an even/odd number of escape characters properly".
– JVD, Commented Dec 26, 2023 at 18:02
After thinking about this for the last 3 days or so, I have decided to develop a patch for glibc, libc6, grep and sed. As a new GNU extension that can be enabled with the '/E' "Provide Escaped Char.Classes" modifier flag, it will provide, for every POSIX character class, an 'esc'-suffixed variant that does not treat characters preceded by an ODD number of escapes as members of the class, and maybe also the sort of "number of elements test and conditional expression" logic described above.
– JVD, Commented Dec 26, 2023 at 18:03
The PCRE answer is a non-answer to this specific question, but it illustrates the sort of solution that would be acceptable, if only we could do something as simple or simpler in POSIX / grep / sed RegExps, but we can't.
– JVD, Commented Dec 26, 2023 at 18:06
What's the point if you can already do whatever it is you want to do using perl? FWIW I still don't know what it is you're trying to do. If you provided sample input/output in your question that demonstrated all your requirements you may get an answer using existing POSIX tools and maybe it's not necessary to use a single regexp.
– Ed Morton, Commented Dec 26, 2023 at 18:07
