Given these bash variable settings:

    $ declare -g bs=$'\\' bsbs=$'\\\\' q="'"

this regular expression will match a sequence of single-quoted (') text, where such text may contain escaped single quotes:

    "[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"

    $ echo "[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"
    [\']((([^\\]?[^\'])|(\\\'))+)[\']

(The backslash in "[\']" is not strictly required but is included for clarity, and in case one is trying to encode this value in a single-quoted string. Note, though, that inside a POSIX bracket expression a backslash is an ordinary character, so "[\']" also matches a backslash itself.)
The problem lies in how best to generalize this for any escaped quoting character, and how to handle runs of multiple escape characters. ONLY if the run of escape characters is of ODD length ((n&1)==1, counted in bytes) is the last escape ACTIVE, so that the character after the run is INACTIVE (part of the string). Otherwise (the number of escapes is EVEN, (n&1)==0), the string contains HALF that number of literal escapes (n>>1), and the character after the run is ACTIVE (i.e. not escaped).
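As a minimal sketch of the odd/even rule above, the run can be counted directly in shell rather than in a regexp. The function names `trailing_escapes`, `next_char_is_escaped` and `literal_escapes` are hypothetical (not part of bash or any library), and the argument is assumed to be the text that precedes the character being tested:

```shell
# Hypothetical helpers illustrating the odd/even rule; not part of bash.

# trailing_escapes PREFIX
# Print the length n of the run of backslashes that ends PREFIX.
# The body runs in a subshell, so s and n stay local.
trailing_escapes() (
    s=$1 n=0
    while :; do
        case $s in
            *\\) s=${s%?}; n=$((n + 1)) ;;   # strip one trailing backslash
            *)   break ;;
        esac
    done
    printf '%s\n' "$n"
)

# next_char_is_escaped PREFIX
# Succeed when the character following PREFIX would be escaped,
# i.e. when the run has odd length ((n & 1) == 1).
next_char_is_escaped() {
    [ $(( $(trailing_escapes "$1") & 1 )) -eq 1 ]
}

# literal_escapes PREFIX
# Print how many literal backslashes the run contributes (n >> 1).
literal_escapes() {
    printf '%s\n' $(( $(trailing_escapes "$1") >> 1 ))
}
```

For instance, `next_char_is_escaped 'quot\'` succeeds, while `next_char_is_escaped 'quot\\'` fails and `literal_escapes 'quot\\'` prints 1.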
Also, in sed and grep / egrep this has some issues:
o The sub-groups used for matching occupy group numbers of their own ('\1' onward), which increases the numbers of all subsequent groups, even when a sub-group does not match.
  - Ideally, I'd like to be able to express that regexp without any sub-groups that can possibly affect subsequent sub-group numbers.
o It doesn't handle the number of escapes at all, and will fail to recognize that a quote that is preceded by an EVEN number of escapes is not escaped.
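On the first issue: in POSIX REs a group's number is fixed by the position of its opening parenthesis, whether or not the group takes part in the match. The sketch below uses a literal form of the expression above, written without backslashes before the quote character, to show the trailing number landing in group 5. The names `re` and `body_and_number` are hypothetical, and a sed that accepts -E (GNU or BSD) is assumed:

```shell
# Groups, numbered by opening parenthesis:
#   1: the whole body between the quotes
#   2: one repetition of the body (holds the last iteration)
#   3: the "ordinary characters" alternative
#   4: the "escaped quote" alternative
#   5: the trailing number
# After shell quoting, re is:  '((([^\\]?[^'])|(\\'))+)'[[:space:]]([0-9]+)
re="'((([^\\\\]?[^'])|(\\\\'))+)'[[:space:]]([0-9]+)"

# Read lines on stdin and rewrite each match as "<body> : <number>".
body_and_number() {
    sed -E "s/$re/\\1 : \\5/"
}

printf '%s\n' "'a quot\\'d string' 42" | body_and_number
```

Referring to \2 instead of \5 in the replacement yields the text last captured by the repeated inner group, not the number.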
So my question is:
How best to solve these issues using only glibc-supported POSIX REs or grep / sed REs?
I.e., how can arbitrary-length sequences of escapes of ODD (effective escape) or EVEN (ineffective escape) length be recognized inside regexps?
I really think POSIX REs could benefit from special syntax to handle such questions, like:

    [\\]{1,}\#&1\?$A\:$B

where '}\#&1' means the test 'x & 1' on the number of elements matched by the preceding [\\]{...} group, and '\?x\:y' means "if the last test is true, substitute x, otherwise y, in the RE".
Then one could easily express this and safely handle any number of escapes in regexp-parsed strings. How can that be done without some new RE syntax like this?
It seems very difficult, if not impossible or infeasible, with regular expressions alone.
Or am I wrong?
Is there now an easy way to do arithmetic on the run length of the previous group in modern POSIX REs?
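One idiom often used for this in plain POSIX EREs, offered here as a sketch rather than a definitive answer, avoids counting altogether: let every backslash consume the character after it, so a run of escapes is eaten in pairs and a quote can only close the string after an even-length run. The names `quoted_re` and `first_quoted` are hypothetical, and grep's -o option is assumed (GNU and BSD grep provide it; POSIX does not specify it):

```shell
q="'"
# Opening quote, then any mix of
#   [^'\\]  one character that is neither a quote nor a backslash, or
#   \\.     a backslash together with whatever character follows it,
# then the closing quote. Only one group is needed, and its number
# is fixed, so later groups keep fixed numbers too.
# After shell quoting, quoted_re is:  '([^'\\]|\\.)*'
quoted_re="${q}([^${q}\\\\]|\\\\.)*${q}"

# Print the first quoted string found on stdin.
first_quoted() {
    grep -oE "$quoted_re" | head -n 1
}

printf '%s\n' "'a quot\\'d string' 42" | first_quoted
```

With this pattern, a quote after two backslashes closes the string, while a quote after one or three backslashes stays part of it.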
Example 1:

    $ declare -g bs=$'\\' bsbs=$'\\\\' q="'"
    $ echo "'a quot\\'d string' 42" | sed -r 's/'"[${bs}${q}]((([^${bsbs}]?[^${bs}${q}])|(${bsbs}${bs}${q}))+)[${bs}${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
    'a quot'd string : g

Example 2:

    $ echo "'a quot\\'d string' 42" | sed -r 's/'"[${q}]((([^${bsbs}]?[^${q}])|(${bsbs}${q}))+)[${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
    a quot\'d string : g

Note how the ${bs}-es @rowboat mentioned are removed, and still the same result, as would using only $bs, not $bsbs:

    $ echo "'a quot\\'d string' 42" | sed -r 's/'"[${q}]((([^${bs}]?[^${q}])|(${bs}${q}))+)[${q}]"'[[:space:]]([0-9]+)/\1\t:\t\2/'
    a quot\'d string : g

Conclusion:
I am developing non-POSIX extensions to the "regex(7) - POSIX.2 regular expressions" library provided by glibc, and to PCRE, PERL, cl-ppcre (a Common Lisp RE library, usable from SBCL), and Emacs's RE library, for:
o defining a meaning for any named POSIX character class when suffixed by '-esc' or 'esc', e.g. '[[:spaceesc:]]', '[^[:space-esc:]]' or '[[:quote-esc:]]'. The meaning is: a character that is ordinarily a member of character class 'X' is not a member of character class "${X}esc" (a synonym of "${X}-esc") IFF it is preceded by an ODD NUMBER of escape characters ('\': ASCII "\x5c").
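Until such a class exists, its effect for one class can be approximated in an ordinary ERE, as a rough sketch: a whitespace character counts as unescaped when the backslash run before it is empty or of even length. Unlike the proposed '[[:space-esc:]]', the pattern below also consumes the preceding context, so it is not a drop-in single-character class. The names `unescaped_space_re` and `has_unescaped_space` are hypothetical:

```shell
# (^|[^\\])   start of line, or one character that is not a backslash
# (\\\\)*     zero or more PAIRS of backslashes (an even-length run)
# [[:space:]] the whitespace character, therefore not escaped
unescaped_space_re='(^|[^\\])(\\\\)*[[:space:]]'

# has_unescaped_space STRING
# Succeed when STRING contains whitespace that is not preceded by an
# odd number of backslashes.
has_unescaped_space() {
    printf '%s\n' "$1" | grep -Eq "$unescaped_space_re"
}
```

For instance, `has_unescaped_space 'a\ b'` fails (one backslash, so the space is escaped), while `has_unescaped_space 'a\\ b'` succeeds.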
All character sequences that are subject to an :*esc: character class test will have legal '\\', '\xXX', '\0OOO', '\Uxxxxxx' or '\uXXXX' sequences replaced by, respectively: ASCII:\x5c; ASCII:\xXX (where XX are hex digits); ASCII:\OOO (where OOO are octal digits); the 24-bit Unicode value with code point xxxxxx (x: hex digit); and the 16-bit Unicode value with code point XXXX (X: hex digit).

Also, '[[:quote:]]' and '[[:quoteesc:]]' classes must be supported that select characters (or non-escaped characters) with the Unicode 'Quotation Mark' binary attribute, and '[[:punct:]]' or '[[:punctesc:]]' would similarly apply to all (non-escaped) characters which have the Punctuation attribute. Perhaps similar '*cesc' or '*escc' character class suffixes could be provided that also support the C escapes: '\n', '\r', '\t', '\v', '\b', '\f', etc.

If the /𝕦 (\U1D566) flag / UNICODE_NAMES flag is specified, then Unicode names can also be specified: \U1D566 == \U{MATHEMATICAL DOUBLE-STRUCK SMALL U} == \U{MATHEMATICAL_DOUBLE_STRUCK_SMALL_U}. There is no point in doing such an exercise unless UTF-8 names are also supported, IMHO. There is no point in just handling escaped spaces if full escape handling cannot also be enabled somehow, or does not come along with it.

Actually, the SBCL pure-Common-Lisp implementation is about the speediest and nicest to use amongst ANY RE implementation I have used, and it already supports escaped classes and Unicode names. The LIBC regex and glob implementations are EXTRAORDINARILY SLOW! This tremendously slows down BASH, command-line tools, and all tools that use the POSIX RE library, such as Flex / Bison / Yacc.
Perhaps either:
A) Techniques used in SBCL PPCRE, libppcre, and the PERL RE library can be ported to the LIBC regex library, in a new 'Fast Regex' replacement that can optionally replace the old implementation on demand; or
B) The LIBC RE library can be made to transparently replace itself with libppcre, or with a connection to a running SBCL instance with CL-PPCRE loaded, or to PERL with full PERL RE support, to support a UNIX CMSG message API for Compiling, Match Against String, or Match Against FD / stream, and for Retrieval of the Match Numbered N or with Name N, where N is in a set of group names or numbers sent in advance as identifying parenthesis groups, and which can contain multiple dimensions (numbers in square brackets) to denote sub-expressions.

Also, I think that supporting a char-class LENGTH test of the form ']{x,}\#<test>\?<A>\:<B>' would be very useful, meaning: "If the number of characters in the character class just closed satisfies test <test>, then RE fragment <A> is parsed / takes effect, else RE fragment <B> takes effect", for <test> in {=X, >X, <X, <=X, >=X, &X, |X, ^X, &~X, |~X}, where X is a decimal number. But first, I am working on the escaped char-classes support.

I can't understand why no one seems to understand what I was suggesting with this question.
I hope the above makes things clearer.