I'm trying to define a command that lets the user supply a regex, which is then embedded in a larger regex that in turn is matched against a token list.
The naive approach of inserting it from a token list with `\u{...}` fails, because `\u` matches the stored tokens literally (character codes and category codes) instead of reading them as regex syntax:

    \documentclass{article}
    \usepackage{expl3}
    \begin{document}
    \ExplSyntaxOn
    \tl_set:Nn \l_foo_tl { [a-z]+ }
    \regex_const:Nn \l_foo_regex { (\w+)( \[ \u{l_foo_tl} \] ) }
    \regex_show:N \l_foo_regex
    \seq_new:N \l_foo_seq
    \regex_extract_all:NnN \l_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
    \seq_show:N \l_foo_seq
    \ExplSyntaxOff
    \end{document}

outputs

    +-branch
      ,-group begin
      | Match, repeated 1 or more times, greedy
      |   range [97,122]
      |   range [65,90]
      |   range [48,57]
      |   char code 95
      `-group end
      ,-group begin
      | char code 91
      | char 91, catcode 12
      | char 97, catcode 11
      | char 45, catcode 12
      | char 122, catcode 11
      | char 93, catcode 12
      | char 43, catcode 12
      | char code 93
      `-group end.

What I'd want is either something like

    \regex_const:Nn \l_sub_regex { [a-z]+ }
    \regex_const:Nn \l_foo_regex { (\w+)( \[ ... \] ) }

where `...` somehow inserts the regex represented by `\l_sub_regex` (both `\c{l_sub_regex}` and `\u{l_sub_regex}` give wrong results here); or a way to convert a compiled regex back to its string representation, something like `\regex_to_str:N`.
Perhaps there's a way to insert it back from a token list using some `\detokenize` or `\scantokens` hackery, but I'm wondering if `l3regex` already provides a proper solution for this.
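
To make the token-list route concrete, here is a minimal sketch that avoids `\u` altogether: the full pattern is assembled as a token list first and compiled only once at the end, so the inserted part is read as regex syntax rather than as literal characters. The names `\l_foo_sub_tl` and `\l_foo_full_tl` are made up for illustration, and the sketch assumes the sub-regex is available as plain text (not as an already-compiled regex variable).

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
\tl_new:N \l_foo_sub_tl    % hypothetical name: user-supplied sub-regex as text
\tl_new:N \l_foo_full_tl   % hypothetical name: the assembled full pattern
\regex_new:N \l_foo_regex
\seq_new:N \l_foo_seq

\tl_set:Nn \l_foo_sub_tl { [a-z]+ }
% Assemble the pattern piece by piece; nothing is compiled at this stage.
\tl_set:Nn \l_foo_full_tl { (\w+)( \[ }
\tl_put_right:NV \l_foo_full_tl \l_foo_sub_tl
\tl_put_right:Nn \l_foo_full_tl { \] ) }
% Compile the assembled pattern in one go.
\exp_args:NNV \regex_set:Nn \l_foo_regex \l_foo_full_tl
\regex_show:N \l_foo_regex
\regex_extract_all:NnN \l_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
\seq_show:N \l_foo_seq
\ExplSyntaxOff
\end{document}
```

If the sub-regex may contain a top-level alternation such as `a|b`, it would presumably need to be wrapped in a non-capturing group `(?:...)` while assembling, otherwise the `|` splits the surrounding pattern.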
**EDIT:** I found a note in the `l3regex` documentation about features that are "likely to be implemented at some point in the future":
> Provide a syntax such as `\ur{l_my_regex}` to use an already-compiled regex in a more complicated regex. This makes regexes more easily composable.
So it seems such a feature doesn't currently exist but is planned for the future.
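
For readers on a newer kernel: later releases of `l3regex` do document a `\ur{...}` escape of this kind. A minimal sketch, assuming an installation recent enough to provide it (older ones will not recognise `\ur`); the variable names simply follow the ones used above:

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
\regex_new:N \l_sub_regex
\regex_new:N \l_foo_regex
\seq_new:N \l_foo_seq
\regex_set:Nn \l_sub_regex { [a-z]+ }
% Assumption: \ur is available. It inserts the compiled regex as if it were
% wrapped in a non-capturing group, so the explicit parentheses around
% \[ ... \] are what provide the capture here.
\regex_set:Nn \l_foo_regex { (\w+)( \[ \ur{l_sub_regex} \] ) }
\regex_show:N \l_foo_regex
\regex_extract_all:NnN \l_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
\seq_show:N \l_foo_seq
\ExplSyntaxOff
\end{document}
```

Because the inserted regex behaves like a non-capturing group, any capturing groups inside `\l_sub_regex` do not add entries to the extracted sequence.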
---
(By the way, it would be really helpful if the `\regex_show:` functions also printed the actual ASCII representation of a character when it is printable. Several lines of `char code XXX` are harder to debug than necessary.)
---
**EDIT 2:** Unfortunately, [the accepted answer](https://tex.stackexchange.com/a/495188/23765) doesn't seem to work correctly when a complete regex is inserted into a capturing group. The following code reproduces the problem:

    \regex_set:Nn \l_tmpa_regex { a|b|c }
    \cs_show:N \l_tmpa_regex
    \regex_set:Nn \l_tmpa_regex { (\y{l_tmpa_regex}) }
    \regex_set:Nn \l_tmpb_regex { (a|b|c) }
    \regex_show:N \l_tmpa_regex
    \regex_show:N \l_tmpb_regex
    \cs_show:N \l_tmpa_regex
    \cs_show:N \l_tmpb_regex
    \regex_extract_once:NnNTF \l_tmpa_regex {a} \l_tmpa_seq {} {}
    \seq_show:N \l_tmpa_seq
    \regex_extract_once:NnNTF \l_tmpb_regex {a} \l_tmpa_seq {} {}
    \seq_show:N \l_tmpa_seq

Both regexes are supposed to match on the token list `a` but only the second one does.
The final regexes in `\l_tmpa_regex` and `\l_tmpb_regex` should be the same, as the two identical `\regex_show:N` outputs suggest:

    > Compiled regex variable \l_tmpa_regex:
    +-branch
      ,-group begin
      | char code 97 (a)
      +-branch
      | char code 98 (b)
      +-branch
      | char code 99 (c)
      `-group end.
    > Compiled regex variable \l_tmpb_regex:
    +-branch
      ,-group begin
      | char code 97 (a)
      +-branch
      | char code 98 (b)
      +-branch
      | char code 99 (c)
      `-group end.

But the raw internal structure reveals that there is a difference (code shortened a bit):

    > \l_tmpa_regex=macro:->
    \__regex_branch:n {
      \__regex_group:nnnN {
        \__regex_branch:n {
          <char a>
          \__regex_branch:n { <char b> }
          \__regex_branch:n { <char c> }
        }
      }{1}{0}\c_false_bool
    }
    > \l_tmpb_regex=macro:->
    \__regex_branch:n {
      \__regex_group:nnnN {
        \__regex_branch:n { <char a> }
        \__regex_branch:n { <char b> }
        \__regex_branch:n { <char c> }
      }{1}{0}\c_false_bool
    }

If we look at the originally inserted regex, the internal structure seems fine at the start:

    > \l_tmpa_regex=macro:->
    \__regex_branch:n { <char a> }
    \__regex_branch:n { <char b> }
    \__regex_branch:n { <char c> }

Since the number of `\__regex_branch:n` commands is the same in both resulting regexes, I don't think this is a limitation of the internal regex structure, but rather of the way the accepted answer inserts one regex into another. In the broken `\l_tmpa_regex`, the `b` and `c` branches end up nested inside the first branch instead of sitting beside it, so it's probably related to brace tokens going to the wrong place.