
I'm trying to define a command where the user can define a regex, which should then be used to create a larger regex that in turn will be matched on a token list.

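The naive approach of inserting it from a token list with `\u{...}` fails, because the contents of the token list are then matched literally (character code plus catcode) instead of being interpreted as a regex: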

 \documentclass{article}
 \usepackage{expl3}
 
 \begin{document}
 \ExplSyntaxOn
 
 \tl_set:Nn \l_foo_tl { [a-z]+ }
 \regex_const:Nn \l_foo_regex { (\w+)( \[ \u{l_foo_tl} \] ) }
 \regex_show:N \l_foo_regex
 
 \seq_new:N \l_foo_seq
 \regex_extract_all:NnN \l_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
 \seq_show:N \l_foo_seq

 \ExplSyntaxOff
 \end{document}

outputs

 +-branch
 ,-group begin
 | Match, repeated 1 or more times, greedy
 | range [97,122]
 | range [65,90]
 | range [48,57]
 | char code 95
 `-group end
 ,-group begin
 | char code 91
 | char 91, catcode 12
 | char 97, catcode 11
 | char 45, catcode 12
 | char 122, catcode 11
 | char 93, catcode 12
 | char 43, catcode 12
 | char code 93
 `-group end.
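
To make the dump above concrete, here is a minimal sketch of what `\u` appears to do, using only the kernel scratch variables `\l_tmpa_tl` and `\l_tmpa_regex`. It assumes the same catcode regime (`\ExplSyntaxOn`) for the token list and the subject. I would expect the first test to take the true branch and the second the false branch, since the regex only accepts the literal text `[a-z]+` between the brackets.

```latex
\documentclass{article}
\usepackage{expl3}

\begin{document}
\ExplSyntaxOn

% Store the sub-pattern as plain tokens, as in the example above.
\tl_set:Nn \l_tmpa_tl { [a-z]+ }
\regex_set:Nn \l_tmpa_regex { \[ \u{l_tmpa_tl} \] }

% Subject that contains the pattern text itself between brackets.
\regex_match:NnTF \l_tmpa_regex { [[a-z]+] }
  { \iow_term:n { literal~pattern~text:~match } }
  { \iow_term:n { literal~pattern~text:~no~match } }

% Subject that the intended character class would accept.
\regex_match:NnTF \l_tmpa_regex { [xyz] }
  { \iow_term:n { lowercase~letters:~match } }
  { \iow_term:n { lowercase~letters:~no~match } }

\ExplSyntaxOff
\end{document}
```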

What I'd want is either something like

 \regex_const:Nn \l_sub_regex { [a-z]+ }
 \regex_const:Nn \l_foo_regex { (\w+)( \[ ... \] ) }

where `...` somehow inserts the regex represented by `\l_sub_regex` (both `\c{l_sub_regex}` and `\u{l_sub_regex}` give wrong results here); or a way to convert a compiled regex back to its string representation, something like `\regex_to_str:N`.

Perhaps there's a way to insert it back from a token list using some `\detokenize` or `\scantokens` hackery, but I'm wondering if `l3regex` already provides a proper solution for this.
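
A minimal sketch of the token-list route, without any `\scantokens`: assemble the whole pattern as a token list first and hand its value to `\regex_set:Nn` via `\exp_args:NNV`, so that the user's part is compiled as regex syntax together with the rest. The variable names `\l_foo_sub_tl` and `\l_foo_pattern_tl` are made up for illustration. It also assumes the user pattern contains no unbalanced braces or `#`.

```latex
\documentclass{article}
\usepackage{expl3}

\begin{document}
\ExplSyntaxOn

\tl_new:N \l_foo_sub_tl      % user-supplied regex, stored as plain tokens
\tl_new:N \l_foo_pattern_tl  % the full pattern, assembled piece by piece
\regex_new:N \l_foo_regex
\seq_new:N \l_foo_seq

\tl_set:Nn \l_foo_sub_tl { [a-z]+ }

% Build "(\w+)(\[" + user pattern + "\])" without expanding anything.
\tl_set:Nn \l_foo_pattern_tl { (\w+)( \[ }
\tl_put_right:NV \l_foo_pattern_tl \l_foo_sub_tl
\tl_put_right:Nn \l_foo_pattern_tl { \] ) }

% Compile the assembled pattern; V passes the value of the token list.
\exp_args:NNV \regex_set:Nn \l_foo_regex \l_foo_pattern_tl
\regex_show:N \l_foo_regex

\regex_extract_all:NnN \l_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
\seq_show:N \l_foo_seq

\ExplSyntaxOff
\end{document}
```

This only composes uncompiled pattern text, so it does not help when the sub-regex is already a compiled regex variable. A user pattern with a top-level alternation such as `a|b` would presumably also need to be wrapped in `(?:...)` before insertion.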

**EDIT:** I found a note in the `l3regex` documentation about features that are "likely to be implemented at some point in the future":

> Provide a syntax such as `\ur{l_my_regex}` to use an already-compiled regex in a more complicated regex. This makes regexes more easily composable.

So it seems such a feature doesn't currently exist but is planned for the future.

---

(By the way, it would be really helpful if the `\regex_show:` functions also printed the actual character next to its code whenever it is a printable ASCII character. Several lines of `char code XXX` are harder to debug than necessary.)

---

**EDIT 2:** Unfortunately, [the accepted answer](https://tex.stackexchange.com/a/495188/23765) doesn't seem to work correctly when a complete regex is inserted into a capturing group. The following code reproduces the problem:

 \regex_set:Nn \l_tmpa_regex { a|b|c }
 \cs_show:N \l_tmpa_regex
 
 \regex_set:Nn \l_tmpa_regex { (\y{l_tmpa_regex}) }
 \regex_set:Nn \l_tmpb_regex { (a|b|c) }
 \regex_show:N \l_tmpa_regex
 \regex_show:N \l_tmpb_regex
 \cs_show:N \l_tmpa_regex
 \cs_show:N \l_tmpb_regex
 
 \regex_extract_once:NnNTF \l_tmpa_regex {a} \l_tmpa_seq {} {}
 \seq_show:N \l_tmpa_seq
 \regex_extract_once:NnNTF \l_tmpb_regex {a} \l_tmpa_seq {} {}
 \seq_show:N \l_tmpa_seq

Both regexes are supposed to match on the token list `a` but only the second one does.

The final regexes in `\l_tmpa_regex` and `\l_tmpb_regex` should be the same, as the two identical `\regex_show:N` outputs suggest:

 > Compiled regex variable \l_tmpa_regex:
 +-branch
 ,-group begin
 | char code 97 (a)
 +-branch
 | char code 98 (b)
 +-branch
 | char code 99 (c)
 `-group end.
 
 > Compiled regex variable \l_tmpb_regex:
 +-branch
 ,-group begin
 | char code 97 (a)
 +-branch
 | char code 98 (b)
 +-branch
 | char code 99 (c)
 `-group end.

But the raw internal structure reveals that there is a difference (code shortened a bit):

 > \l_tmpa_regex=macro:->
 \__regex_branch:n {
 \__regex_group:nnnN {
 \__regex_branch:n {
 <char a>
 \__regex_branch:n { <char b> }
 \__regex_branch:n { <char c> }
 }
 }{1}{0}\c_false_bool
 }
 
 > \l_tmpb_regex=macro:->
 \__regex_branch:n {
 \__regex_group:nnnN {
 \__regex_branch:n { <char a> }
 \__regex_branch:n { <char b> }
 \__regex_branch:n { <char c> }
 }{1}{0}\c_false_bool
 }

If we look at the originally inserted regex, the internal structure seems fine at the start:

 > \l_tmpa_regex=macro:->
 \__regex_branch:n { <char a> }
 \__regex_branch:n { <char b> }
 \__regex_branch:n { <char c> }

Since the number of `\__regex_branch:n` commands is the same in both resulting regexes, I don't think this is a limitation of the internal regex structure, but rather of the way the accepted answer inserts one regex into another. It's probably related to brace tokens ending up in the wrong place.
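
Because `\regex_show:N` hides this difference, a quick way to detect it is to compare the two variables as macros. This is only a debugging sketch: it relies on compiled regex variables being parameterless macros (as the `\cs_show:N` output above indicates), which is an implementation detail rather than documented behaviour. Here both variables are compiled from the same string as a stand-in. To reproduce the problem, the first `\regex_set:Nn` would be replaced by the composed version from the code above.

```latex
\documentclass{article}
\usepackage{expl3}

\begin{document}
\ExplSyntaxOn

% Stand-in: replace this line with the composed regex under test.
\regex_set:Nn \l_tmpa_regex { (a|b|c) }
% Reference regex, written out directly.
\regex_set:Nn \l_tmpb_regex { (a|b|c) }

% \cs_if_eq:NNTF compares the meanings of the two control sequences,
% i.e. the compiled internal token lists.
\cs_if_eq:NNTF \l_tmpa_regex \l_tmpb_regex
  { \iow_term:n { same~internal~structure } }
  { \iow_term:n { different~internal~structure } }

\ExplSyntaxOff
\end{document}
```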