Revisions to Building regex from another one

Added note about current regex compilation process

edited Jul 12, 2019 at 18:00


Especially because the number of \__regex_branch:n commands is the same in both result regexes, I don't think this is a limitation of the internal regex structure but of the way the accepted answer inserts one regex into another. It's probably related to brace tokens going to the wrong place.

The problem seems to be more complicated than just shifting some braces around. When l3regex reads the definition of a branch, it has internally already pushed the sequence \__regex_branch:n { \if_false: } \fi: on the result token list. A proper fix would therefore have to check in what context the insertion occurs (beginning/middle/end of a branch) and modify the internal token list such that the regex to be inserted fits in correctly.

Added example that doesn't work with accepted answer


edited Jul 12, 2019 at 11:32

siracusa


EDIT: I found a note in the l3regex documentation about features that are "likely to be implemented at some point in the future":

(By the way, it would be really helpful if the \regex_show: functions would also print the actual ASCII representation of a character if it is in the set of printable characters. Several lines of char code XXX are harder to debug than necessary.)

EDIT 2: Unfortunately the accepted answer doesn't seem to work correctly when a complete regex is inserted in a capturing group. The following code reproduces the problem:

\regex_set:Nn \l_tmpa_regex { a|b|c }
\cs_show:N \l_tmpa_regex

\regex_set:Nn \l_tmpa_regex { (\y{l_tmpa_regex}) }
\regex_set:Nn \l_tmpb_regex { (a|b|c) }

\regex_show:N \l_tmpa_regex
\regex_show:N \l_tmpb_regex

\cs_show:N \l_tmpa_regex
\cs_show:N \l_tmpb_regex

\regex_extract_once:NnNTF \l_tmpa_regex {a} \l_tmpa_seq {} {}
\seq_show:N \l_tmpa_seq

\regex_extract_once:NnNTF \l_tmpb_regex {a} \l_tmpa_seq {} {}
\seq_show:N \l_tmpa_seq

Both regexes are supposed to match on the token list a but only the second one does.

The final regexes in \l_tmpa_regex and \l_tmpb_regex should be the same, as the two identical \regex_show:N outputs suggest:

> Compiled regex variable \l_tmpa_regex:
+-branch
  ,-group begin
  | char code 97 (a)
  +-branch
  | char code 98 (b)
  +-branch
  | char code 99 (c)
  `-group end.

> Compiled regex variable \l_tmpb_regex:
+-branch
  ,-group begin
  | char code 97 (a)
  +-branch
  | char code 98 (b)
  +-branch
  | char code 99 (c)
  `-group end.

But the raw internal structure reveals that there is a difference (code shortened a bit):

> \l_tmpa_regex=macro:->
\__regex_branch:n {
  \__regex_group:nnnN {
    \__regex_branch:n { <char a>
      \__regex_branch:n { <char b> }
      \__regex_branch:n { <char c> }
    }
  }{1}{0}\c_false_bool
}

> \l_tmpb_regex=macro:->
\__regex_branch:n {
  \__regex_group:nnnN {
    \__regex_branch:n { <char a> }
    \__regex_branch:n { <char b> }
    \__regex_branch:n { <char c> }
  }{1}{0}\c_false_bool
}

If we look at the originally inserted regex, the internal structure seems fine at the start:

> \l_tmpa_regex=macro:-> \__regex_branch:n { <char a> } \__regex_branch:n { <char b> } \__regex_branch:n { <char c> }

Especially because the number of \__regex_branch:n commands is the same in both result regexes, I don't think this is a limitation of the internal regex structure but of the way the accepted answer inserts one regex into another. It's probably related to brace tokens going to the wrong place.


Added note from documentation


edited Jun 9, 2019 at 6:15

siracusa


I'm trying to define a command where the user can define a regex, which should then be used to create a larger regex that in turn will be matched on a token list.

The naive approach to insert it from a token list obviously fails because of a wrong catcode setup:

\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
\tl_set:Nn \l_foo_tl { [a-z]+ }
\regex_const:Nn \l_foo_regex { (\w+)( \[ \u{l_foo_tl} \] ) }
\regex_show:N \l_foo_regex
\seq_new:N \l_foo_seq
\regex_extract_all:NnN \l_foo_regex { a[x], b[yy], c[zzz] } \l_foo_seq
\seq_show:N \l_foo_seq
\ExplSyntaxOff
\end{document}

outputs

+-branch
  ,-group begin
  | Match, repeated 1 or more times, greedy
  |   range [97,122]
  |   range [65,90]
  |   range [48,57]
  |   char code 95
  `-group end
  ,-group begin
  | char code 91
  | char 91, catcode 12
  | char 97, catcode 11
  | char 45, catcode 12
  | char 122, catcode 11
  | char 93, catcode 12
  | char 43, catcode 12
  | char code 93
  `-group end.

What I'd want is either something like

\regex_const:Nn \l_sub_regex { [a-z]+ }
\regex_const:Nn \l_foo_regex { (\w+)( \[ ... \] ) }

where ... somehow inserts the regex represented by \l_sub_regex (both \c{l_sub_regex} and \u{l_sub_regex} give wrong results here); or a way to convert a compiled regex back to its string representation, something like \regex_to_str:N.

Perhaps there's a way to insert it back from a token list using some \detokenize or \scantokens hackery, but I'm wondering if l3regex already provides a proper solution for this.
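For illustration, here is a minimal sketch of the token-list route that avoids the catcode problem by expanding the user's pattern into the regex argument before it is compiled. It is not taken from the l3regex documentation: the names \l_demo_sub_tl, \l_demo_regex and \l_demo_seq are hypothetical, and the Nx variant of \regex_set:Nn is assumed not to be predefined, so it is generated here.

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Assumption: no x-type variant of \regex_set:Nn exists yet, so create one.
\cs_generate_variant:Nn \regex_set:Nn { Nx }

% Hypothetical variables used only for this sketch.
\tl_new:N    \l_demo_sub_tl
\regex_new:N \l_demo_regex
\seq_new:N   \l_demo_seq

% The user-supplied pattern is stored as plain tokens, not as a compiled regex.
\tl_set:Nn \l_demo_sub_tl { [a-z]+ }

% Only the variable is expanded; the surrounding pattern is protected from
% expansion, so the compiler sees one ordinary regex.
\regex_set:Nx \l_demo_regex
  {
    \exp_not:n { (\w+) ( \[ }
    \exp_not:V \l_demo_sub_tl
    \exp_not:n { \] ) }
  }

\regex_show:N \l_demo_regex
\regex_extract_all:NnN \l_demo_regex { a[x], b[yy], c[zzz] } \l_demo_seq
\seq_show:N \l_demo_seq
\ExplSyntaxOff
\end{document}
```

This only helps when the sub-pattern is still available as a token list; it does not recover the source of an already-compiled regex. If the user's pattern may contain a top-level alternation, wrapping the inserted part in a non-capturing group (?: ... ) is probably advisable.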

EDIT: I found a note in the l3regex documentation about features that are "likely to be implemented at some point in the future":

Provide a syntax such as \ur{l_my_regex} to use an already-compiled regex in a more complicated regex. This makes regexes more easily composable.

So it seems such a feature doesn't currently exist but is planned for the future.
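As a sketch of how the documented idea might look in use, assuming an expl3 release whose l3regex actually accepts \ur{...} (this is an assumption about the installed version and should be checked against its documentation); the variable names below are hypothetical:

```latex
\documentclass{article}
\usepackage{expl3}
\begin{document}
\ExplSyntaxOn
% Hypothetical variables used only for this sketch.
\regex_new:N \l_demo_sub_regex
\regex_new:N \l_demo_full_regex
\seq_new:N   \l_demo_seq

\regex_set:Nn \l_demo_sub_regex { [a-z]+ }

% Assumption: \ur{<name>} inserts the compiled regex stored in \<name>.
% On a version without this feature, the line below is expected to fail.
\regex_set:Nn \l_demo_full_regex { (\w+) ( \[ \ur{l_demo_sub_regex} \] ) }

\regex_show:N \l_demo_full_regex
\regex_extract_all:NnN \l_demo_full_regex { a[x], b[yy], c[zzz] } \l_demo_seq
\seq_show:N \l_demo_seq
\ExplSyntaxOff
\end{document}
```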

(By the way, it would be really helpful if the \regex_show: functions would also print the actual ASCII representation of a character if it is in the set of printable characters. Several lines of char code XXX are harder to debug than necessary.)



asked Jun 8, 2019 at 15:13

siracusa
