What do you call the act of rebalance an parse tree so simplistic regex are closest to the root node over variables/identifiers?

Question

What is this step called to convert a parse tree into a pattern tree?

I need to understand terminologies on the act of conversion/rebalancing the parse tree such that the most simplistic regular expression (regex) are closest to the root node and conversely, the most open-ended regex (wildcard) are located at the end nodes of a pattern tree.

Some actions that I lacked words for that come to mind are:

Pulling patterns from a parse tree full of token and expressions into a deterministic form of Aho-Corasick-like tree.
Translating large number of Bison/Lexx spec files for a quickest identification of content (ie., Unix file command).

Real world applications are EBNF, VimL (vimscript), TextMate spec files into a low O-order search algorithm.

I don't think I understand. Are you trying to implement a parser that parses regexes? Or do you have a language in which regexes are a primitive, such that there are nodes in your AST that represent individual regexes? Or exactly what? — Karl Knechtel
– Karl Knechtel, Commented Sep 12, 2024 at 9:00
It is about taking all the known concrete syntax trees/graphs and putting together and making something like a Unix file command but with the goal of making a close to O(1) search pattern engine as possible. Many CST/CSG specs have regex expressions. Not even getting to content processing stage here ... yet. Rearranging and regrouping state loops. Theorectical. — John Greene
– John Greene, Commented Sep 12, 2024 at 11:53
By the way, I recalled that this SE info states that it does not focus on AST. — John Greene
– John Greene, Commented Sep 12, 2024 at 12:01

Jon Purdy · Accepted Answer · 2024-09-10 15:51:31Z

I don’t know if there’s another term of art for this, but it’s a kind of topological sorting by a partial ordering. This comes up in pattern matching, generic function instantiation, and multimethod dispatch, particularly when you want to find the unique most specific matching pattern for a given input. A special case in typechecking is subsumption, where you want to check that a definition’s inferred type is at least as general as its declared type signature.

You have a partially ordered set (poset) of patterns, where $P < Q$ if $P$ is less general than $Q$, meaning that $Q$ matches at least everything $P$ does. In other words $P$ is more specific than $Q$. Incomparable elements are patterns like (regex) a. and .b where neither one matches everything that the other does. All atomic patterns like a and b are incomparable in this way. They’re also “atomic elements” of the poset, meaning minimal among the nonzero elements.

You can define this ordering in various ways depending on the pattern language. For a regular language, you can consider $P < Q$ when their difference (intersection with complement) is empty: $P \setminus Q = P \cap \neg Q$ and $(P \setminus Q = \emptyset) \iff (P \subseteq Q)$. Computing this is decidable in polynomial time, for example using the Brzozowski derivative.

However, it’s easy for large inputs or extensions to the pattern language to make that method intractable or undecidable, so it’s more typical to define an ordering recursively. You can handle some cases precisely, like any pattern is less general than a wildcard $a < \top$, and ordering distributes over sequences of patterns, $ab < cd$ if $a < b$ and $c < d$. Other cases like choice $a+b \stackrel{?}{<} c+d$ you might handle heuristically.

Standard toposort algorithms apply, although I’ll note that if you have a term, you don’t need to sort the whole set of patterns to find the best matching pattern for it. Instead you can just iterate over the patterns while keeping a set of the most specific incomparable matching candidates seen so far. Whenever you find a match more specific than all of them, it becomes the new sole candidate. In the end you’ll have an ambiguity (>1), a unique match (=1), or no match (<1).

You may also be interested in the next step, Compiling pattern matching to good decision trees^(PDF) (Maranget 2008).

Really nailed it. Pattern matching, generic function instantiation, poset, and multimethod dispatch with topsort is the coverage that does what I have embarked upon. Good decision tree, too (was aiming for minimal tree over naive tree). — John Greene
– John Greene, Commented Sep 10, 2024 at 17:51

Stack Exchange Network

What do you call the act of rebalance an parse tree so simplistic regex are closest to the root node over variables/identifiers?

1 Answer 1

You must log in to answer this question.

Hot Network Questions

What do you call the act of rebalance an parse tree so simplistic regex are closest to the root node over variables/identifiers?

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions