3
$\begingroup$

What is this step called to convert a parse tree into a pattern tree?

I need to understand terminologies on the act of conversion/rebalancing the parse tree such that the most simplistic regular expression (regex) are closest to the root node and conversely, the most open-ended regex (wildcard) are located at the end nodes of a pattern tree.

Some actions that I lacked words for that come to mind are:

  1. Pulling patterns from a parse tree full of token and expressions into a deterministic form of Aho-Corasick-like tree.

  2. Translating large number of Bison/Lexx spec files for a quickest identification of content (ie., Unix file command).

Real world applications are EBNF, VimL (vimscript), TextMate spec files into a low O-order search algorithm.

$\endgroup$
3
  • $\begingroup$ I don't think I understand. Are you trying to implement a parser that parses regexes? Or do you have a language in which regexes are a primitive, such that there are nodes in your AST that represent individual regexes? Or exactly what? $\endgroup$ Commented Sep 12, 2024 at 9:00
  • $\begingroup$ It is about taking all the known concrete syntax trees/graphs and putting together and making something like a Unix file command but with the goal of making a close to O(1) search pattern engine as possible. Many CST/CSG specs have regex expressions. Not even getting to content processing stage here ... yet. Rearranging and regrouping state loops. Theorectical. $\endgroup$ Commented Sep 12, 2024 at 11:53
  • $\begingroup$ By the way, I recalled that this SE info states that it does not focus on AST. $\endgroup$ Commented Sep 12, 2024 at 12:01

1 Answer 1

4
$\begingroup$

I don’t know if there’s another term of art for this, but it’s a kind of topological sorting by a partial ordering. This comes up in pattern matching, generic function instantiation, and multimethod dispatch, particularly when you want to find the unique most specific matching pattern for a given input. A special case in typechecking is subsumption, where you want to check that a definition’s inferred type is at least as general as its declared type signature.

You have a partially ordered set (poset) of patterns, where $P < Q$ if $P$ is less general than $Q$, meaning that $Q$ matches at least everything $P$ does. In other words $P$ is more specific than $Q$. Incomparable elements are patterns like (regex) a. and .b where neither one matches everything that the other does. All atomic patterns like a and b are incomparable in this way. They’re also “atomic elements” of the poset, meaning minimal among the nonzero elements.

You can define this ordering in various ways depending on the pattern language. For a regular language, you can consider $P < Q$ when their difference (intersection with complement) is empty: $P \setminus Q = P \cap \neg Q$ and $(P \setminus Q = \emptyset) \iff (P \subseteq Q)$. Computing this is decidable in polynomial time, for example using the Brzozowski derivative.

However, it’s easy for large inputs or extensions to the pattern language to make that method intractable or undecidable, so it’s more typical to define an ordering recursively. You can handle some cases precisely, like any pattern is less general than a wildcard $a < \top$, and ordering distributes over sequences of patterns, $ab < cd$ if $a < b$ and $c < d$. Other cases like choice $a+b \stackrel{?}{<} c+d$ you might handle heuristically.

Standard toposort algorithms apply, although I’ll note that if you have a term, you don’t need to sort the whole set of patterns to find the best matching pattern for it. Instead you can just iterate over the patterns while keeping a set of the most specific incomparable matching candidates seen so far. Whenever you find a match more specific than all of them, it becomes the new sole candidate. In the end you’ll have an ambiguity (>1), a unique match (=1), or no match (<1).

You may also be interested in the next step, Compiling pattern matching to good decision trees(PDF) (Maranget 2008).

$\endgroup$
1
  • $\begingroup$ Really nailed it. Pattern matching, generic function instantiation, poset, and multimethod dispatch with topsort is the coverage that does what I have embarked upon. Good decision tree, too (was aiming for minimal tree over naive tree). $\endgroup$ Commented Sep 10, 2024 at 17:51

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.