Context-sensitive rules in lark LARL parser

Question

I'm building a LALR grammar for a language where ">" can be either

a) the greater-than operator
b) the closing bracket of a "<" ... ">" construct (tuple literal)

So, parsing of ">" is context-dependent. ">" is:

a) a greater-than in the default context

1 > 2 # greater_than(int(1), int(2))

b) a closing bracket in the tuple context, ie. following an opening bracket not inside parenthesis

<1> # tuple(int(1)) <1, 2> # tuple(int(1), int(2)) <(1 > 2)> # tuple(greater_than(int(1), int(2))) <1> > <2> # greater_than(tuple(int(1)), tuple(int(2)))

With the LARL grammar G1, lark fails to parse "<1>":

$ ./parser.py ... KeyError: 'NEWLINE' ... lark.exceptions.UnexpectedToken: Unexpected token Token('NEWLINE', '\n') at line 4, column 27. Expected one of: * LPAR * LESSTHAN * INT

grammar G1 (fails to parse "<1>"):

program : NEWLINE* (expression NEWLINE+)* ?expression : expression ("," expression)+ -> sequence | value ?value : "(" value ")" | binary | tuple | int binary : value ">" value -> greater_than tuple : "<" expression ">" int : INT %ignore WS_INLINE %ignore SH_COMMENT %import common.INT %import common.NEWLINE %import common.WS_INLINE %import common.SH_COMMENT

lark:

from lark import Lark G1 = open('G1.lark').read().replace('\n :', ':') tree = Lark( grammar=G1, start='program', parser='lalr', lexer='basic' ).parse(''' 1 > 2 # greater_than(int(1), int(2)) <1> # tuple(int(1)) <1, 2> # tuple(int(1), int(2)) <(1 > 2)> # tuple(greater_than(int(1), int(2))) <1> > <2> # greater_than(tuple(int(1)), tuple(int(2))) ''') print(tree.pretty())

IIUC, it fails to parse "<1>" because it parses the trailing ">" as part of the rule binary: value ">" value, ie. it chooses to shift ">" rather than to reduce (as documented eg. here) value(int(1)) to expression(value(int(1))), which would be necessary for applying the rule tuple: "<" expression ">".

At a high level, to handle ">" (and other context-sensitive code), I'd like the grammar to express the idea of a context; that is, for the rule tuple: "<" expression ">" to say: the inside of "<" ... ">" is a tuple context; in this context, expressions are special / restricted, they can't include ">" (unless inside parenthesis), they are "expressions-inside-tuple".

I could do this manually, ie. hardcode / thread the context through the rules, by forking each rule affected by the context (ie. each rule "between" tuple, where the context is set, and binary, where the context is checked) into two variants, one for the default context and one for the tuple context. This works (grammar G2 below), but it makes the grammar much more difficult to work with (especially that the actual grammar is much larger), and it feels like working at a too-low level of abstraction.

Is there an easier way to have context-sensitive rules?

I realize that LARL grammar is context-free, so maybe the idea of a context goes really against it?

If so, how else to make lark handle ">" in the examples above as intended?

Is it with rule priorities? Prorities can be used to resolve shift / reduce conflicts (as documented here again) and grammar G1 does indeed have conflicts:

Shift/Reduce conflict for terminal MORETHAN: (resolving as shift) * <expression : value> Shift/Reduce conflict for terminal COMMA: (resolving as shift) * <__expression_plus_2 : __expression_plus_2 COMMA expression> Shift/Reduce conflict for terminal COMMA: (resolving as shift) * <__expression_plus_2 : COMMA expression> Shift/Reduce conflict for terminal COMMA: (resolving as shift) * <expression : expression __expression_plus_2> Shift/Reduce conflict for terminal MORETHAN: (resolving as shift) * <binary : value MORETHAN value>

I realize C++ has a similar difficulty with angle brackets. How is it solved in C++?

grammar G2 (parses "<1>", tuple context hardcoded / threaded via forked rules with _in_tuple suffix):

program : NEWLINE* (expression NEWLINE+)* ?expression : expression ("," expression)+ -> sequence | value ?expression_in_tuple : expression_in_tuple ("," expression_in_tuple)+ -> sequence | value_in_tuple ?value : "(" value ")" | binary | tuple | int ?value_in_tuple : "(" value ")" | binary_in_tuple | tuple | int binary : value ">" value -> greater_than binary_in_tuple : () tuple : "<" expression_in_tuple ">" int : INT

For anything context-sensitive, you should try to utilize the interactive parser. It allows you to combine manual and grammar parsing with a simple interface. — Erez
– Erez, Commented Aug 31 at 7:47

Collectives™ on Stack Overflow

Context-sensitive rules in lark LARL parser

0

Hot Network Questions

Collectives™ on Stack Overflow

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.