How to parse strings with delimiters that overlap each other?

Question

I'm using ANTLR4 to parse a document (wikitext). In this document, strings surrounded by '' are italicized (like ''italics''). Strings surrounded by ''' are bold (like '''bold'''). But there is also the possibility of strings with both styles applied: '''''both'''''.

The issue is that these delimiters overlap each other. As a result, when the parser encounters five single quotes, it doesn't understand whether it's bold followed by italics or italics followed by bold. And I don't know how to tell it.

Ideally, it would know that it should choose whichever interpretation results in a successful parse, but if it can do that, I don't know how to make it work.

My grammar so far:

grammar WikiText;

NOWIKI_OPEN: '<nowiki>';
NOWIKI_CLOSE: '</nowiki>';
BOLD: '\'\'\'';
ITALICS: '\'\'';
CHAR: .;

nowiki: NOWIKI_OPEN CHAR* NOWIKI_CLOSE;
plainText: CHAR+;
boldText: BOLD wiki* BOLD;
italicText: ITALICS wiki* ITALICS;
nonPlainText: (boldText | italicText)+;
wiki: (nonPlainText | plainText)+;
document: (wiki | nowiki)*;

'''''hi'' there.'''. You cannot tokenize beyond a single quote. Semantic analysis must be employed after parsing. Then, rewrite the tree afterwards for '' and ''' operators.
– kaby76, Commented Jun 7, 2023 at 11:24

Mike Cargal · Accepted Answer · 2023-06-08 16:01:44Z

As @kaby76 commented, the Lexer does not have enough context to distinguish the quote sequences that should mark bold from those that should mark italics. This makes them a semantic concern (i.e. something to handle in parser rules).

The Lexer just proceeds character by character, attempting to match Lexer rules (the longer the match the better, and if two rules match the same length of input, the one defined first wins). The Parser, on the other hand, uses a recursive descent approach and only tries to match a rule where it is referenced from another rule (recursively from your start rule), so it can interpret the TICK tokens in context.
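To make the longest-match behavior concrete, here is a rough Python sketch (not ANTLR itself) that imitates it for the lexer rules in the question's grammar. The `RULES` table and `tokenize` function are hypothetical names made up for this illustration, and the sketch assumes nothing beyond "longest match wins, first rule wins on a tie".

```python
import re

# Assumption: these mirror the question's lexer rules, in declaration order.
RULES = [
    ("NOWIKI_OPEN", re.compile(r"<nowiki>")),
    ("NOWIKI_CLOSE", re.compile(r"</nowiki>")),
    ("BOLD", re.compile(r"'''")),
    ("ITALICS", re.compile(r"''")),
    ("CHAR", re.compile(r".", re.DOTALL)),  # any single character
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs using longest-match lexing."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earlier rule wins when two matches have equal length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len  # CHAR always matches, so this always advances
    return tokens


print([name for name, _ in tokenize("'''''both'''''")])
```

For the input `'''''both'''''` this prints `['BOLD', 'ITALICS', 'CHAR', 'CHAR', 'CHAR', 'CHAR', 'BOLD', 'ITALICS']`. The closing run is split the same way as the opening run, so the close delimiters arrive in the wrong order to pair up with the opens, which is why the decision cannot be made at the token level.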

You'll also need to group alternatives together into a rule to set up precedence. Be sure to use the non-greedy matching *? on the wiki matches. Otherwise ANTLR will attempt to match as much of the input as possible when evaluating the wiki* rule, and you'll get large blocks wrapped in bold or italics. (You can remove the ? and look at the resulting parse tree to see what I'm referring to.)

grammar WikiText;

NOWIKI_OPEN: '<nowiki>';
NOWIKI_CLOSE: '</nowiki>';
TICK: '\'';
CHAR: .;

nowiki: NOWIKI_OPEN CHAR* NOWIKI_CLOSE;
bold: TICK TICK TICK;
italic: TICK TICK;

wiki
    : CHAR+                  # plainText
    | italic wiki*? italic   # italicText
    | bold wiki*? bold       # boldText
    ;

wikiText: wiki+;
document: (wiki | nowiki)*;

echo "''italic'' '''bold''' '''''bold italic'''''" | grun WikiText document -gui
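If you want to see why single ticks plus non-greedy matching can resolve the five-quote case, the following Python sketch mimics the `wiki` rule above with plain exhaustive backtracking. This is only an illustration under stated assumptions: it is not how ANTLR's prediction actually works internally, `parse_wiki`, `parse_body` and `parse_document` are hypothetical helper names, plain text is simplified to a maximal run of non-tick characters, and `nowiki` is left out.

```python
def parse_wiki(text, pos):
    """Yield (node, next_pos) for each way one wiki element can match at pos."""
    # plainText: a maximal run of non-tick characters (simplification of CHAR+).
    end = pos
    while end < len(text) and text[end] != "'":
        end += 1
    if end > pos:
        yield ("plain", text[pos:end]), end
    # italicText, then boldText, in the same order as the grammar's alternatives.
    for label, ticks in (("italic", "''"), ("bold", "'''")):
        if text.startswith(ticks, pos):
            for children, nxt in parse_body(text, pos + len(ticks), ticks):
                yield (label, children), nxt


def parse_body(text, pos, closer):
    """Mimic non-greedy wiki*? plus the closing ticks: try to close first."""
    if text.startswith(closer, pos):
        yield [], pos + len(closer)
    for child, mid in parse_wiki(text, pos):
        for rest, nxt in parse_body(text, mid, closer):
            yield [child] + rest, nxt


def parse_document(text):
    """Return the first list of wiki nodes that consumes all of text, or None."""
    def sequence(pos):
        if pos == len(text):
            yield []
            return
        for child, nxt in parse_wiki(text, pos):
            for rest in sequence(nxt):
                yield [child] + rest
    return next(sequence(0), None)


print(parse_document("'''''both'''''"))
```

For `'''''both'''''` this prints `[('italic', [('bold', [('plain', 'both')])])]`: the five ticks are read as an italic opener followed by a bold opener, because that is the first interpretation under which the closing ticks also line up. Trying the closer before consuming more content is what stands in for the `*?` operator here.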

Re: your followup comment:

I used:

grammar WikiText;

NOWIKI_OPEN: '<nowiki>';
NOWIKI_CLOSE: '</nowiki>';
TICK: '\'';
CHAR: .;

nowiki: NOWIKI_OPEN CHAR* NOWIKI_CLOSE;

wiki
    : CHAR+                                    # plainText
    | TICK TICK TICK wiki*? TICK TICK TICK     # boldText
    | TICK TICK wiki*? TICK TICK               # italicText
    ;

wikiText: wiki+;
document: (wiki | nowiki)*;

and got this parse tree:

(I used the IntelliJ ANTLR plugin that shows the context type in the diagram)

It does not appear to be confused.

IMHO, the first grammar (with the separate bold and italic rules) is a bit easier to deal with, but you'll largely be ignoring the TICK tokens when working with the parse tree anyway, so there's little difference.

Honestly, I'm not grasping the principle. It seems that if the delimiters are named as a separate parser rule (bold: TICK TICK TICK;), everything is fine. But if I try to use raw ticks (TICK TICK TICK wiki*? TICK TICK TICK), it's back to being confused. I don't get why.
Amended the answer, since the reply would not format well in a comment.
