How to parse optional separator that can be part of formatted text?

Question

I'm currently trying to parse a custom configuration format using ANTLR4. Here is what the input may look like (in reality it's a lot more technical, but I had to change it for SO bc I want to keep my job... )

    // Example 1:
    /*
    $NAME = John "Blue" Doe
    $AGE = 20

    $DESCRIPTION = Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et
    ====================================
    Bytes:
    - Byte 0 = example 1
    - Byte 1 = example 2
    - Byte 3 = example 3
    $ENDDESCRIPTION
    $CITY = London $DISTRICT = Sutton
    */

    // Example 2:
    /*
    $NAME = Jane "Ruby" Doe
    $AGE = 28

    $DESCRIPTION Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
    tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
    ====================================
    At vero eos et accusam et justo duo dolores et ea rebum.
    $ENDDESCRIPTION
    $CITY = Berlin $DISTRICT = Spandau
    */

Requirements:

As you can see, there are a bunch of key-value pairs.
- Keywords are defined by a prefixed $ symbol followed by some uppercase letters.
- Pairs are generally separated by an = symbol. The only exceptions are the $DESCRIPTION keyword, where the = separator is optional (this is due to legacy, some of the files I have to parse are 30+ years old...), and the $ENDDESCRIPTION keyword, which can't have a separator as it marks the end of a text.
- Values are generally defined as "whatever is between the separator and the newline". The only exception is the $DESCRIPTION keyword, where the value is a formatted text that can span several lines. The end of the text is indicated by the $ENDDESCRIPTION keyword.
- Some lines can contain multiple assignments that contextually belong together (e.g. $CITY = London $DISTRICT = Sutton).
- The $DESCRIPTION value is a formatted text and may also contain the = symbol as part of its content, e.g. as dividers.
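To make the "multiple assignments per line" requirement concrete, here is a minimal Python sketch (plain `re`, not ANTLR) that splits one regular line into keyword/value pairs. The helper name `split_pairs` and the regex are assumptions made for illustration only; they cover regular lines, not the multi-line $DESCRIPTION block.

```python
import re

# Hypothetical helper for illustration: a keyword is '$' plus uppercase
# letters, and its value runs up to the next keyword or the end of the line.
PAIR = re.compile(r"(\$[A-Z]+)\s*=\s*(.*?)\s*(?=\$[A-Z]+|$)")

def split_pairs(line: str) -> list[tuple[str, str]]:
    """Return every (keyword, value) pair found on a single regular line."""
    return [(m.group(1), m.group(2)) for m in PAIR.finditer(line)]

print(split_pairs('$CITY = London $DISTRICT = Sutton'))
```

The lookahead stops a value right before the next `$KEYWORD`, which is the same job the REGULAR_KEYWORD rule does in the lexer grammar.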

My Attempt:

Here is the grammar I wrote:

    parser grammar Parser;

    options { tokenVocab = Lexer; }

    test
        : data* EOF
        ;

    data
        : keyValuePair+
        ;

    // $KEYWORD = VALUE
    keyValuePair
        : keyword ASSIGN value
        | (DESCRIPTION ) ASSIGN? text+ DESC_END
        ;

    keyword : KEYWORD ;
    value   : VALUE ;
    text    : TEXT ;

    lexer grammar Lexer;

    COMMENT     : '//' ~[\r\n]* -> channel(HIDDEN) ;
    DESCRIPTION : '$DESCRIPTION' -> mode(DESC) ;
    KEYWORD     : '$'[A-Z]+ ;
    ASSIGN      : '=' -> mode(REGULAR) ;
    WS          : [ \t\r\n]+ -> skip ;
    ANY         : . -> skip ;

    mode DESC;
    DESC_END    : '$ENDDESCRIPTION' -> mode(DEFAULT_MODE) ;
    DESC_ASSIGN : '=' -> type(ASSIGN) ;
    TEXT        : (WORD ((' ' | '\t')+ WORD)* | ASSIGN+ ) ;
    DESC_WS     : WS -> skip ;
    DESC_ANY    : ANY -> skip ;

    mode REGULAR;
    REGULAR_END     : [\r\n]+ -> mode(DEFAULT_MODE), channel(HIDDEN) ;
    REGULAR_KEYWORD : KEYWORD -> type(KEYWORD), mode(DEFAULT_MODE) ; // take care of multiple assignments per line
    VALUE           : WORD ((' ' | '\t')+ WORD)* ;
    REGULAR_WS      : WS -> skip ;
    REGULAR_ANY     : ANY -> skip ;

    fragment WORD : CHAR+ ;
    fragment CHAR : [a-zA-Z0-9_?!@&%()<>|,.:;'"*+/#=-] ; //must NOT contain $ symbol

Problem:

I'm really only running into problems with the $DESCRIPTION keyword and the fact that its separator is optional. By jumping into the DESC mode whenever $DESCRIPTION is matched, everything works fine for Example 2, but for Example 1 I get a TEXT token with the value "= Lorem ipsum...", because = can be part of TEXT.
Obviously I could trim this in post-processing, but if there is a way to avoid that, I would like to know. Also, I'm pretty sure that my approach with the long CHAR character class is not ideal.
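For reference, the post-processing fallback mentioned above could look roughly like the following Python sketch. The function name `strip_optional_separator` is hypothetical, and the rule "a run of two or more = is a divider, not a separator" is an assumption about the data rather than something the format guarantees.

```python
def strip_optional_separator(first_text: str) -> str:
    """Drop one leading '=' (and the blanks around it) from the first TEXT value.

    Assumption: a run of two or more '=' is a divider line and is left alone.
    """
    stripped = first_text.lstrip(" \t")
    if stripped.startswith("=") and not stripped.startswith("=="):
        return stripped[1:].lstrip(" \t")
    return first_text

print(strip_optional_separator("= Lorem ipsum"))
```

This only needs to run on the first TEXT token after $DESCRIPTION; later lines such as "- Byte 0 = example 1" stay untouched.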

The other thing I'm wondering is whether there is a better way to collect the entire formatted text, because later I would like to print it elsewhere.

This is my first time using ANTLR (or any parser/lexer tool for that matter). Any suggestions or improvements to the grammar would be very much appreciated! Thanks a lot in advance!

Hello and welcome! Very good first contribution. And the subject of ANTLR 4 is not a simple one!
– Marc Le Bihan, Commented Sep 5 at 3:26

Bart Kiers · Accepted Answer · 2025-09-06 10:00:31Z

You could add a pre-desc mode where you will optionally match the = token:

    lexer grammar Lexer;

    ...

    DESCRIPTION : '$DESCRIPTION' -> mode(PRE_DESC) ;

    ...

    mode PRE_DESC;
    PRE_DESC_ASSIGN : '=' -> type(ASSIGN) ;
    PRE_DESC_WS     : WS -> skip ;
    PRE_DESC_OTHER  : ~[=] -> more, mode(DESC) ;

    mode DESC;
    ...
    // DESC_ASSIGN : '=' -> type(ASSIGN) ; <-- can be removed
    ...

    mode REGULAR;
    ...

Read more about the -> more command: https://github.com/antlr/antlr4/blob/dev/doc/lexer-rules.md#mode-pushmode-popmode-and-more

For example 1 that would result in the tokens:

    KEYWORD      '$NAME'
    ASSIGN       '='
    VALUE        'John "Blue" Doe'
    REGULAR_END  '\n'
    KEYWORD      '$AGE'
    ASSIGN       '='
    VALUE        '20'
    REGULAR_END  '\n\n'
    DESCRIPTION  '$DESCRIPTION'
    ASSIGN       '='
    TEXT         'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et'
    TEXT         '===================================='
    TEXT         'Bytes:'
    TEXT         '- Byte 0 = example 1'
    TEXT         '- Byte 1 = example 2'
    TEXT         '- Byte 3 = example 3'
    DESC_END     '$ENDDESCRIPTION'
    KEYWORD      '$CITY'
    ASSIGN       '='
    VALUE        'London'
    KEYWORD      '$DISTRICT'
    ASSIGN       '='
    VALUE        'Sutton'
    EOF          '<EOF>'

And the tokens for example 2:

    KEYWORD      '$NAME'
    ASSIGN       '='
    VALUE        'Jane "Ruby" Doe'
    REGULAR_END  '\n'
    KEYWORD      '$AGE'
    ASSIGN       '='
    VALUE        '28'
    REGULAR_END  '\n\n'
    DESCRIPTION  '$DESCRIPTION'
    TEXT         'Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod'
    TEXT         'tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.'
    TEXT         '===================================='
    TEXT         'At vero eos et accusam et justo duo dolores et ea rebum.'
    DESC_END     '$ENDDESCRIPTION'
    KEYWORD      '$CITY'
    ASSIGN       '='
    VALUE        'Berlin'
    KEYWORD      '$DISTRICT'
    ASSIGN       '='
    VALUE        'Spandau'
    EOF          '<EOF>'
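If it helps to see why the extra mode produces these tokens, the mode switching can be emulated in a few lines of plain Python. This is only a sketch under assumptions, not ANTLR-generated code: the `MODES` table and `tokenize` function are made up for illustration, `WORDS` is a simplified stand-in for the WORD/CHAR fragments (any run of non-blank, non-$ characters), hidden-channel and skipped input are treated alike except for REGULAR_END, and `-> more` is imitated with a zero-width lookahead that switches mode without consuming the character.

```python
import re

WS = r"[ \t\r\n]+"
ANY = r"(?s)."
WORDS = r"[^\s$]+(?:[ \t]+[^\s$]+)*"  # simplified stand-in for WORD/CHAR

# mode -> ordered rules: (token type or None to skip, regex, next mode or None)
MODES = {
    "DEFAULT": [
        (None, r"//[^\r\n]*", None),
        ("DESCRIPTION", r"\$DESCRIPTION", "PRE_DESC"),
        ("KEYWORD", r"\$[A-Z]+", None),
        ("ASSIGN", r"=", "REGULAR"),
        (None, WS, None),
        (None, ANY, None),
    ],
    "PRE_DESC": [
        ("ASSIGN", r"=", None),
        (None, WS, None),
        (None, r"(?=[^=])", "DESC"),  # imitates '-> more, mode(DESC)'
    ],
    "DESC": [
        ("DESC_END", r"\$ENDDESCRIPTION", "DEFAULT"),
        ("TEXT", WORDS + r"|=+", None),
        (None, WS, None),
        (None, ANY, None),
    ],
    "REGULAR": [
        ("REGULAR_END", r"[\r\n]+", "DEFAULT"),
        ("KEYWORD", r"\$[A-Z]+", "DEFAULT"),
        ("VALUE", WORDS, None),
        (None, WS, None),
        (None, ANY, None),
    ],
}

def tokenize(src: str) -> list[tuple[str, str]]:
    mode, pos, tokens = "DEFAULT", 0, []
    while pos < len(src):
        best = None
        for token_type, pattern, next_mode in MODES[mode]:
            m = re.compile(pattern).match(src, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[0].end()):
                best = (m, token_type, next_mode)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        m, token_type, next_mode = best
        if token_type is not None:
            tokens.append((token_type, m.group()))
        pos = m.end()
        mode = next_mode or mode
    return tokens

for token in tokenize("$DESCRIPTION = Lorem ipsum\n====\n$ENDDESCRIPTION"):
    print(token)
```

The point of the PRE_DESC table is that = is only an ASSIGN before the first non-blank, non-= character; once DESC is entered, every later = ends up inside a TEXT token.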

The only remaining question now is if there is a good way to keep the text formatting, specifically stuff like indentation. Some of my users create entire monospace tables by hand as part of their description and I would ideally keep these intact. Would it be as simple as allowing everything except for ~[\r\n]+ or is there a better way?

Yes, that is a good way to handle it:

    mode DESC;
    DESC_END : '$ENDDESCRIPTION' -> mode(DEFAULT_MODE) ;
    TEXT     : ~[\r\n]+ ;
    NEW_LINE : [\r\n]+ -> skip ;

Just make sure $ENDDESCRIPTION is the only thing on its line: if a space precedes it, the whole line gets picked up as a TEXT token. To prevent that, allow optional whitespace around it (ANTLR prefers the longest match and, on a tie, the rule defined first, so DESC_END must stay above TEXT):

    DESC_END : WS* '$ENDDESCRIPTION' WS* -> mode(DEFAULT_MODE) ;
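The effect of these line-based rules can be previewed with a small Python sketch. It is not ANTLR output; `description_lines` is a hypothetical helper that mirrors the three rules above: every non-empty line is kept verbatim (like TEXT), empty lines produce nothing (like NEW_LINE being skipped), and a line holding only $ENDDESCRIPTION, with optional surrounding whitespace, ends the block.

```python
def description_lines(body: str) -> list[str]:
    """Collect the raw lines of a description, indentation included."""
    kept = []
    for line in body.splitlines():
        if line.strip() == "$ENDDESCRIPTION":  # DESC_END with WS* around it
            break
        if line:  # empty lines yield no TEXT token
            kept.append(line)
    return kept

table = "  col A | col B\n  ------+------\n\n  1     | 2\n  $ENDDESCRIPTION  \n$CITY = Berlin"
print("\n".join(description_lines(table)))
```

Joining the kept lines with "\n" gives back the hand-made table with its indentation. Note that empty lines are dropped, exactly as with the skipped NEW_LINE rule; if they matter, they would have to be emitted as tokens instead of skipped.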

Thank you! The output seems to be exactly what I'm looking for. I'll definitely check out the more command. The only remaining question now is if there is a good way to keep the text formatting, specifically stuff like indentation. Some of my users create entire monospace tables by hand as part of their description and I would ideally keep these intact. Would it be as simple as allowing everything except for ~[\r\n]+ or is there a better way?
