I'm currently trying to parse a custom configuration format using ANTLR4. Here is what the input may look like (in reality it's a lot more technical, but I had to change it for SO because I want to keep my job...):
    // Example 1:
    /*
    $NAME = John "Blue" Doe
    $AGE = 20
    $DESCRIPTION = Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et
    ====================================
    Bytes:
    - Byte 0 = example 1
    - Byte 1 = example 2
    - Byte 3 = example 3
    $ENDDESCRIPTION
    $CITY = London $DISTRICT = Sutton
    */

    // Example 2:
    /*
    $NAME = Jane "Ruby" Doe
    $AGE = 28
    $DESCRIPTION
    Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
    ====================================
    At vero eos et accusam et justo duo dolores et ea rebum.
    $ENDDESCRIPTION
    $CITY = Berlin $DISTRICT = Spandau
    */

Requirements:
As you can see there are a bunch of key-value pairs.
- keywords are defined by a prefixed $ symbol followed by some uppercase letters.
- pairs are generally separated by an = symbol. The only exceptions are the $DESCRIPTION keyword, where the = separator is optional (this is due to legacy, some of the files I have to parse are 30+ years old...), and the $ENDDESCRIPTION keyword, which can't have a separator as it marks the end of a text.
- values are generally defined as "whatever is between the separator and the newline". The only exception is the $DESCRIPTION keyword, where the value is a formatted text that can span several lines. The end of the text is indicated by the $ENDDESCRIPTION keyword.
- Some lines can contain multiple assignments that contextually belong together (e.g. $CITY = London $DISTRICT = Sutton).
- The $DESCRIPTION value is a formatted text and may also contain the = symbol as part of its content, e.g. as dividers.
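To make the requirements above concrete, here is a rough reference sketch in plain Python (not ANTLR) of the output I'm after. The function name parse_config and the returned list of (keyword, value) tuples are just my own assumptions for illustration, and the sketch assumes $ENDDESCRIPTION always sits on its own line, as in the examples.

```python
import re

# Hypothetical helper that only models the requirements; it is not the grammar.
KEYWORD = re.compile(r"\$[A-Z]+")


def parse_config(source: str) -> list[tuple[str, str]]:
    """Return (keyword, value) pairs in the order they appear."""
    pairs = []
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        matches = list(KEYWORD.finditer(line))
        if not matches:
            continue  # comment markers, blank lines, anything unrecognised

        if matches[0].group() == "$DESCRIPTION":
            # The '=' separator is optional here; strip at most one of them.
            # Limitation: a divider directly after $DESCRIPTION on the same
            # line would lose its first '='.
            first = re.sub(r"^\s*=?\s*", "", line[matches[0].end():])
            body = [first] if first.strip() else []
            # Everything up to $ENDDESCRIPTION is kept verbatim,
            # including '=' dividers and "Byte 0 = ..." lines.
            while i < len(lines) and lines[i].strip() != "$ENDDESCRIPTION":
                body.append(lines[i])
                i += 1
            i += 1  # skip the $ENDDESCRIPTION line itself
            pairs.append(("$DESCRIPTION", "\n".join(body)))
            continue

        # Regular line: one or more "$KEY = value" assignments.
        for n, m in enumerate(matches):
            end = matches[n + 1].start() if n + 1 < len(matches) else len(line)
            _, sep, value = line[m.end():end].partition("=")
            if sep:  # ignore keywords that have no '=' separator
                pairs.append((m.group(), value.strip()))
    return pairs
```

With this model, a line such as $CITY = London $DISTRICT = Sutton yields two separate pairs, and the description text keeps its line breaks so it can be printed elsewhere later.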
My Attempt:
Here is the grammar I wrote:
    parser grammar Parser;

    options { tokenVocab = Lexer; }

    test : data* EOF ;
    data : keyValuePair+ ;

    // $KEYWORD = VALUE
    keyValuePair
        : keyword ASSIGN value
        | (DESCRIPTION ) ASSIGN? text+ DESC_END
        ;

    keyword : KEYWORD ;
    value : VALUE ;
    text : TEXT ;

    lexer grammar Lexer;

    COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;
    DESCRIPTION : '$DESCRIPTION' -> mode(DESC) ;
    KEYWORD : '$'[A-Z]+ ;
    ASSIGN : '=' -> mode(REGULAR) ;
    WS : [ \t\r\n]+ -> skip ;
    ANY : . -> skip ;

    mode DESC;
    DESC_END : '$ENDDESCRIPTION' -> mode(DEFAULT_MODE) ;
    DESC_ASSIGN : '=' -> type(ASSIGN) ;
    TEXT : (WORD ((' ' | '\t')+ WORD)* | ASSIGN+ ) ;
    DESC_WS : WS -> skip ;
    DESC_ANY : ANY -> skip ;

    mode REGULAR;
    REGULAR_END : [\r\n]+ -> mode(DEFAULT_MODE), channel(HIDDEN) ;
    REGULAR_KEYWORD : KEYWORD -> type(KEYWORD), mode(DEFAULT_MODE) ; // take care of multiple assignments per line
    VALUE : WORD ((' ' | '\t')+ WORD)* ;
    REGULAR_WS : WS -> skip ;
    REGULAR_ANY : ANY -> skip ;

    fragment WORD : CHAR+ ;
    fragment CHAR : [a-zA-Z0-9_?!@&%()<>|,.:;'"*+/#=-] ; //must NOT contain $ symbol

Problem:
I'm really only running into problems with the $DESCRIPTION keyword and the fact that its separator is optional. By jumping into the DESC mode whenever $DESCRIPTION is matched, everything works fine for Example 2, but for Example 1 I get a TEXT token with the value "= Lorem ipsum...", because = is a valid CHAR and can therefore be part of TEXT.
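As far as I understand it, the ANTLR lexer picks the rule that matches the longest stretch of input, and only falls back to rule order on a tie. Here is a simplified Python model of that behaviour for my DESC mode, just to illustrate the problem. The names DESC_RULES and next_token are made up for this sketch, the regexes are my own translation of the lexer rules, and none of this is ANTLR's actual implementation.

```python
import re

# CHAR mirrors the fragment from the lexer above (note that it contains '=').
CHAR = r"""[a-zA-Z0-9_?!@&%()<>|,.:;'"*+/#=-]"""
WORD = f"{CHAR}+"

# Simplified stand-ins for the DESC-mode rules, in grammar order.
DESC_RULES = [
    ("DESC_END", re.compile(r"\$ENDDESCRIPTION")),
    ("ASSIGN", re.compile(r"=")),  # DESC_ASSIGN -> type(ASSIGN)
    ("TEXT", re.compile(rf"{WORD}(?:[ \t]+{WORD})*|=+")),
]


def next_token(text, pos=0):
    """Pick the rule with the longest match at pos; the earlier rule wins ties."""
    best = None
    for name, pattern in DESC_RULES:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# The input right after "$DESCRIPTION " in Example 1 (whitespace already skipped):
print(next_token("= Lorem ipsum dolor sit amet,"))
# A lone '=' followed by a newline would still become ASSIGN (tie, first rule wins):
print(next_token("=\nLorem ipsum"))
```

In this model the first call returns a TEXT token that swallows the leading =, which is exactly what I see in Example 1: the one-character ASSIGN match loses against the much longer TEXT match.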
Obviously I could trim this in post-processing, but if there is a way to avoid that, I would like to know. Also, I'm pretty sure that my approach with the long CHAR character class is not ideal.
The other thing I'm wondering is whether there is a better way to collect the entire formatted text, because later I would like to print it elsewhere.
This is my first time using ANTLR (or any parser/lexer tool for that matter). Any suggestions or improvements to the grammar would be very much appreciated! Thanks a lot in advance!