ANTLR lexer disabling tokens then reenabling them not working as expected

Question

I have a lexer with a token whose rule is enabled or disabled by a boolean property.

I create an input stream and tokenize some text. My token is called PHRASE_TEXT and should match anything that fits this pattern: '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?

I tokenize "foo bar" and, as expected, I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text, I get "foo , bar", that is, 2 tokens instead of one. This is also expected behavior.

The problem comes when I set the property to true again. I would expect the same text to tokenize to the single token "foo bar", but instead it is tokenized into the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried both creating new instances of the lexer and reusing the same instance, but it doesn't work either way. Thanks in advance.

Edit: part of my grammar follows below.

grammar LuceneQueryParser;

@header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}

@lexer::members {
    public boolean phrases = true;
}

@parser::members {
    public boolean phraseQueries = true;
}

mainQ : LPAREN query RPAREN
      | query
      ;

query : not ((AND|OR)? not)* ;

andClause : AND ;

orClause : OR ;

not : NOT? modifier? clause;

clause : qualified
       | unqualified
       ;

unqualified : LBRACK range_in LBRACK
            | LCURL range_out RCURL
            | truncated
            | {phraseQueries}? quoted
            | LPAREN query RPAREN
            | normal
            ;

truncated : TERM_TEXT_TRUNCATED;

range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);

range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);

qualified : TERM_TEXT COLON unqualified ;

normal : TERM_TEXT;

quoted : PHRASE_TEXT;

modifier : PLUS
         | MINUS
         ;

PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;

TERM_TEXT : (TERM_CHAR|ESCAPE)+;

TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
            | '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
            | '+' | '-' | '!' | ':' | '~' | '^'
            | '*' | '|' | '&' | '?' );

ESCAPE : '\\' ~[];

The problem seems to be that after I set phrases to false and then to true again, no more tokens are recognized as PHRASE_TEXT. I know that, as a guideline, I should define my grammars to be unambiguous, but this is basically how it has to end up: tokenizing a string with quotes in 2 different modes, depending on the situation.

I would need to see more of the grammar and the calling code in order to answer this question.
– Sam Harwell, Commented Sep 17, 2013 at 0:28
You might want to look into ANTLR4's support for lexical modes, and try to trigger that switching mechanism from your code. I believe the feature was intended to support situations such as embedding PHP inside HTML.
– Darien, Commented Sep 23, 2013 at 20:37

omu_negru · Accepted Answer · 2013-09-17 14:56:51Z

I'm going to update this with an answer that a colleague of mine helpfully pointed out. The generated lexer class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true, the decision tree was apparently changed for all object instances. The fix was to have two separate DFA[] arrays, one for the true and one for the false value of the property I was modifying. I think making that array non-static would be too expensive, and I can't think of another fix.
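
To make the mechanism concrete, here is a minimal sketch in plain Java. It is a toy model only: ToyLexer, its two static maps and countTokens are made-up names, and nothing here is ANTLR's generated code or a claim about how ANTLR builds its DFA internally. It assumes only what was observed above: a decision cached while the flag was false gets reused after the flag is set back to true, and keeping one cache per flag value avoids that.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: ToyLexer is a made-up class, not ANTLR's generated lexer.
final class ToyLexer {
    // Stands in for the static DFA[]: one cache shared by every instance.
    private static final Map<String, Integer> SHARED = new HashMap<>();
    // The workaround: a separate static cache per value of the flag.
    private static final Map<Boolean, Map<String, Integer>> PER_FLAG = new HashMap<>();

    boolean phrases = true;            // plays the role of the grammar's lexer member
    private final boolean splitCaches; // true = use the workaround

    ToyLexer(boolean splitCaches) {
        this.splitCaches = splitCaches;
    }

    // Returns the number of tokens produced for the input.
    int countTokens(String input) {
        Map<String, Integer> cache = splitCaches
                ? PER_FLAG.computeIfAbsent(phrases, flag -> new HashMap<>())
                : SHARED;
        Integer cached = cache.get(input);
        if (cached != null) {
            return cached;             // a cached decision skips the predicate
        }
        boolean quoted = input.length() >= 2
                && input.startsWith("\"") && input.endsWith("\"");
        if (phrases && quoted) {
            return 1;                  // predicate passed: a single phrase token
        }
        // Fallback: split into terms and remember that decision.
        int terms = input.replace("\"", "").trim().split("\\s+").length;
        cache.put(input, terms);
        return terms;
    }
}

final class ToyLexerDemo {
    public static void main(String[] args) {
        for (boolean split : new boolean[] {false, true}) {
            ToyLexer lexer = new ToyLexer(split);
            String text = "\"foo bar\"";
            int enabled = lexer.countTokens(text);
            lexer.phrases = false;
            int disabled = lexer.countTokens(text);
            lexer.phrases = true;
            int reenabled = lexer.countTokens(text);
            System.out.println("splitCaches=" + split + ": "
                    + enabled + ", " + disabled + ", " + reenabled);
        }
    }
}
```

With the shared cache the counts come out as 1, 2, 2, which is the sequence described in the question; with one cache per flag value they come out as 1, 2, 1. The trade-off is the same one mentioned above: the caches stay static and shared across instances, at the cost of keeping two of them.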
