97

I've been searching A LOT about this and I couldn't find anything useful that REALLY helps me build an AST. I already know that ANTLR4 doesn't build AST like ANTLR3 used to do. Everyone say: "Hey, use visitors!", but I couldn't find any example or more detailed explanation on HOW can I do this...

I have a grammar must like C, but with every commands written in Portuguese (portuga programming language). I can easily generate the parse tree using ANTLR4. My question is: What I need to do now to create an AST?

BTW, I'm using Java and IntelliJ...

EDIT1: The closest I could get was using the answer of this topic: Is there a simple example of using antlr4 to create an AST from java source code and extract methods, variables and comments? But it only prints the name of the visited methods..

Since the first attempt didn't work for me as I expected, I tried to use this tutorial from ANTLR3, but I couldn't figure out how to use StringTamplate instead of ST...

Reading the book The Definitive ANTLR 4 Reference I also couldn't find anything related to ASTs.

EDIT2: Now I have one class to create the DOT file, I just need figure out on how to use visitors properly

6
  • 2
    Could you share some of what you've tried? Commented Apr 30, 2015 at 15:11
  • @SandyGifford I edited my post trying to explain... I don't have my code right now because I just deleted what I made. Right now I only have the generated codes from ATNLR4 (parser, lexer and base visitors and listeners) Commented Apr 30, 2015 at 15:24
  • Unfortunately, I don't know anything about ANTLR (you came up in my queue) but you have raised the quality of the post! Commented Apr 30, 2015 at 15:31
  • See this answer for a discussion of "CST" vs "AST": stackoverflow.com/a/29456792/120163 The commentary at the end discusses how another actually gets an AST (essentially by walking the CST and manufacturing the AST he wants). Commented May 1, 2015 at 5:07
  • I was in a similar situation like you, after I failed to find a super simple AST example using ANTLR I have created one myself github.com/adamsiemion/antlr-java-ast Commented Nov 8, 2016 at 8:58

4 Answers 4

226

Ok, let's build a simple math example. Building an AST is totally overkill for such a task but it's a nice way to show the principle.

I'll do it in C# but the Java version would be very similar.

The grammar

First, let's write a very basic math grammar to work with:

grammar Math; compileUnit : expr EOF ; expr : '(' expr ')' # parensExpr | op=('+'|'-') expr # unaryExpr | left=expr op=('*'|'/') right=expr # infixExpr | left=expr op=('+'|'-') right=expr # infixExpr | func=ID '(' expr ')' # funcExpr | value=NUM # numberExpr ; OP_ADD: '+'; OP_SUB: '-'; OP_MUL: '*'; OP_DIV: '/'; NUM : [0-9]+ ('.' [0-9]+)? ([eE] [+-]? [0-9]+)?; ID : [a-zA-Z]+; WS : [ \t\r\n] -> channel(HIDDEN); 

Pretty basic stuff, we have a single expr rule that handles everything (precedence rules etc).

The AST nodes

Then, let's define some AST nodes we'll use. These are totally custom and you can define them in the way you want to.

Here are the nodes we'll be using for this example:

internal abstract class ExpressionNode { } internal abstract class InfixExpressionNode : ExpressionNode { public ExpressionNode Left { get; set; } public ExpressionNode Right { get; set; } } internal class AdditionNode : InfixExpressionNode { } internal class SubtractionNode : InfixExpressionNode { } internal class MultiplicationNode : InfixExpressionNode { } internal class DivisionNode : InfixExpressionNode { } internal class NegateNode : ExpressionNode { public ExpressionNode InnerNode { get; set; } } internal class FunctionNode : ExpressionNode { public Func<double, double> Function { get; set; } public ExpressionNode Argument { get; set; } } internal class NumberNode : ExpressionNode { public double Value { get; set; } } 

Converting a CST to an AST

ANTLR generated the CST nodes for us (the MathParser.*Context classes). We now have to convert these to AST nodes.

This is easily done with a visitor, and ANTLR provides us with a MathBaseVisitor<T> class, so let's work with that.

internal class BuildAstVisitor : MathBaseVisitor<ExpressionNode> { public override ExpressionNode VisitCompileUnit(MathParser.CompileUnitContext context) { return Visit(context.expr()); } public override ExpressionNode VisitNumberExpr(MathParser.NumberExprContext context) { return new NumberNode { Value = double.Parse(context.value.Text, NumberStyles.AllowDecimalPoint | NumberStyles.AllowExponent) }; } public override ExpressionNode VisitParensExpr(MathParser.ParensExprContext context) { return Visit(context.expr()); } public override ExpressionNode VisitInfixExpr(MathParser.InfixExprContext context) { InfixExpressionNode node; switch (context.op.Type) { case MathLexer.OP_ADD: node = new AdditionNode(); break; case MathLexer.OP_SUB: node = new SubtractionNode(); break; case MathLexer.OP_MUL: node = new MultiplicationNode(); break; case MathLexer.OP_DIV: node = new DivisionNode(); break; default: throw new NotSupportedException(); } node.Left = Visit(context.left); node.Right = Visit(context.right); return node; } public override ExpressionNode VisitUnaryExpr(MathParser.UnaryExprContext context) { switch (context.op.Type) { case MathLexer.OP_ADD: return Visit(context.expr()); case MathLexer.OP_SUB: return new NegateNode { InnerNode = Visit(context.expr()) }; default: throw new NotSupportedException(); } } public override ExpressionNode VisitFuncExpr(MathParser.FuncExprContext context) { var functionName = context.func.Text; var func = typeof(Math) .GetMethods(BindingFlags.Public | BindingFlags.Static) .Where(m => m.ReturnType == typeof(double)) .Where(m => m.GetParameters().Select(p => p.ParameterType).SequenceEqual(new[] { typeof(double) })) .FirstOrDefault(m => m.Name.Equals(functionName, StringComparison.OrdinalIgnoreCase)); if (func == null) throw new NotSupportedException(string.Format("Function {0} is not supported", functionName)); return new FunctionNode { Function = (Func<double, double>)func.CreateDelegate(typeof(Func<double, double>)), Argument = Visit(context.expr()) }; } } 

As you can see, it's just a matter of creating an AST node out of a CST node by using a visitor. The code should be pretty self-explanatory (well, maybe except for the VisitFuncExpr stuff, but it's just a quick way to wire up a delegate to a suitable method of the System.Math class).

And here you have the AST building stuff. That's all that's needed. Just extract the relevant information from the CST and keep it in the AST.

The AST visitor

Now, let's play a bit with the AST. We'll have to build an AST visitor base class to traverse it. Let's just do something similar to the AbstractParseTreeVisitor<T> provided by ANTLR.

internal abstract class AstVisitor<T> { public abstract T Visit(AdditionNode node); public abstract T Visit(SubtractionNode node); public abstract T Visit(MultiplicationNode node); public abstract T Visit(DivisionNode node); public abstract T Visit(NegateNode node); public abstract T Visit(FunctionNode node); public abstract T Visit(NumberNode node); public T Visit(ExpressionNode node) { return Visit((dynamic)node); } } 

Here, I took advantage of C#'s dynamic keyword to perform a double-dispatch in one line of code. In Java, you'll have to do the wiring yourself with a sequence of if statements like these:

if (node is AdditionNode) { return Visit((AdditionNode)node); } else if (node is SubtractionNode) { return Visit((SubtractionNode)node); } else if ... 

But I just went for the shortcut for this example.

Work with the AST

So, what can we do with a math expression tree? Evaluate it, of course! Let's implement an expression evaluator:

internal class EvaluateExpressionVisitor : AstVisitor<double> { public override double Visit(AdditionNode node) { return Visit(node.Left) + Visit(node.Right); } public override double Visit(SubtractionNode node) { return Visit(node.Left) - Visit(node.Right); } public override double Visit(MultiplicationNode node) { return Visit(node.Left) * Visit(node.Right); } public override double Visit(DivisionNode node) { return Visit(node.Left) / Visit(node.Right); } public override double Visit(NegateNode node) { return -Visit(node.InnerNode); } public override double Visit(FunctionNode node) { return node.Function(Visit(node.Argument)); } public override double Visit(NumberNode node) { return node.Value; } } 

Pretty simple once we have an AST, isn't it?

Putting it all together

Last but not least, we have to actually write the main program:

internal class Program { private static void Main() { while (true) { Console.Write("> "); var exprText = Console.ReadLine(); if (string.IsNullOrWhiteSpace(exprText)) break; var inputStream = new AntlrInputStream(new StringReader(exprText)); var lexer = new MathLexer(inputStream); var tokenStream = new CommonTokenStream(lexer); var parser = new MathParser(tokenStream); try { var cst = parser.compileUnit(); var ast = new BuildAstVisitor().VisitCompileUnit(cst); var value = new EvaluateExpressionVisitor().Visit(ast); Console.WriteLine("= {0}", value); } catch (Exception ex) { Console.WriteLine(ex.Message); } Console.WriteLine(); } } } 

And now we can finally play with it:

enter image description here

Sign up to request clarification or add additional context in comments.

15 Comments

@Waschbaer IMO fine-tuning the nodes manually improves the maintainability and is worth it, but you may disagree. ANTLR 3 had an AST output mode, so you could use that, if you're comfortable with the usage of an outdated tool. Or perhaps you could use metaprogramming/templating to generate some boilerplate code, but you may well end up with more work in the end. One other option is to work with the CST directly.
In many years of reading SO for tips, suggestions, and answers, this stands out as one of the best answers I've seen. Outstanding presentation of the concepts.
How did you get the MathBaseVisitor class? My grammar is called QueryParser, and I don't have a QueryParserBaseVisitor class. I have a QuertyParserBaseListener, but it's more like an in-out walker, which is practically impossible to work with. (I tried)
@Trejkaz this class is generated by ANTLR, just like the listener - you should have a QueryBaseVisitor (unless your grammar name includes that Parser part, but in that case you'll also have a QueryParserParser). IIRC the NuGet package generates it automatically, but if you run ANTLR manually, you have to add the -visitor option.
@Johannes I found the solution, I uploaded it here raw with all the generated output and dependencies. It wasn't meant to look pretty, almost everything is in the same file. The two Visit methods are unrelated, they happen to have the same name, but as both classes are visitors, well, they use the same name :)
|
10

I have created a small Java project that allows you to test your ANTLR grammar instantly by compiling the lexer and parser generated by ANTLR in-memory. You can just parse a string by passing it to the parser, and it will automatically generate an AST from it which can then be used in your application.

For the purpose of reducing the size of the AST, you could use a NodeFilter to which you could add the production-rule names of the non-terminals that you would like to be considered when constructing the AST.

The code and some code examples can be found at https://github.com/julianthome/inmemantlr

Hope the tool is useful ;-)

5 Comments

I tried using your 'small' project but you have hundreds of files with no comments, it is impossible to figure out what's going on. You have everything wrapped in your own version of wrapper functions. A user would have to download your entire project and use it as-is, and learn to use your new classes (GenericParser??). I can't use your code to figure out how to create my own AST in my own code.
Hi John inmemantlr's code base consists of 48 JavaClasses (find inmemantlr-api/src/main -name "*.java" | nl) most of which are pretty well-commented (javadoc.io/doc/com.github.julianthome/inmemantlr-api/1.3.9). To illustrate the points you've mentioned above (API usage, ParseTree creation), I have provided explanations in the README.md and provided test cases in github.com/julianthome/inmemantlr/tree/master/inmemantlr-api/…. However, in case you have issues with the tool, would be happy to help you. Please just drop me an email or create a an issue on github.
Could it be that you've pulled the grammars-v4 submodule (please have a look at inmemantlr-api/src/test/resources/grammars-v4)? Actually, this module is not part of inmemantlr's code-base; it is used to make sure that inmemantlr works on all the grammars-v4 grammars. However, the submodule is not pulled per default when executing git clone https://github.com/julianthome/inmemantlr.
@Julian amazing tools. It's really powerful and easy to use. Thanks a ton for sharing it with the community. Would like to see more examples in the wiki.
@VaibhavJain thank you very much. If you have some suggestions on how to improve the documentation/tool/API, I would be glad if you could create an issue on the github project page ;). Thanks again.
2

I have found two simple ways, focused on the functionality available in the TestRig.java file of antlr4.

  1. Via terminal

This is my example for parsing C++ with the corresponding CPP14.g4 grammar file java -cp .:antlr-4.9-complete.jar org.antlr.v4.gui.TestRig CPP14 translationunit -tree filename.cpp. If you omit the filename.cpp, the rig reads from stdin. "translationunit" is the start rule name of the CPP14.g4 grammar file I use.

  1. Via Java

I used parts of code found in the TestRig.java file. Let's suppose again that we have a string of C++ source code from which we want to produce the AST (you can also read directly from a file).

String source_code = "...your cpp source code..."; CodePointCharStream stream_from_string = CharStreams.fromString(source_code); CPP14Lexer lexer = new CPP14Lexer(new ANTLRInputStream(source_code)); CommonTokenStream tokens = new CommonTokenStream(lexer); CPP14Parser parser = new CPP14Parser(tokens); String parserName = "CPP14Parser"; ClassLoader cl = Thread.currentThread().getContextClassLoader(); Class<? extends Parser> parserClass = null; parserClass = cl.loadClass(parserName).asSubclass(Parser.class); String startRuleName = "translationunit"; //as specified in my CPP14.g4 file Method startRule = parserClass.getMethod(startRuleName); ParserRuleContext tree = (ParserRuleContext)startRule.invoke(parser, (Object[])null); System.out.println(tree.toStringTree(parser)); 

My imports are:

import java.lang.reflect.Method; import org.antlr.v4.runtime.CommonTokenStream; import org.antlr.v4.runtime.CharStreams; import org.antlr.v4.runtime.CodePointCharStream; import org.antlr.v4.runtime.ANTLRInputStream; import org.antlr.v4.runtime.ParserRuleContext; import org.antlr.v4.runtime.Parser; 

All these require that you have produced the necessary files (lexer, parser etc) with the command java -jar yournaltrfile.jar yourgrammar.g4 and that you have then compiled all the *.java files.

2 Comments

Why use reflection when the class/method are available at compile time?
Your imports don't include CPP14Parser or CPP14Lexer. Are we supposed to implement it ourselves or did you miss it?
0

I've converted the code of Terrence Parr's book "Language Implementation Pattern" for the pattern "Tree Grammar" which is based on antlr 3 (source-code under tpdsl-code/walking/tree-grammar) to antlr 4 using a Visitor and the "Homogeneous AST" pattern.

Here is the grammar:

VecMath.g4

// START: header grammar VecMath; tokens { VEC } // define imaginary token for vector literal // END: header // START: stat prog: stat+ ; // build list of stat trees stat: ID assign='=' expr #StatAssign // '=' is operator subtree root | print='print' expr #StatPrint // 'print' is subtree root ; // END: stat // START: expr expr: left=expr op=('*'|'.') right=expr #ExprMult // '*', '.' are roots | left=expr op='+' right=expr #ExprAdd // '+' is root node | '[' expr (',' expr)* ']' #ExprVec // VEC is root | INT #ExprInt | ID #ExprId ; // END: expr ID : 'a'..'z'+ ; INT : '0'..'9'+ ; WS : (' '|'\r'|'\n')+ -> skip ; 

The Visitor

package walking.v4.vecmath_ast.impl; import org.antlr.v4.runtime.CommonToken; import walking.v4.vecmath_ast.antlr.VecMathBaseVisitor; import walking.v4.vecmath_ast.antlr.VecMathParser.ExprAddContext; import walking.v4.vecmath_ast.antlr.VecMathParser.ExprContext; import walking.v4.vecmath_ast.antlr.VecMathParser.ExprIdContext; import walking.v4.vecmath_ast.antlr.VecMathParser.ExprIntContext; import walking.v4.vecmath_ast.antlr.VecMathParser.ExprMultContext; import walking.v4.vecmath_ast.antlr.VecMathParser.ExprVecContext; import walking.v4.vecmath_ast.antlr.VecMathParser.ProgContext; import walking.v4.vecmath_ast.antlr.VecMathParser.StatAssignContext; import walking.v4.vecmath_ast.antlr.VecMathParser.StatContext; import walking.v4.vecmath_ast.antlr.VecMathParser.StatPrintContext; import walking.v4.vecmath_ast.antlr.VecMathParser; public class VecMathBuildASTVisitor extends VecMathBaseVisitor<AST> { @Override public AST visitProg(ProgContext ctx) { AST ast = new AST(); for (StatContext stmt : ctx.stat()) { ast.addChild(visit(stmt)); } return ast; } @Override public AST visitStatAssign(StatAssignContext ctx) { AST ast = new AST(ctx.assign); ast.addChild(new AST(ctx.ID().getSymbol())); ast.addChild(visit(ctx.expr())); return ast; } @Override public AST visitStatPrint(StatPrintContext ctx) { AST ast = new AST(ctx.print); ast.addChild(visit(ctx.expr())); return ast; } @Override public AST visitExprMult(ExprMultContext ctx) { AST ast = new AST(ctx.op); ast.addChild(visit(ctx.left)); ast.addChild(visit(ctx.right)); return ast; } @Override public AST visitExprAdd(ExprAddContext ctx) { AST ast = new AST(ctx.op); ast.addChild(visit(ctx.left)); ast.addChild(visit(ctx.right)); return ast; } @Override public AST visitExprVec(ExprVecContext ctx) { AST ast = new AST(new CommonToken(VecMathParser.VEC, "VEC")); for (ExprContext expr : ctx.expr()) { ast.addChild(visit(expr)); } return ast; } @Override public AST visitExprId(ExprIdContext ctx) { AST ast = new AST(ctx.ID().getSymbol()); return ast; } @Override public AST visitExprInt(ExprIntContext ctx) { AST ast = new AST(ctx.INT().getSymbol()); return ast; } } 

The AST

Essentially just adjusted the Tokens compared to the original version in tpdsl-code/IR/Homo

package walking.v4.vecmath_ast.impl; import org.antlr.v4.runtime.CommonToken; import org.antlr.v4.runtime.Token; import walking.v4.vecmath_ast.antlr.VecMathParser; import java.util.ArrayList; import java.util.List; // Homogenous AST node type public class AST { Token token; // from which token do we create this node? List<AST> children; // normalized list of children public AST() { ; } // for making nil-rooted nodes public AST(Token token) { this.token = token; } /** Create node from token type; used mainly for imaginary tokens */ public AST(int tokenType) { this.token = new CommonToken(tokenType); } /** external visitors execute the same action for all nodes * with same node type while walking */ public int getNodeType() { return token.getType(); } public void addChild(AST t) { if (children == null) children = new ArrayList<>(); children.add(t); } public List<AST> getChildren() { return children; } /** to represent flat lists. A list is a subtree w/o a root, which we simulate * with a nil root node. A nil node is a node with token == null. */ public boolean isNil() { return token == null; } /** Compute string for single node */ public String toString() { String typeName = VecMathParser.VOCABULARY.getSymbolicName(getNodeType()); typeName = typeName == null ? token.getText() : typeName; return token != null ? "<" +typeName +", '" + token.getText() +"'>": "nil"; } /** Compute string for a whole tree */ public String toStringTree() { if (children == null || children.size() == 0) return this.toString(); StringBuffer buf = new StringBuffer(); if (!isNil()) { buf.append('('); buf.append(this.toString()); buf.append(' '); } for (int i = 0; i < children.size(); i++) { AST t = (AST) children.get(i); // normalized (unnamed) children if (i>0) buf.append(' '); buf.append(t.toStringTree()); } if (!isNil()) buf.append(')'); return buf.toString(); } } 

The Test class

package walking.v4.vecmath_ast; import org.antlr.v4.runtime.CommonTokenStream; import org.antlr.v4.runtime.tree.ParseTree; import org.antlr.v4.runtime.CharStream; import org.antlr.v4.runtime.CharStreams; import walking.v4.vecmath_ast.antlr.VecMathLexer; import walking.v4.vecmath_ast.antlr.VecMathParser; import walking.v4.vecmath_ast.impl.AST; import walking.v4.vecmath_ast.impl.VecMathBuildASTVisitor; public class Test { public static void main(String[] args) throws Exception { CharStream input = CharStreams.fromFileName(args[0]); VecMathLexer lexer = new VecMathLexer(input); CommonTokenStream tokens = new CommonTokenStream(lexer); VecMathParser parser = new VecMathParser(tokens); ParseTree tree = parser.prog(); for (AST ast : new VecMathBuildASTVisitor().visit(tree).getChildren()) { System.out.println(ast.toStringTree()); } } } 

Test input

x = 3 + 4 y = 3 + 4 + 5 a = 3 * 4 a = 3 * 4 * 5 c = 3 * 4 + 5 print x * [2, 3, 4] print x * [2+5, 3, 4] 

yields:

(<=, '='> <ID, 'x'> (<+, '+'> <INT, '3'> <INT, '4'>)) (<=, '='> <ID, 'y'> (<+, '+'> (<+, '+'> <INT, '3'> <INT, '4'>) <INT, '5'>)) (<=, '='> <ID, 'a'> (<*, '*'> <INT, '3'> <INT, '4'>)) (<=, '='> <ID, 'a'> (<*, '*'> (<*, '*'> <INT, '3'> <INT, '4'>) <INT, '5'>)) (<=, '='> <ID, 'c'> (<+, '+'> (<*, '*'> <INT, '3'> <INT, '4'>) <INT, '5'>)) (<print, 'print'> (<*, '*'> <ID, 'x'> (<VEC, 'VEC'> <INT, '2'> <INT, '3'> <INT, '4'>))) (<print, 'print'> (<*, '*'> <ID, 'x'> (<VEC, 'VEC'> (<+, '+'> <INT, '2'> <INT, '5'>) <INT, '3'> <INT, '4'>))) 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.