I have to code a lexer in Java for a dialect of BASIC.
I group all the token types in an enum:

public enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-");
    // ... other token types

    private final String pattern;

    TokenType(String pattern) { this.pattern = pattern; }

    public String getPattern() { return pattern; }
}

The constant is the token type's name, and in the brackets is the regex I use to match that type.
For example, to match the INT type I use "-?[0-9]+".

But now I have a problem. I build one big pattern by appending the regex of every TokenType to a StringBuffer with this:

private String pattern() {
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType token : TokenType.values())
        tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
    String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
    return tokenPatternsString;
}

So it returns a String like:

(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT).... 

Now I use this string to create a Pattern:

Pattern pattern = Pattern.compile(pattern());

Then I create a Matcher:

Matcher match = pattern.matcher("line of code");

Now I want to match all the token types and collect them into an ArrayList of Token. If the code's syntax is correct, this produces an ArrayList of Token (token name, value).
But I don't know how to exit the while-loop and print an error when the syntax is incorrect.
This is the piece of code used to create the ArrayList of Token:

private void lex() {
    ArrayList<Token> tokens = new ArrayList<Token>();
    int tokenSize = TokenType.values().length;
    int counter = 0;

    // Iterate over arrayLinee (an ArrayList of String) to get matches of pattern
    for (String linea : arrayLinee) {
        counter = 0;
        Matcher match = pattern.matcher(linea);
        while (match.find()) {
            System.out.println(match.group(1));
            counter = 0;
            for (TokenType token : TokenType.values()) {
                counter++;
                if (match.group(token.name()) != null) {
                    tokens.add(new Token(token, match.group(token.name())));
                    counter = 0;
                    continue;
                }
            }
            if (counter == tokenSize) {
                System.out.println("Syntax Error in line : " + linea);
                break;
            }
        }
        tokenList.add("EOL");
    }
}

The code doesn't stop when the for-loop iterates over all the TokenType values without matching any of their regexes. How can I report an error when the syntax isn't correct?
Or do you know where I can find information on developing a lexer?

4 Answers

All you need to do is add an extra INVALID token at the end of your enum with a regex like ".+" (match anything). Because the alternatives in the combined pattern are tried in order, INVALID will only match when no other token does. You then check whether the token you just matched is the INVALID token, and if so, report the error.
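A minimal sketch of that idea, using the same named-group technique as the question (the enum here is cut down to a few constants, and the `$oops` input is just an invented example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvalidTokenDemo {
    // Same shape as the enum in the question; INVALID must come last
    // so that every other alternative is tried first.
    enum TokenType {
        INT("-?[0-9]+"),
        PLUS("\\+"),
        WHITESPACE("[ \t]+"),
        INVALID(".+");  // catch-all: matches anything the other tokens missed

        final String pattern;
        TokenType(String pattern) { this.pattern = pattern; }
    }

    public static void main(String[] args) {
        // Build the combined pattern exactly as in the question's pattern() method.
        StringBuilder sb = new StringBuilder();
        for (TokenType t : TokenType.values())
            sb.append("|(?<").append(t.name()).append(">").append(t.pattern).append(")");
        Pattern pattern = Pattern.compile(sb.substring(1));

        Matcher m = pattern.matcher("1 + $oops");
        while (m.find()) {
            if (m.group("INVALID") != null) {
                System.out.println("Syntax error at: " + m.group("INVALID"));
                break;  // stop lexing at the first invalid token
            }
        }
    }
}
```

Because `.+` is tried only after every real token has failed at the current position, the loop reaches the `INVALID` branch exactly when the input contains something no token type recognizes.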


If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.


It's a university project and I can't use libraries or tools external to Java. :( That is why I have to code the entire lexer and parser for a BASIC dialect myself.

If you are writing a full lexer, I'd recommend using an existing grammar builder. ANTLR is one solution, but I personally recommend parboiled instead, which lets you write grammars in pure Java.


It's a university project and I can't use libraries or tools external to Java. :( That is why I have to code the entire lexer and parser for a BASIC dialect myself.
Not sure if this was answered, or you came to an answer, but lexing is usually broken into two distinct phases: the scanning phase and the parsing phase. You can combine them into a single pass (regex matching), but you'll find that a single-pass lexer has weaknesses if you need to do anything more than the most basic string translations.

In the scanning phase you break the character sequence apart based on the specific tokens you've specified. You should have included an example of the text you're trying to parse, but Wikipedia has a good example of a simple text lexer that turns a sentence into tokens (e.g. str.split(' ')). So with the scanner you tokenize the block of text into chunks by spaces (this should almost always be the first action), and then you tokenize further based on other tokens, such as what you're attempting to match.

Then the parsing/evaluation phase iterates over each token and decides what to do with it depending on the business logic, syntax rules, etc. that you set. This could be evaluating some sort of math function (e.g. max(3,2)); a more common example is query-language building. You might make a web app with a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are also commonly used in IDEs for code hinting and auto-completion.
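The two phases might be sketched like this (the whitespace-splitting rule and the three categories are invented purely for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseSketch {
    // Phase 1 (scanning): break the line into chunks by spaces,
    // as in the str.split(' ') example above.
    static List<String> scan(String line) {
        List<String> tokens = new ArrayList<>();
        for (String piece : line.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Phase 2 (parsing/evaluation): decide what each chunk means.
    static String classify(String token) {
        if (token.matches("-?[0-9]+")) return "NUMBER";
        if (token.matches("[A-Za-z]+")) return "WORD";
        return "SYMBOL";
    }

    public static void main(String[] args) {
        for (String t : scan("PRINT 42 +")) {
            System.out.println(classify(t) + ": " + t);
        }
        // prints:
        // WORD: PRINT
        // NUMBER: 42
        // SYMBOL: +
    }
}
```

The point is that the scanner knows nothing about meaning, only about boundaries; everything that interprets a token lives in the second phase.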

This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.

