I have to code a Lexer in Java for a dialect of BASIC.
I group all the token types in an enum:
    public enum TokenType {
        INT("-?[0-9]+"),
        BOOLEAN("(TRUE|FALSE)"),
        PLUS("\\+"),
        MINUS("\\-"),
        //others.....
    }

The constant name is the token type's name, and the string in parentheses is the regex I use to match that type.
For example, to match the INT type I use "-?[0-9]+".
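The enum listing shows only the constants. A minimal sketch of how the rest might look, assuming a private field and a constructor that store the regex, plus the getPattern() accessor that the pattern-building code relies on (the field name and constructor shape are assumptions):

```java
// Sketch only: the field and constructor are assumed; the original
// listing shows just the constants.
enum TokenType {
    INT("-?[0-9]+"),
    BOOLEAN("(TRUE|FALSE)"),
    PLUS("\\+"),
    MINUS("\\-");
    // other constants omitted

    // Regex source used to recognise this token type
    private final String pattern;

    TokenType(String pattern) {
        this.pattern = pattern;
    }

    public String getPattern() {
        return pattern;
    }
}
```

With this in place, `java.util.regex.Pattern.matches(TokenType.INT.getPattern(), "-42")` returns true. One thing to watch: Java named capturing groups accept only ASCII letters and digits, so a constant name containing an underscore could not be reused directly as a group name.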
But now I have a problem. I combine the regexes of all the token types in a StringBuffer like this:
    private String pattern() {
        StringBuffer tokenPatternsBuffer = new StringBuffer();
        for (TokenType token : TokenType.values())
            tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
        // Drop the leading "|"
        String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
        return tokenPatternsString;
    }

So it returns a String like:
    (?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....

Now I use this string to create a Pattern:
    Pattern pattern = Pattern.compile(STRING);

Then I create a Matcher:
    Matcher match = pattern.matcher("line of code");

Now I want to match all the token types and collect them into an ArrayList of Token. If the code's syntax is correct, the lexer returns an ArrayList of Token (token name, value).
But I don't know how to exit the while loop when the syntax is incorrect and then print an error.
This is the piece of code used to create the ArrayList of Token:
    private void lex() {
        ArrayList<Token> tokens = new ArrayList<Token>();
        int tokenSize = TokenType.values().length;
        int counter = 0;
        // Iterate over arrayLinee (an ArrayList of String) to get matches of the pattern
        for (String linea : arrayLinee) {
            counter = 0;
            Matcher match = pattern.matcher(linea);
            while (match.find()) {
                System.out.println(match.group(1));
                counter = 0;
                for (TokenType token : TokenType.values()) {
                    counter++;
                    if (match.group(token.name()) != null) {
                        tokens.add(new Token(token, match.group(token.name())));
                        counter = 0;
                        continue;
                    }
                }
                if (counter == tokenSize) {
                    System.out.println("Syntax Error in line : " + linea);
                    break;
                }
            }
            tokenList.add("EOL");
        }
    }

The while loop never reaches the break: even when the for loop iterates over every TokenType and part of the line matches none of their regexes, no error is printed. How can I report an error if the syntax isn't correct?
Or do you know where I can find information on developing a lexer?
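One possible direction, as a hedged sketch rather than a definitive lexer: `Matcher.find()` silently skips any text that matches no alternative, so whenever it returns true one of the named groups is non-null and the `counter == tokenSize` check cannot fire. An alternative is to track the current position and require a token to start exactly there, using `region()` and `lookingAt()`. The `Token` class, the `LexerSketch` class, the `lexLine` method, the choice to skip whitespace between tokens, and reporting the error through an exception are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Reduced enum for the sketch; constructor and accessor are assumed.
enum TokenType {
    INT("-?[0-9]+"), BOOLEAN("(TRUE|FALSE)"), PLUS("\\+"), MINUS("\\-"), PRINT("PRINT");

    private final String pattern;
    TokenType(String pattern) { this.pattern = pattern; }
    String getPattern() { return pattern; }
}

// Hypothetical Token holder: (token type, matched text).
class Token {
    final TokenType type;
    final String value;
    Token(TokenType type, String value) { this.type = type; this.value = value; }
    @Override public String toString() { return type + "(" + value + ")"; }
}

class LexerSketch {
    private static final Pattern PATTERN = Pattern.compile(buildPattern());

    // Joins every token regex into one alternation of named groups.
    private static String buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (TokenType t : TokenType.values()) {
            if (sb.length() > 0) sb.append('|');
            sb.append("(?<").append(t.name()).append('>').append(t.getPattern()).append(')');
        }
        return sb.toString();
    }

    // Lexes one line; throws if some non-whitespace text starts no token.
    static List<Token> lexLine(String line) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = PATTERN.matcher(line);
        int pos = 0;
        while (pos < line.length()) {
            // Assumption: whitespace only separates tokens and is skipped.
            if (Character.isWhitespace(line.charAt(pos))) { pos++; continue; }
            m.region(pos, line.length());
            // lookingAt() must match at the region start; it does not skip ahead.
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                    "Syntax error at column " + pos + " in line: " + line);
            }
            for (TokenType t : TokenType.values()) {
                String text = m.group(t.name());
                if (text != null) {
                    tokens.add(new Token(t, text));
                    break; // first non-null named group identifies the token
                }
            }
            pos = m.end();
        }
        return tokens;
    }
}
```

Under these assumptions, `LexerSketch.lexLine("PRINT 1 + 2")` yields the tokens PRINT, INT, PLUS, INT, while `LexerSketch.lexLine("PRINT 1 ? 2")` throws at the `?`, and the caller can catch the exception to print the error and stop.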