tokenizing mathematical equation using regex

Question

I'm trying to split an equation string into tokens. I've found a good starting point: '([A-Za-z]+|[0-9.]+|[&=><\|!]+|\S)'. However, it has trouble with negative numbers:

turns: '5--4=sin(2+3)'
into:  ['5','-','-','4','=','sin','(','2','+','3',')']
want:  ['5','-','-4','=','sin','(','2','+','3',')']

and also

turns: '-3+3'
into:  ['-','3','+','3']
want:  ['-3','+','3']

It looks like my regex needs something that checks whether there is a number to the left of the '-' and, if not, keeps the '-' with the next number (note that '-3' has nothing to its left). Can this be done using regex? Or is there a better tool to split this up in .NET?

Community · Accepted Answer · 2017-05-23 12:33:53Z

You are not approaching the problem correctly. The result you actually got is the correct one.

-3+3 should parse to:

    operator binary +
    |
    +-- operator unary -
    |   |
    |   +-- 3
    |
    +-- 3

It is much easier to reason about math expressions this way, and you'll avoid many ambiguities. Just let - always be a token on its own, and let the parser treat it as either a binary minus or a unary negation operator depending on where it appears.
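As a rough illustration of this approach, here is a minimal Python sketch (the question targets .NET, but the pattern itself should carry over to .NET's Regex class largely unchanged). It reuses the pattern from the question as-is, so - is always a separate token, and lets a tiny parser decide whether each - is binary or unary. The function names `tokenize` and `parse` and the tuple-based tree are assumptions made for this example, and the parser only handles numbers, + and -.

```python
import re

# Pattern copied from the question. '-' is not covered by the first three
# branches, so it falls through to \S and is always a token of its own.
TOKEN_RE = re.compile(r'([A-Za-z]+|[0-9.]+|[&=><\|!]+|\S)')


def tokenize(text):
    """Split the input into tokens; whitespace is skipped."""
    return TOKEN_RE.findall(text)


def parse(tokens):
    """Parse numbers, '+' and '-' into a nested tuple tree (illustrative only)."""
    pos = 0

    def unary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError('unexpected end of input')
        if tokens[pos] == '-':
            # A '-' where an operand is expected is the unary negation operator.
            pos += 1
            return ('neg', unary())
        value = float(tokens[pos])
        pos += 1
        return value

    def additive():
        nonlocal pos
        left = unary()
        # A '+' or '-' that follows a complete operand is a binary operator.
        while pos < len(tokens) and tokens[pos] in ('+', '-'):
            op = tokens[pos]
            pos += 1
            left = (op, left, unary())
        return left

    tree = additive()
    if pos != len(tokens):
        raise ValueError(f'unexpected token {tokens[pos]!r}')
    return tree


print(tokenize('-3+3'))          # ['-', '3', '+', '3']
print(parse(tokenize('-3+3')))   # ('+', ('neg', 3.0), 3.0)
print(parse(tokenize('5--4')))   # ('-', 5.0, ('neg', 4.0))
```

The lexer stays context-free; the distinction between the two meanings of - comes entirely from the parser's position in the grammar.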

See here for a related answer of mine which approaches the problem this way (it uses ANTLR but the lexing pass does exactly what I'm advising you to do).

Sergey Kalinichenko · Accepted Answer · 2017-01-12 20:21:19Z

Regex is not powerful enough to do what you want in all contexts. Although you can make regex recognize + or - as part of an integer literal, for example, by adding an optional [+-]? in front of a digit sequence, the resultant regex would opt to tokenize '-3+3' as ['-3', '+3'] (demo).
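The effect can be reproduced with a short Python sketch. The pattern below is a hypothetical variant of the one in the question, with [+-]? added in front of the number branch; `signed_tokens` is a name made up for this example.

```python
import re

# Hypothetical variant of the question's pattern: the number branch now
# accepts an optional leading sign.
SIGNED_RE = re.compile(r'([A-Za-z]+|[+-]?[0-9.]+|[&=><\|!]+|\S)')


def signed_tokens(text):
    return SIGNED_RE.findall(text)


print(signed_tokens('-3+3'))  # ['-3', '+3']
print(signed_tokens('5-4'))   # ['5', '-4']
```

In both cases the sign is absorbed into the number, so the binary operator between the two operands disappears from the token stream.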

Using a lexer generator should fix this problem; alternatively, you can deal with "bundling" unary operators with their operands in the parser.
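If a token list with signed numbers is really what is needed, one possible way to do the "bundling" is a small pass over the tokens after the regex has run, looking at the previous token to decide whether a - is unary. The following Python sketch assumes the pattern from the question; `bundle_unary_minus` and `ends_operand` are hypothetical helper names, not part of any library.

```python
import re

# Pattern copied from the question; '-' always comes out as its own token.
TOKEN_RE = re.compile(r'([A-Za-z]+|[0-9.]+|[&=><\|!]+|\S)')
NUMBER_RE = re.compile(r'[0-9.]+')


def ends_operand(token):
    """True if the token can end an operand: a number, a name or ')'."""
    return token == ')' or token[-1].isalnum() or token[-1] == '.'


def bundle_unary_minus(tokens):
    """Merge a '-' into the following number when no operand precedes it."""
    out = []
    i = 0
    while i < len(tokens):
        prev_is_operand = i > 0 and ends_operand(tokens[i - 1])
        next_is_number = (i + 1 < len(tokens)
                          and NUMBER_RE.fullmatch(tokens[i + 1]) is not None)
        if tokens[i] == '-' and not prev_is_operand and next_is_number:
            out.append('-' + tokens[i + 1])  # unary minus: glue it to the number
            i += 2
        else:
            out.append(tokens[i])            # everything else passes through
            i += 1
    return out


print(bundle_unary_minus(TOKEN_RE.findall('5--4=sin(2+3)')))
# ['5', '-', '-4', '=', 'sin', '(', '2', '+', '3', ')']
print(bundle_unary_minus(TOKEN_RE.findall('-3+3')))
# ['-3', '+', '3']
```

Note that this only bundles a - with a numeric literal. A negated parenthesised expression such as -(2+3) still leaves - as a separate token, so the parser has to handle unary minus anyway.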

7 Comments

Lucas Trzesniewski Over a year ago

Oh come on, regex is perfectly suited for lexing: it's a Chomsky type 3 problem. OP doesn't realize that the result he got is actually exactly what he needs. In -3, the - is actually the unary negation operator.

Sergey Kalinichenko Over a year ago

@LucasTrzesniewski Of course, regex is perfectly suited for lexing, but OP wants his lexing to be context-sensitive. He wants the two minuses in -3-3 treated differently, which is neither what he needs nor what regex can deliver.

Lucas Trzesniewski Over a year ago

Yes, after reading your answer for a second time now I get what you meant. BTW a lexer generator won't just magically fix the "problem" either, most just use regex under the hood ;)

Sergey Kalinichenko Over a year ago

@LucasTrzesniewski I had ANTLR 3 in mind: I remember it producing some nice "magic" for me in terms of lexing complex expressions. I think they do it by adding an extra layer of "smartness" on top of regular expressions, though.

Lucas Trzesniewski Over a year ago

Hmmm... there's no such thing in ANTLR as far as I can tell. It lets you inject code into lexer expressions though by using inline blocks {...} or predicates {...}? (I don't remember if the v3 lexer used {...}? or {...}?=> though), so you can achieve custom magic that way.
