Better diagnostics through operator-precedence parsing

All C-style programming languages have too many levels of operator precedence. This confuses their programmers, who have to remember not just the basic PEMDAS stuff but the really arcane precedences, too:

x & y == z doesn’t mean (x & y) == z but rather x & (y == z). That is, it generally Does The Wrong Thing.
x == y == z and x < y < z have the “correct” natural precedence, but if you see one of these in code, it’s almost certainly a bug, because it’s not asking whether x, y, and z have the same value; it’s asking whether (x == y) has the same value as z!
Some expressions are simply bizarre. I’d bet you have no intuition for the meaning of x < y | z, do you? Of course you can reason that < has roughly the precedence of ==, and | has roughly the precedence of &, so this should parse roughly the same as x == y & z. But that’s not intuition, it’s reasoning: a conscious process that requires active engagement.
!x && y is another expression that’s easy to reason about, once you’re on the lookout for it; but at a glance it’s easy to mistake for !(x && y).
Even limiting ourselves to simple arithmetic, it’s easy to misinterpret an expression like x / y * z.

Jonathan Müller wrote on this subject in “Operator precedence is broken” (July 2017); it may be worth your time to go read his post before tackling the rest of this one.

All five of the problems above are solvable within the operator-precedence subsystem! The key observation is that all of these ways of “making an expression problematic” boil down to one thing: the expression contains two adjoining operators whose resolution is non-obvious, such as & next to ==, or / next to * (but not the reverse).

By “next to,” of course I don’t mean that the operators must appear lexically consecutive in the source code: we also want to diagnose x / a->b * z, despite the -> operator’s coming lexically between / and *. It turns out that the meaning of “next to” is exactly captured by this definition:

Two operators @ and $ within the same expression are said to be “next to” each other if parsing the expression with an operator-precedence parser requires at some point comparing the relative precedences of @ and $.

In this post I construct a simple operator-precedence parser using Dijkstra’s shunting-yard algorithm to parse an expression grammar almost, but not quite, entirely unlike C’s; and then modify it to diagnose all five of the issues above (and many more).

Lexer

Here’s a simple lexer for a C-style expression grammar. A convenient thing about C’s lexer is that every operator token is either a single byte, or else a string where every prefix of that string is also a valid operator. For example, <<= is a valid token; and so are its prefixes << and <. Therefore our lexer never needs to “look ahead” more than a single character.
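The prefix property is easy to check mechanically. Here is a minimal sketch, assuming a hypothetical helper missing_prefixes and a made-up sample list (neither is part of the lexer below), that reports any proper prefix of an operator that is not itself an operator:

```python
# Hypothetical helper (not from the post): list every proper prefix of a
# multi-character operator that is not itself in the operator list.
def missing_prefixes(operators):
    ops = set(operators)
    return sorted({
        op[:n]
        for op in ops
        for n in range(1, len(op))
        if op[:n] not in ops
    })

# An illustrative sample, not the post's full OPERATORS list.
SAMPLE_OPERATORS = ['<', '<<', '<<=', '<=', '-', '->', '->*', '=', '==']

print(missing_prefixes(SAMPLE_OPERATORS))            # []
print(missing_prefixes(SAMPLE_OPERATORS + ['...']))  # ['.', '..']
```

An empty result means one character of lookahead is enough for that list. A token such as ..., added without its prefixes, breaks the property and would need extra lookahead.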

Since this lexer is merely an uninteresting prerequisite for the fun parsing part, we’ll impose no structure on our non-operator tokens: alphanumeric tokens like 42, 0x7, foo, if, 5g are all treated uniformly. In fact, since we use Python’s built-in str.isalnum, we’ll consider underscore to be an (unrecognized) operator, not part of an identifier.

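    OPERATORS = [
        '+', '-', '*', '/', '%',
        '<', '<=', '>', '>=', '==',
        # etc.
    ]

    class Lexer:
        def tokenize(self, chars):
            self.token = ''  # the token accumulated so far
            for ch in chars:
                if ch.isspace():
                    # Whitespace always ends the pending token.
                    yield from self.emit_if(True)
                elif ch.isalnum():
                    # An alphanumeric character ends a pending operator token.
                    yield from self.emit_if(not self.token.isalnum())
                    self.token += ch
                else:
                    # An operator character ends a pending alphanumeric token...
                    yield from self.emit_if(self.token.isalnum())
                    # ...and ends a pending operator that it cannot extend.
                    yield from self.emit_if((self.token + ch) not in OPERATORS)
                    self.token += ch
            yield from self.emit_if(True)

        def emit_if(self, b):
            # Yield the pending token (if any) when b is true, then reset it.
            if b and self.token:
                yield self.token
                self.token = ''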

Test with:

    while True:
        line = input()
        tokens = Lexer().tokenize(line)
        print(' '.join(tokens))

For example, entering “abc*d<=-ef/g” should print “abc * d <= - ef / g.”

Parser (without diagnostics)

Let’s make a simple shunting-yard parser. It will consume a stream of tokens in infix order (such as the stream generated by our previous step’s lexer) and produce a stream of the same tokens rearranged in postfix (RPN) order. In the input stream unary and binary - are represented identically; but in the output stream we’ll distinguish unary operators by the prefix U. Thus the output “a b U- -” will correspond to the input “a - -b,” while the output “a b - U-” will correspond to the input “-(a-b).”

The parser needs to know the precedence of each operator. We encapsulate that comparison in a helper function. Notice that unary operators are given to has_higher_precedence already encoded with the U prefix.

    PREC = {
        k: v for v, ks in enumerate([
            ['||'],
            ['&&'],
            ['|'],
            ['^'],
            ['&'],
            ['==', '!='],
            ['<', '<=', '>', '>='],
            ['<=>'],
            ['<<', '>>'],
            ['+', '-'],
            ['*', '/', '%'],
            ['.*', '->*'],
            ['U'+t for t in ['+', '-', '!', '~', '*', '&']],
            ['.', '->'],
        ]) for k in ks
    }
    PREC['('] = -float('inf')

    def has_higher_precedence(a, b):
        return PREC[a] >= PREC[b]

This code assumes every operator is left-associative. Supporting right-associative operators such as = or ** or => simply requires inspecting the associativity of a:

    def has_higher_precedence(a, b):
        if a in ['=', '**', '=>']:
            return PREC[a] > PREC[b]
        else:
            return PREC[a] >= PREC[b]
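To see what the strict comparison buys, here is a stripped-down sketch that handles binary operators only. The names DEMO_PREC, RIGHT_ASSOC, and to_postfix are made up for this illustration, and the two-operator table is an assumption, not the post's PREC:

```python
# Illustrative two-operator table (an assumption for this demo).
DEMO_PREC = {'=': 0, '-': 1}
RIGHT_ASSOC = {'='}

def to_postfix(tokens):
    out, stack = [], []
    for token in tokens:
        if token not in DEMO_PREC:
            out.append(token)  # operand: straight to the output
            continue
        while stack:
            top = stack[-1]
            if top in RIGHT_ASSOC:
                # Strict comparison: an equal-precedence operator stays put.
                pop = DEMO_PREC[top] > DEMO_PREC[token]
            else:
                pop = DEMO_PREC[top] >= DEMO_PREC[token]
            if not pop:
                break
            out.append(stack.pop())
        stack.append(token)
    while stack:
        out.append(stack.pop())
    return ' '.join(out)

print(to_postfix('a - b - c'.split()))  # a b - c -
print(to_postfix('a = b = c'.split()))  # a b c = =
```

The first result groups as (a - b) - c, the second as a = (b = c).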

Now the parser itself goes like this:

    def infix_to_postfix(tokens):
        stack = []
        # True whenever the next token must begin an operand
        # (a primary-expression, a unary operator, or a left parenthesis).
        expect_primary = True
        for token in tokens:
            if token.isalnum():
                if not expect_primary:
                    raise ValueError('got "%s" when a binary operator was expected' % token)
                expect_primary = False
                yield token
            elif token == '(':
                if not expect_primary:
                    raise ValueError('got "(" when a binary operator was expected')
                stack.append(token)
            elif token == ')':
                if expect_primary:
                    raise ValueError('got ")" when a primary-expression was expected')
                # Flush operators back to the matching left parenthesis.
                while stack and stack[-1] != '(':
                    yield stack.pop()
                if not stack:
                    raise ValueError('right parenthesis with no preceding left parenthesis')
                stack.pop()
            elif expect_primary:
                # An operator in operand position must be a unary operator.
                if ('U' + token) not in PREC:
                    raise ValueError('unknown unary operator "%s"' % token)
                stack.append('U' + token)
            else:
                if token not in PREC:
                    raise ValueError('unknown binary operator "%s"' % token)
                # Pop every stacked operator that binds at least as tightly.
                while stack and has_higher_precedence(stack[-1], token):
                    yield stack.pop()
                stack.append(token)
                expect_primary = True
        while stack:
            if stack[-1] == '(':
                raise ValueError('left parenthesis with no matching right parenthesis')
            yield stack.pop()

Test with:

    while True:
        line = input()
        tokens = Lexer().tokenize((ch for ch in line))
        try:
            print(' '.join(infix_to_postfix(tokens)))
        except ValueError as e:
            print('error:', e)

For example, entering “abc*d<=-ef/g” should print “abc d * ef U- g / <=.”
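Postfix output can be hard to read at a glance. One way to check how an expression was grouped is to convert the postfix stream back into fully parenthesized infix. This is a minimal sketch; postfix_to_infix is a hypothetical helper, not part of the post's code, and it assumes the same token conventions (alphanumeric operands, unary operators prefixed with U):

```python
def postfix_to_infix(tokens):
    """Rebuild a fully parenthesized infix string from postfix tokens."""
    stack = []
    for token in tokens:
        if token.isalnum():
            stack.append(token)
        elif token.startswith('U'):
            # Unary operator: strip the 'U' prefix and wrap one operand.
            operand = stack.pop()
            stack.append('(%s%s)' % (token[1:], operand))
        else:
            # Binary operator: the right operand is on top of the stack.
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append('(%s %s %s)' % (lhs, token, rhs))
    if len(stack) != 1:
        raise ValueError('malformed postfix expression')
    return stack[0]

print(postfix_to_infix('abc d * ef U- g / <='.split()))
# ((abc * d) <= ((-ef) / g))
```

Applied to the two unary examples from earlier, “a b U- -” comes back as (a - (-b)) and “a b - U-” as (-(a - b)).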

Adding diagnostics

Since every pair of “adjoining” operators passes through has_higher_precedence, it’s simple to diagnose all the problems I listed at the top of this post. At first we might think to do it via a blacklist of known-problematic pairs:

    def has_higher_precedence(a, b):
        if a in ['U!'] and b in ['&&', '||']:
            # a[1:] strips the 'U' prefix so the message reads "(!x && y)".
            print("warning: (%sx %s y) is ambiguous" % (a[1:], b))
        elif a in ['/', '%'] and b in ['*']:
            print("warning: (x %s y %s z) is ambiguous" % (a, b))
        elif a in ['&', '^', '|'] and b in ['==', '!=', '<', '<=', '>', '>=']:
            print("warning: (x %s y %s z) is ambiguous" % (a, b))
        elif a in ['<', '<='] and b in ['<', '<=']:
            print("warning: (x %s y %s z) doesn't mean what you think" % (a, b))
        return PREC[a] >= PREC[b]

However, this leaves you vulnerable to failures of imagination. For example, it seems to me that we should also warn about x & y << z (which means x & (y << z), not (x & y) << z). Where in the maze of ifs above should that case be inserted?

A better approach is to diagnose anything that’s not on a whitelist of uncontroversially non-problematic pairs! Out of the roughly 667 operator-pairs that our has_higher_precedence might ever see, only about half of them are non-problematic in my book.
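One plausible way to arrive at that figure, assuming it ignores the left parenthesis: the first argument can be any operator in the precedence table, unary or binary, while the second is always an incoming binary operator. The names BINARY, UNARY, and pairs below are illustrative only:

```python
# The binary operators from the precedence table, lowest precedence first.
BINARY = ['||', '&&', '|', '^', '&', '==', '!=', '<', '<=', '>', '>=',
          '<=>', '<<', '>>', '+', '-', '*', '/', '%', '.*', '->*', '.', '->']
# The unary operators, encoded with the U prefix.
UNARY = ['U' + t for t in ['+', '-', '!', '~', '*', '&']]

# Left element: any operator that can sit on the stack (ignoring '(').
# Right element: any incoming binary operator.
pairs = [(a, b) for a in BINARY + UNARY for b in BINARY]
print(len(BINARY), len(UNARY), len(pairs))  # 23 6 667
```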

Some pairs are nonsensical. For example, as Jonathan noted, !p->*x has the “wrong” precedence, so that it will not type-check; therefore we can omit the pair (U!, ->*) from our whitelist. Other pairs, such as (+, &&), seem nonsensical at first glance but in fact we must whitelist them anyway. Consider the expression:

a < b + 1 && c < d

In parsing this expression we call has_higher_precedence with the following pairs: (<, +); (+, &&); (<, &&); (&&, <). If (+, &&) weren’t on our whitelist, we’d get a warning — which we don’t want! We certainly should try to diagnose “unusually typed” expressions such as b + 1 && c; but we can’t do it with this operator-precedence technique alone.
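That list of pairs can be reproduced with a stripped-down trace. This sketch handles binary operators only (no unary operators, no parentheses). DEMO_LEVELS, DEMO_PREC, and compared_pairs are names invented for this illustration, and the table is a subset of the full precedence table:

```python
# A subset of the precedence levels, lowest first (illustrative only).
DEMO_LEVELS = [['||'], ['&&'], ['==', '!='], ['<', '<=', '>', '>='],
               ['+', '-'], ['*', '/', '%']]
DEMO_PREC = {op: level for level, ops in enumerate(DEMO_LEVELS) for op in ops}

def compared_pairs(tokens):
    """Return the (stack-top, incoming) pairs a shunting-yard parse compares."""
    pairs = []
    stack = []
    for token in tokens:
        if token not in DEMO_PREC:
            continue  # operand: goes straight to the output, ignored here
        while stack:
            pairs.append((stack[-1], token))
            if DEMO_PREC[stack[-1]] < DEMO_PREC[token]:
                break  # the incoming operator binds tighter: stop popping
            stack.pop()
        stack.append(token)
    return pairs

print(compared_pairs('a < b + 1 && c < d'.split()))
# [('<', '+'), ('+', '&&'), ('<', '&&'), ('&&', '<')]
```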

However, it seems to me that we can safely omit the pair (<<, &&) from our whitelist: those two operators really should never appear next to one another. In fact I think << should always be parenthesized; I can’t think of any operator (between the extremes of . and =) where x << y @ z is both clear and useful. The useful expressions, such as x << y + 1, are not clear; and the clear expressions, such as x << y && z, are not useful.

My suggested whitelist ends up looking like this:

    UNPROBLEMATIC = sum([
        [('||',r) for r in ['||', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
        [('&&',r) for r in ['&&', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
        [(x,x) for x in '|^&'],
        [(l,r) for l in ['==', '!=', '<', '<=', '>', '>=']
               for r in ['||', '&&', '+', '-', '*', '/', '%']],
        [(l,r) for l in ['+', '-', '*']
               for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
        [('/',r) for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=', '+', '-']],
        [('%',r) for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=']],
        [(l,r) for l in ['U+', 'U-', 'U~', 'U*', 'U&']
               for r in ['||', '&&', '|', '^', '&', '==', '!=', '<', '<=', '>', '>=',
                         '<=>', '<<', '>>', '+', '-', '*', '/', '%']],
    ], [])

    def has_higher_precedence(a, b):
        if a in ['==', '!='] and (a == b):
            print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
        elif a in ['<', '<='] and b in ['<', '<=']:
            print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
        elif a in ['>', '>='] and b in ['>', '>=']:
            print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
        elif a[0] == 'U' and b in ['.*', '->*']:
            print('warning: (%sx%sy) means (%sx)%sy' % (a[1:], b, a[1:], b))
        elif a in ['(', '.', '->', '.*', '->*'] or b in ['.', '->', '.*', '->*']:
            pass  # not problematic
        elif (a,b) in UNPROBLEMATIC:
            pass  # not problematic
        elif a[0] == 'U':
            print('warning: (%sx %s y) is ambiguous; consider adding parentheses' % (a[1:], b))
        else:
            print('warning: (x %s y %s z) is ambiguous; consider adding parentheses' % (a, b))
        return PREC[a] >= PREC[b]

Test it against some of Jonathan’s exotic expressions and it spews warnings — but I don’t think any of these warnings are “wrong”!

    a & b + c * d && e ^ f == 7
    warning: (x & y + z) is ambiguous; consider adding parentheses
    warning: (x & y && z) is ambiguous; consider adding parentheses
    warning: (x && y ^ z) is ambiguous; consider adding parentheses
    warning: (x ^ y == z) is ambiguous; consider adding parentheses
    a b c d * + & e f 7 == ^ &&

    arr + 32 < ~a | b
    warning: (x < y | z) is ambiguous; consider adding parentheses
    arr 32 + a U~ < b |

    !x && y
    warning: (!x && y) is ambiguous; consider adding parentheses
    x U! y &&

Get the full Python code here.