Computed gotos make virtual machines faster at dispatching instructions.
Consider the "normal" way of dispatching via a looped switch statement:
while (true) { switch (code[pc++]) { // dispatch next opcode case OP_ADD: add(); break; case OP_SUB: sub(); break; case OP_MUL: mul(); break; case OP_DIV: div(); break; } } This iterates over a program counter within the user's bytecode and invokes arithmetic functions based on the observed opcode.
The switch statement is, ultimately, the single point of dispatch for all opcodes. That means that a CPU's branch predictor can't really determine a pattern to the next opcode. Consider that over the past few handled instructions within a user's bytecode, the branch predictor will have seen multiple different opcodes. Thus, the prediction for the next switch target cannot be focused onto a single opcode, which means that there will be no (correct) branch prediction!
Fortunately there is a non-standard extension to C/C++ that some compilers implement. Computed gotos allow programs to reference labels directly:
// opcodes must be in consecutive order for this array to be valid void* table[] = {&&op_add, &&op_sub, &&op_mul, &&op_div}; // start the first dispatch goto *table[code[pc++]]; op_add: add(); goto *table[code[pc++]]; // dispatch next opcode op_sub: sub(); goto *table[code[pc++]]; // dispatch next opcode op_mul: mul(); goto *table[code[pc++]]; // dispatch next opcode op_div: div(); goto *table[code[pc++]]; // dispatch next opcode Here there are multiple points of dispatch because each opcode handler dispatches to the next opcode in the user's bytecode. If the user's bytecode is part of a tight loop, then the branch predictor will see a pattern of the current opcode based solely on the previous opcode. That means prediction is more focused (the current opcode determines the next opcode), which means there will be a (likely correct) branch prediction!
This technique is sometimes referred to as direct threadingindirect threading (not to be confused with multithreading).
CPython's virtual machine will use computed gotos if possible, noting:
At the time of this writing, the "threaded code" version is up to 15-20% faster than the normal "switch" version, depending on the compiler and the CPU architecture.