chrisaycock

Computed gotos make virtual machines faster at dispatching instructions.

Consider the "normal" way of dispatching via a looped switch statement:

while (true) {
    switch (code[pc++]) {    // dispatch next opcode
    case OP_ADD: add(); break;
    case OP_SUB: sub(); break;
    case OP_MUL: mul(); break;
    case OP_DIV: div(); break;
    }
}

This iterates over a program counter within the user's bytecode and invokes arithmetic functions based on the observed opcode.

The switch statement is, ultimately, the single point of dispatch for all opcodes. That means a CPU's branch predictor can't really determine a pattern to the next opcode: over the past few handled instructions within a user's bytecode, the predictor will have seen multiple different opcodes, so its prediction for the next switch target cannot be focused onto a single opcode. The result is that there will be no (correct) branch prediction!

Fortunately there is a non-standard extension to C/C++ that some compilers implement. Computed gotos allow programs to reference labels directly:

// opcodes must be in consecutive order for this array to be valid
void* table[] = {&&op_add, &&op_sub, &&op_mul, &&op_div};

// start the first dispatch
goto *table[code[pc++]];

op_add: add(); goto *table[code[pc++]];  // dispatch next opcode
op_sub: sub(); goto *table[code[pc++]];  // dispatch next opcode
op_mul: mul(); goto *table[code[pc++]];  // dispatch next opcode
op_div: div(); goto *table[code[pc++]];  // dispatch next opcode

Here there are multiple points of dispatch because each opcode handler dispatches to the next opcode in the user's bytecode. If the user's bytecode is part of a tight loop, the branch predictor will see a pattern in which the current opcode depends solely on the previous one. Prediction is therefore more focused (each handler's dispatch jump gets its own history), so there will be a (likely correct) branch prediction!

This technique is sometimes referred to as indirect threading (not to be confused with multithreading).

CPython's virtual machine will use computed gotos if possible, noting:

At the time of this writing, the "threaded code" version is up to 15-20% faster than the normal "switch" version, depending on the compiler and the CPU architecture.
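CPython exposes this choice at build time. The following is a sketch assuming a current CPython source tree; the flag name comes from CPython's configure script, and the option is on by default when the compiler supports labels-as-values:

```shell
# Build CPython with the computed-goto ("threaded code") dispatch loop.
# Pass --without-computed-gotos instead to force the plain switch version.
./configure --with-computed-gotos
make
```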
