
I've just stumbled upon this, and I'm really curious whether modern CPUs (current desktop ones, and maybe mobile/embedded ones as well) actually pay a branching cost in the situation below.

1. Let's say we have this:

x += a; // let's assume they are both declared earlier as simple ints
if (flag)
    do A    // let's assume A is not the same as B
else
    do B    // and of course B is different from A

2. Compared to this:

if (flag) {
    x += a
    do A
} else {
    x += a
    do B
}

Assuming A and B are completely different in terms of pipelined instructions (fetch, decode, execute, etc.):

  1. Is the second approach going to be faster?

  2. Are CPUs smart enough to tell that, no matter what the flag is, the next instruction is the same (so they won't have to discard pipeline stages for it because of a branch misprediction)?

Note:

In the first case the CPU has no option but to discard the first few pipeline stages of do A or do B if a branch misprediction happened, because they are different. I see the second example as a kind of delayed branching: "I'm going to check that flag, but even before I know it I can get on with the next instruction, because it's the same no matter what the flag is. I already have the next instruction and it's OK for me to use it."

EDIT:
I did some research and I have some interesting results. How would you explain this behavior? Sorry for my latest edit, but as far as I could tell I had some cache problems earlier; these results and code samples should be more accurate.

Here is the code, compiled with gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) using -O3.

Case 1.

#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;

extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;
        *x += *a;
        if (*b)
        {
            A();
        }
        else
        {
            B();
        }
    }

    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;

    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }

Case 2

#include <stdio.h>

extern int * cache;
extern bool * b;
extern int * x;
extern int * a;
extern unsigned long * loop;

extern void A();
extern void B();

int main()
{
    for (unsigned long i = 0; i < *loop; ++i)
    {
        ++*cache;
        if (*b)
        {
            *x += *a;
            A();
        }
        else
        {
            *x += *a;
            B();
        }
    }

    delete b;
    delete x;
    delete a;
    delete loop;
    delete cache;

    return 0;
}

int * cache = new int(0);
bool * b = new bool(true);
int * x = new int(0);
int * a = new int(0);
unsigned long * loop = new unsigned long(0x0ffffffe);

void A() { --*x; *b = false; }
void B() { ++*x; *b = true; }

There is practically no noticeable difference between the -O3 versions of the two approaches, but without -O3 the second case does run slightly faster, at least on my machine. I tested without -O3 and with loop = 0xfffffffe.
Best times:
alin@ubuntu:~/Desktop$ time ./1

real 0m20.231s
user 0m20.224s
sys 0m0.020s

alin@ubuntu:~/Desktop$ time ./2

real 0m19.932s
user 0m19.890s
sys 0m0.060s

  • Such things are generally optimized by compilers, not at execution/CPU level. Commented Sep 27, 2015 at 9:58
  • 3
    I suspect compiler optimizer would do its job and factor that out to yield same code. Commented Sep 27, 2015 at 9:58
  • PS: thank you for the code edit (it's my very first post, sorry about that). So in other words, I could write case 2 as case 1 and trust the compiler to notice this? Commented Sep 27, 2015 at 10:07
  • @Calvin Factoring out the common code would defeat the optimization attempt. Commented Sep 27, 2015 at 10:12
  • 1
    @AlinIonutLipan: I haven't seen compilers on x86 machines doing this (transform case 1 to case 2,) but I have seen thin on RISC machines decades ago (but not exactly like this.) And that indeed was being done by the compiler. Generally speaking, you can not depend on compiler optimization too much, but this one is a relatively simple and obvious pinhole optimization. I'd recommend always writing case 1 though, as it is easier for the compiler to do. Commented Sep 27, 2015 at 10:17

2 Answers


There are two parts to this:

First, does the compiler optimize this?

Let's run an experiment:

test.cc

#include <random>
#include "test2.h"

int main()
{
    std::default_random_engine e;
    std::uniform_int_distribution<int> d(0, 1);
    int flag = d(e);

    int x = 0;
    int a = 1;

    if (flag)
    {
        x += a;
        doA(x);
        return x;
    }
    else
    {
        x += a;
        doB(x);
        return x;
    }
}

test2.h

void doA(int& x);
void doB(int& x);

test2.cc

void doA(int& x) {}
void doB(int& x) {}

test2.cc and test2.h both exist solely to prevent the compiler from optimizing away everything. The compiler cannot be certain that there isn't a side effect because these functions exist in another translation unit.

Now we compile to assembly:

gcc -std=c++11 -S test.cc 

And let's jump to the part of the assembly that's interesting:

call _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_
movl %eax, -40(%rbp)  ; <- setting flag
movl $0, -44(%rbp)    ; <- setting x
movl $1, -36(%rbp)    ; <- setting a
cmpl $0, -40(%rbp)    ; <- first part of if (flag)
je .L2                ; <- second part of if (flag)
movl -44(%rbp), %edx  ; <- setting up x
movl -36(%rbp), %eax  ; <- setting up a
addl %edx, %eax       ; <- adding x and a
movl %eax, -44(%rbp)  ; <- assigning back to x
leaq -44(%rbp), %rax  ; <- grabbing address of x
movq %rax, %rdi       ; <- bookkeeping for function call
call _Z3doARi         ; <- function call doA
movl -44(%rbp), %eax
jmp .L4
.L2:
movl -44(%rbp), %edx  ; <- setting up x
movl -36(%rbp), %eax  ; <- setting up a
addl %edx, %eax       ; <- perform the addition
movl %eax, -44(%rbp)  ; <- move it back to x
leaq -44(%rbp), %rax  ; <- and so on
movq %rax, %rdi
call _Z3doBRi
movl -44(%rbp), %eax
.L4:

So we can see that the compiler didn't optimize it. But we also didn't actually ask it to.

g++ -std=c++11 -S -O3 test.cc 

and then the interesting assembly:

main:
.LFB4729:
    .cfi_startproc
    subq $56, %rsp
    .cfi_def_cfa_offset 64
    leaq 32(%rsp), %rdx
    leaq 16(%rsp), %rsi
    movq $1, 16(%rsp)
    movq %fs:40, %rax
    movq %rax, 40(%rsp)
    xorl %eax, %eax
    movq %rdx, %rdi
    movl $0, 32(%rsp)
    movl $1, 36(%rsp)
    call _ZNSt24uniform_int_distributionIiEclISt26linear_congruential_engineImLm16807ELm0ELm2147483647EEEEiRT_RKNS0_10param_typeE
    testl %eax, %eax
    movl $1, 12(%rsp)
    leaq 12(%rsp), %rdi
    jne .L83
    call _Z3doBRi
    movl 12(%rsp), %eax
.L80:
    movq 40(%rsp), %rcx
    xorq %fs:40, %rcx
    jne .L84
    addq $56, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 8
    ret
.L83:
    .cfi_restore_state
    call _Z3doARi
    movl 12(%rsp), %eax
    jmp .L80

This is a bit beyond my ability to cleanly show a one-to-one relationship between the assembly and the code, but you can tell from the calls to doA and doB that the setup is all common and done outside of the if statement (above the jne .L83 line). So yes, compilers do perform this optimization.

Part 2:

How can we know if CPUs do this optimization if given the first code?

I'm actually not aware of a way to test this. So I do not know. I'd rate it as plausible given that out of order and speculative execution exists. But the proof is in the pudding, and I don't have a way of testing this pudding. So I'm reluctant to make a claim one way or another.


8 Comments

The same explanation with equivalent C code would be less confusing.
The only real differences would be the lack of name mangling and the differently named random-function calls. This is fine IMO. I skipped over most of the setup in both cases.
Thank you for your answer, and yes, I understand that we should always write case 1 with no fuss. I was wondering whether it is possible for case 2 to be faster than case 1 (let's assume the compiler knows nothing about the values, that we had pointers all over the place, and that the compiler can't know the side effects just yet). Without knowing those, how could it possibly optimize case 1? I'm going to do some testing myself and see whether case 2 can be any faster, and if so, by how much.
I've only tested case 2 to show that it will compile to something semantically equivalent to case 1. With the limited example you gave, I can't see how case 2 could possibly be faster than case 1 (only equal to it). Perhaps you can give more detail?
That's what I mean: the name mangling is confusing to non-C++ programmers, the question being tagged C as well; flag = rand(); would be simple enough.

Back in the day CPUs explicitly supported something a bit like this - after a branch instruction the next instruction would always be executed whether or not the branch was actually taken (look up "branch delay slot").

I'm pretty sure modern CPUs just dump the whole pipeline on a branch mispredict. There's no point attempting to do the optimisation you suggest at execution time when the compiler can easily do it at compilation time.

3 Comments

Ah, I was just trying to remember the name "delay slot" to post almost exactly the same answer as yours. :D
Thank you, I didn't know about the delay slot; that seems to be exactly the info I was missing :) So I see no point, then, in writing the unclean case 2.
Write whatever is clearest in the circumstances - which will usually be 1.
