I am trying to understand what kind of magic optimizations gcc/clang does with this code.

```cpp
#include <random>
#include <iostream>

int main() {
    std::random_device rd;
    std::mt19937 mt(rd());

    const unsigned arraySize = 100000;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = mt() % 256;

    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i) {
        for (unsigned c = 0; c < arraySize; ++c) {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    std::cout << sum << std::endl;
}
```

and this code

```cpp
#include <random>
#include <iostream>
#include <algorithm>

int main() {
    std::random_device rd;
    std::mt19937 mt(rd());

    const unsigned arraySize = 100000;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = mt() % 256;

    std::sort(data, data + arraySize);

    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i) {
        for (unsigned c = 0; c < arraySize; ++c) {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    std::cout << sum << std::endl;
}
```

Basically, when I compiled and ran this about 3 years ago, the second version was roughly 4x faster because of much better branch prediction. When I compile and run it now, both versions take almost the same time, and I have no idea what kind of sorcery gcc/clang is doing.

  • Have you looked at the assembler output (fancy posting it?) on gcc.godbolt.org? I'm also assuming you're comparing apples to apples: have you tried the output from the old and new compilers on the same hardware with the same libraries? Commented Mar 16, 2015 at 11:49
  • What flags did you use when compiling both? Commented Mar 16, 2015 at 11:50
  • In the second example you are sorting the array, so once you start hitting values above 127 you will always be executing the `+=`. This should stay cached since you do it on every iteration. I have read a very nice tutorial that explains the benefits of keeping memory and operations "hot" to improve performance. Commented Mar 16, 2015 at 12:19
  • @user657267 -std=c++11 -O3 Commented Mar 16, 2015 at 12:39
  • A common misconception is that, this is branch predication, not prediction. Commented Mar 16, 2015 at 12:44

1 Answer


Here is the output from gcc (via gcc.godbolt.org, compiled with -O3):

```asm
.L4:                        # inner loop
        movslq  (%rax), %rdx
        movq    %rdx, %rcx
        addq    %rsi, %rdx
        cmpl    $127, %ecx
        cmovg   %rdx, %rsi
        addq    $4, %rax
        cmpq    %rdi, %rax
        jne     .L4
```

You can see it performs the comparison (`cmpl $127, %ecx`), but after the comparison it does not branch. Instead, it always performs the addition (the `addq` just above the comparison) and then keeps or discards the result depending on the comparison, thanks to the `cmovg` (conditional move) instruction.

Because the inner loop contains no branch, its performance does not depend on branch prediction. That is why sorting the input (as in your second example) makes no difference anymore.
