
To measure the cost of branch prediction, I created a little program.

It creates a small buffer on the stack and fills it with random 0/1 values. The size of the buffer is set with N. The code then repeatedly branches on the same 1<<N random numbers.

I expected that if 1<<N is sufficiently large (say, >100), the branch predictor would not be effective, as it would have to predict >100 random outcomes. However, these are the results (on a 5820K machine): as N grows, the program becomes slower:

     N   time
    =========
     8    2.2
     9    2.2
    10    2.2
    11    2.2
    12    2.3
    13    4.6
    14    9.5
    15   11.6
    16   12.7
    20   12.9

For reference, if the buffer is initialized with zeros (use the commented-out init), the time is more or less constant: it varies between 1.5 and 1.7 for N = 8..16.

My question is: can the branch predictor really be effective at predicting such a large amount of random numbers? If not, then what's going on here?

(Some more explanation: the code executes 2^32 branches regardless of N. So I expected the code to run at the same speed for every N, because the branch cannot be predicted at all. But it seems that if the buffer size is at most 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)
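(For example, with N=12 the outer loop runs 2^(32-12) = 2^20 times and the inner loop 2^12 times, so the if (buffer[j]) branch executes 2^20 * 2^12 = 2^32 times, exactly as for every other N.)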

Here's the code:

    #include <cstdint>
    #include <iostream>

    volatile uint64_t init[2] = { 314159165, 27182818 };
    // volatile uint64_t init[2] = { 0, 0 };
    volatile uint64_t one = 1;

    // simple xorshift-style PRNG
    uint64_t next(uint64_t s[2]) {
        uint64_t s1 = s[0];
        uint64_t s0 = s[1];
        uint64_t result = s0 + s1;
        s[0] = s0;
        s1 ^= s1 << 23;
        s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
        return result;
    }

    int main() {
        uint64_t s[2];
        s[0] = init[0];
        s[1] = init[1];

        uint64_t sum = 0;

    #if 1
        const int N = 16;
        unsigned char buffer[1 << N];

        // fill the buffer with random 0/1 values
        for (int i = 0; i < 1 << N; i++) buffer[i] = next(s) & 1;

        // 2^32 branches on buffer[j] in total, independently of N
        for (uint64_t i = 0; i < uint64_t(1) << (32 - N); i++) {
            for (int j = 0; j < 1 << N; j++) {
                if (buffer[j]) {
                    sum += one;
                }
            }
        }
    #else
        // non-buffered version: branch on a fresh random bit every time
        for (uint64_t i = 0; i < uint64_t(1) << 32; i++) {
            if (next(s) & 1) {
                sum += one;
            }
        }
    #endif

        std::cout << sum << "\n";
    }

(The code contains a non-buffered version as well; switch to it with #if 0. It runs at around the same speed as the buffered version with N=16.)

Here's the disassembly of the inner loop (compiled with clang; it generates the same code for every N between 8 and 16, only the loop count differs; clang unrolled the loop by a factor of two):

    401270: 80 3c 0c 00             cmp    BYTE PTR [rsp+rcx*1],0x0
    401274: 74 07                   je     40127d <main+0xad>
    401276: 48 03 35 e3 2d 00 00    add    rsi,QWORD PTR [rip+0x2de3]   # 404060 <one>
    40127d: 80 7c 0c 01 00          cmp    BYTE PTR [rsp+rcx*1+0x1],0x0
    401282: 74 07                   je     40128b <main+0xbb>
    401284: 48 03 35 d5 2d 00 00    add    rsi,QWORD PTR [rip+0x2dd5]   # 404060 <one>
    40128b: 48 83 c1 02             add    rcx,0x2
    40128f: 48 81 f9 00 00 01 00    cmp    rcx,0x10000
    401296: 75 d8                   jne    401270 <main+0xa0>
  • Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history. Commented Mar 28, 2019 at 12:35
  • I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16. Commented Mar 28, 2019 at 12:40
  • In general, the first time code is executed the branch prediction rate is "less good" because there's no history, and there's no point executing code twice if nothing changed (you can store the results from last time), so the "excessively happy case" where the CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation. Commented Mar 31, 2019 at 13:56
  • @Brendan: Yes. But this question is about whether predicting 4096 random outcomes really is an "excessively happy case". To me it seemed very unlikely (that's why I didn't bother to check with perf stat; if I had, this question wouldn't exist). But as it turned out, it really is the case: current CPUs' branch predictors are so good that they can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors were "strongly/weakly" * "taken/not taken". Now they can do much, much more. Commented Mar 31, 2019 at 14:06
  • @Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And in response to your first comment, "there's no point executing code twice if nothing changed (you can store the results from last time)": that's wrong. Note that here only the branch pattern is the same; the data can differ (while following the same path), just like when an interpreter runs byte code. But anyway, this question was about understanding the results of a benchmark, not about whether it's realistic or not. Commented Apr 1, 2019 at 9:48

1 Answer


Branch prediction can indeed be that effective. As Peter Cordes suggested, I've checked branch-misses with perf stat. Here are the results:

     N  time      cycles        branch-misses    (%)    approx-time
    ================================================================
     8   2.2    9,084,889,375          34,806  ( 0.00)      2.2
     9   2.2    9,212,112,830          39,725  ( 0.00)      2.2
    10   2.2    9,264,903,090       2,394,253  ( 0.06)      2.2
    11   2.2    9,415,103,000       8,102,360  ( 0.19)      2.2
    12   2.3    9,876,827,586      27,169,271  ( 0.63)      2.3
    13   4.6   19,572,398,825     486,814,972  (11.33)      4.6
    14   9.5   39,813,380,461   1,473,662,853  (34.31)      9.5
    15  11.6   49,079,798,916   1,915,930,302  (44.61)     11.7
    16  12.7   53,216,900,532   2,113,177,105  (49.20)     12.7
    20  12.9   54,317,444,104   2,149,928,923  (50.06)     12.9

    Note: branch-misses (%) is calculated for 2^32 branches
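(For anyone who wants to reproduce this: an invocation along these lines collects the counters above, where ./a.out stands in for whatever the compiled binary is called:

    perf stat -e cycles,branches,branch-misses ./a.out
)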

As you can see, when N<=12, the branch predictor can predict most of the branches correctly (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, the number of branch misses starts to grow. At N>=16, it only predicts ~50% correctly, which means it is as effective as random coin flips.

The time taken can be approximated from the branch-misses (%) column: I've added the last column, approx-time, for this. It interpolates linearly between the fully predicted time (2.2 s) and the fully mispredicted time (12.9 s, reached at the ~50% miss rate of a completely unpredictable branch): approx-time = 2.2 + (12.9-2.2) * branch-misses % / 50.06. As you can see, approx-time matches time (up to rounding error). So this effect can be explained entirely by branch prediction.
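For example, for N=14: 2.2 + (12.9 - 2.2) * 34.31 / 50.06 ≈ 2.2 + 7.3 ≈ 9.5, which matches the measured 9.5 seconds.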

The original intent was to calculate how many cycles a branch miss costs (in this particular case; for other cases this number can differ):

(54,317,444,104 - 9,084,889,375) / (2,149,928,923 - 34,806) = 21.039 ≈ 21 cycles

15 Comments

The branch misprediction penalty cannot be characterized by a single number, because it depends on how much time it takes to resteer the frontend and how much pending work is still in flight in the RS (reservation station) before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop.
@HadiBrais: Thanks for your comment. Yes, the cost of a branch miss depends on a lot of things. I was interested in an approximate value, for example how it relates to the cost of a floating-point division: which is faster, using a hard-to-predict branch or an fp division? Yep, I didn't consider the mispredictions of the last iteration, because they don't influence the result too much (less than 1% for the N=8 case). I've edited my answer a little to say that the calculated cost is for this particular case only.
Well, the latency of division also varies significantly depending on the input operands. The cost of a misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it, following that definition, is to compare the execution time against a loop nest with the same number of inner and outer iterations but where the condition if (buffer[j]) is always true (and therefore easily predicted)...
...This allows you to estimate the cost of a single inner iteration where if (buffer[j]) is correctly predicted. Multiply this by the number of correct predictions of if (buffer[j]) and subtract the result from the total execution time. What remains is the sum of the costs of all mispredictions. Finally, divide this quantity by the number of times the branch if (buffer[j]) was mispredicted. The result is the average cost of mispredicting if (buffer[j]).
@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? float vs double, or something else? I've calculated the cost the way you say, and I got ~22 cycles (22.074).
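For reference, a minimal sketch of the always-taken baseline described in the comments above (an assumption about how such a baseline could look, not the exact code anyone in this thread ran): the loop structure and iteration counts stay the same, only the buffer is filled with ones, read through the volatile one so the compiler cannot prove the branch away:

    #include <cstdint>
    #include <iostream>

    volatile uint64_t one = 1;

    int main() {
        const int N = 16;
        unsigned char buffer[1 << N];

        // Fill with 1s, read through the volatile so the compiler cannot
        // constant-fold the branch away; the branch is trivially predicted.
        for (int i = 0; i < 1 << N; i++) buffer[i] = (unsigned char)one;

        uint64_t sum = 0;
        // Same loop nest as the benchmark: 2^32 executions of the branch in total.
        for (uint64_t i = 0; i < uint64_t(1) << (32 - N); i++) {
            for (int j = 0; j < 1 << N; j++) {
                if (buffer[j]) {
                    sum += one;
                }
            }
        }
        std::cout << sum << "\n";
    }

Timing this baseline, multiplying the per-iteration cost by the number of correct predictions in the original run, subtracting that from the total execution time, and dividing by the misprediction count gives the average per-misprediction cost; measured that way it comes out to ~22 cycles here, as mentioned in the comment above.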
