
I hope that I have reduced my question to a simple and reproducible test case. The source (which is here) contains 10 copies of an identical simple loop. Each loop is of the form:

#define COUNT (1000 * 1000 * 1000)

volatile uint64_t counter = 0;

void loopN(void) {
    for (int j = COUNT; j != 0; j--) {
        uint64_t val = counter;
        val = val + 1;
        counter = val;
    }
    return;
}

The 'volatile' on the variable is important, as it forces the value to be read from and written to memory on each iteration. Each loop is aligned to 64 bytes using '-falign-loops=64' and produces identical assembly except for the relative offset to the global:

400880:  48 8b 15 c1 07 20 00    mov    0x2007c1(%rip),%rdx    # 601048 <counter>
400887:  48 83 c2 01             add    $0x1,%rdx
40088b:  83 e8 01                sub    $0x1,%eax
40088e:  48 89 15 b3 07 20 00    mov    %rdx,0x2007b3(%rip)    # 601048 <counter>
400895:  75 e9                   jne    400880 <loop8+0x20>
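(For contrast, here is a sketch of the same loop without the volatile; counter_nv and loopN_nv are hypothetical names, and the assumption is that GCC at -O3 folds the whole loop into a single addition, leaving no per-iteration memory traffic to time:)

#include <stdint.h>

#define COUNT (1000 * 1000 * 1000)

uint64_t counter_nv = 0;            /* note: no 'volatile' */

void loopN_nv(void) {
    /* Without volatile, GCC -O3 collapses the loop into a single
       addition (effectively counter_nv += COUNT), so the per-iteration
       load/store disappears and there is nothing left to measure. */
    for (int j = COUNT; j != 0; j--) {
        counter_nv = counter_nv + 1;
    }
}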

I'm running Linux 3.11 on an Intel Haswell i7-4470. I'm compiling the program with GCC 4.8.1 and the command line:

 gcc -std=gnu99 -O3 -falign-loops=64 -Wall -Wextra same-function.c -o same-function 

I'm also using __attribute__((noinline)) within the source to make the assembly clearer, but this is not necessary to observe the issue. I find the fastest and slowest functions with a shell loop:

for n in 0 1 2 3 4 5 6 7 8 9; do
    echo same-function ${n}:
    /usr/bin/time -f "%e seconds" same-function ${n}
    /usr/bin/time -f "%e seconds" same-function ${n}
    /usr/bin/time -f "%e seconds" same-function ${n}
done

It produces results that are consistent to about 1% from run to run, with exactly which functions are fastest and slowest varying depending on the exact binary layout:

same-function 0:
2.08 seconds
2.04 seconds
2.06 seconds
same-function 1:
2.12 seconds
2.12 seconds
2.12 seconds
same-function 2:
2.10 seconds
2.14 seconds
2.11 seconds
same-function 3:
2.04 seconds
2.04 seconds
2.05 seconds
same-function 4:
2.05 seconds
2.00 seconds
2.03 seconds
same-function 5:
2.07 seconds
2.07 seconds
1.98 seconds
same-function 6:
1.83 seconds
1.83 seconds
1.83 seconds
same-function 7:
1.95 seconds
1.98 seconds
1.95 seconds
same-function 8:
1.86 seconds
1.88 seconds
1.86 seconds
same-function 9:
2.04 seconds
2.04 seconds
2.02 seconds

In this case, we see that loop2() is one of the slowest to execute and loop6() is one of the fastest, with a difference of just over 10%. We reconfirm this by testing just these two cases repeatedly with a different method:

nate@haswell$ N=2; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
     7,180,104,866 cycles                    #    3.391 GHz
     7,169,930,711 cycles                    #    3.391 GHz
     7,150,190,394 cycles                    #    3.391 GHz
     7,188,959,096 cycles                    #    3.391 GHz
     7,177,272,608 cycles                    #    3.391 GHz
     7,093,246,955 cycles                    #    3.391 GHz
     7,210,636,865 cycles                    #    3.391 GHz
     7,239,838,211 cycles                    #    3.391 GHz
     7,172,716,779 cycles                    #    3.391 GHz
     7,223,252,964 cycles                    #    3.391 GHz
nate@haswell$ N=6; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
     6,234,770,361 cycles                    #    3.391 GHz
     6,199,096,296 cycles                    #    3.391 GHz
     6,213,348,126 cycles                    #    3.391 GHz
     6,217,971,263 cycles                    #    3.391 GHz
     6,224,779,686 cycles                    #    3.391 GHz
     6,194,117,897 cycles                    #    3.391 GHz
     6,225,259,274 cycles                    #    3.391 GHz
     6,244,391,509 cycles                    #    3.391 GHz
     6,189,972,381 cycles                    #    3.391 GHz
     6,205,556,306 cycles                    #    3.391 GHz

Considering this confirmed, we re-read every word in every Intel architectural manual ever written, sift through every page on the entire web that mentions the words 'computer' or 'programming', and meditate in isolation on a mountaintop for 6 years. Failing to achieve any sort of enlightenment, we come down to civilization, shave our beard, take a bath, and ask the experts of StackOverflow:

What can possibly be happening here?

Edit: With Benjamin's help (see his answer below), I've come up with an even more succinct test case: a standalone 20 lines of assembly. Changing from SUB to SBB causes a 15% difference in performance, even though the result remains the same and the same number of instructions is executed. Explanations? I think I'm getting closer to one.

; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; On Haswell and Sandy Bridge, observed runtime varies
; ~15% depending on whether sub or sbb is used in the loop

section .text
global _start

_start:
    push qword 0h        ; put counter variable on stack
    jmp loop             ; jump to function
    align 64             ; function alignment.
loop:
    mov rcx, 1000000000
    align 64             ; loop alignment.
l:
    mov rax, [rsp]
    add rax, 1h
    mov [rsp], rax
;   sbb rcx, 1h          ; which is faster: sbb or sub?
    sub rcx, 1h          ; switch, time it, and find out
    jne l                ; (rot13 spoiler: foo vf snfgre ol 15%)
fin:                     ; If that was too easy, explain why.
    mov eax, 60
    xor edi, edi         ; End of program. Exit with code 0
    syscall
  • How are you accounting for background processes interrupting your program? (I'm not very familiar with Linux, but on Windows this is a huge issue when trying to time simple but long-running code.) When the thread goes to sleep, it spends an unknown time interval not running while the timer is still - likely - ticking. Commented Oct 8, 2014 at 21:45
  • This does not appear to be an issue with processor contention. That might account for the variation in times among runs for the same N, but it does not account for the consistent, statistically significant difference in runtimes for different N. Commented Oct 8, 2014 at 21:59
  • What is the address alignment for, say, loop6 and loop1 - are they both 32-byte aligned? Do they have the same overall alignment? Commented Oct 8, 2014 at 22:01
  • @NathanKurz Have you also disabled turbo boost? You'll want to ensure that the CPU isn't dynamically scaling the frequency. Also, did you try changing the loop execution order? Commented Oct 8, 2014 at 22:04
  • If you're executing the process 10 times (i.e., the program runs the loop once every time), then the answer is cache state. If you're repeating the loop within the same process (i.e., a single program executes the loop 10 times), then the answer is cache state and branch prediction. Commented Oct 8, 2014 at 22:34

2 Answers


Looking at the full perf stat output, you will see that it is not the number of instructions that varies but the number of stalled cycles.

Looking at the disassembly, I have found two things:

  1. The offset to the counter variable varies between the functions. Making counter local to each function did not make the behaviour go away, though.
  2. The functions are not placed on 64-byte boundaries, so they may span different numbers of cache lines. Compiling with -falign-functions=64 did make the differences go away almost completely (a quick way to check the alignment is sketched below).
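For reference, a minimal sketch of one way to check the alignment from within a test program (assuming the loop0/loop6 symbols from same-function.c are linked in; eu-readelf -s or objdump -d works just as well):

#include <stdio.h>
#include <stdint.h>

extern void loop0(void);   /* functions from same-function.c */
extern void loop6(void);

int main(void) {
    /* Print each function's address and its offset within a 64-byte line;
       a nonzero "mod 64" means the function is not 64-byte aligned. */
    printf("loop0: %p (mod 64 = %u)\n", (void *)loop0,
           (unsigned)((uintptr_t)loop0 % 64));
    printf("loop6: %p (mod 64 = %u)\n", (void *)loop6,
           (unsigned)((uintptr_t)loop6 % 64));
    return 0;
}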

Doing the test with the aforementioned changes on my machine then yields:

for f in $( seq 7); do perf stat -e cycles -r3 ./same-function $f 2>&1; done|grep cycles
     6,070,933,420 cycles                    ( +-  0.11% )
     6,052,771,142 cycles                    ( +-  0.06% )
     6,099,676,333 cycles                    ( +-  0.07% )
     6,092,962,697 cycles                    ( +-  0.16% )
     6,151,861,993 cycles                    ( +-  0.69% )
     6,074,323,033 cycles                    ( +-  0.36% )
     6,174,434,653 cycles                    ( +-  0.65% )

I don't know more about the nature of the stalls you found, though.

EDIT: I have made counter a local volatile variable in each function, tested different compilations on my i7-3537U, and found '-falign-loops=64' actually to be the slowest:

$ gcc -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function
$ gcc -falign-loops=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-l64
$ gcc -falign-functions=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-f64
$ for prog in same-function{,-l64,-f64}; do echo $prog; for f in $(seq 7); do perf stat -e cycles -r10 ./$prog $f 2>&1; done|grep cycl; done
same-function
     6,079,966,292 cycles                    ( +-  0.19% )
     7,419,053,569 cycles                    ( +-  0.07% )
     6,136,061,105 cycles                    ( +-  0.27% )
     7,282,434,896 cycles                    ( +-  0.74% )
     6,104,866,406 cycles                    ( +-  0.16% )
     7,342,985,942 cycles                    ( +-  0.52% )
     6,208,373,040 cycles                    ( +-  0.50% )
same-function-l64
     7,336,838,175 cycles                    ( +-  0.46% )
     7,358,913,923 cycles                    ( +-  0.52% )
     7,412,570,515 cycles                    ( +-  0.38% )
     7,435,048,756 cycles                    ( +-  0.10% )
     7,404,834,458 cycles                    ( +-  0.34% )
     7,291,095,582 cycles                    ( +-  0.99% )
     7,312,052,598 cycles                    ( +-  0.95% )
same-function-f64
     6,103,059,996 cycles                    ( +-  0.12% )
     6,116,601,533 cycles                    ( +-  0.29% )
     6,120,841,824 cycles                    ( +-  0.18% )
     6,114,278,098 cycles                    ( +-  0.09% )
     6,105,938,586 cycles                    ( +-  0.14% )
     6,101,672,717 cycles                    ( +-  0.19% )
     6,121,339,944 cycles                    ( +-  0.11% )

More details on align-loops vs align-functions:

$ for prog in same-function{-l64,-f64}; do sudo perf stat -d -r10 ./$prog 0; done

 Performance counter stats for './same-function-l64 0' (10 runs):

       2396.608194       task-clock:HG (msec)        #    1.001 CPUs utilized            ( +-  0.64% )
                56       context-switches:HG         #    0.024 K/sec                    ( +-  5.51% )
                 1       cpu-migrations:HG           #    0.000 K/sec                    ( +- 74.78% )
                46       page-faults:HG              #    0.019 K/sec                    ( +-  0.63% )
     7,331,450,530       cycles:HG                   #    3.059 GHz                      ( +-  0.51% ) [85.68%]
     5,332,248,218       stalled-cycles-frontend:HG  #   72.73% frontend cycles idle     ( +-  0.71% ) [71.42%]
   <not supported>       stalled-cycles-backend:HG
     5,000,800,933       instructions:HG             #    0.68  insns per cycle
                                                     #    1.07  stalled cycles per insn  ( +-  0.04% ) [85.73%]
     1,000,446,303       branches:HG                 #  417.443 M/sec                    ( +-  0.04% ) [85.75%]
             8,461       branch-misses:HG            #    0.00% of all branches          ( +-  6.05% ) [85.76%]
   <not supported>       L1-dcache-loads:HG
            45,593       L1-dcache-load-misses:HG    #    0.00% of all L1-dcache hits    ( +-  3.61% ) [85.77%]
             6,148       LLC-loads:HG                #    0.003 M/sec                    ( +-  8.80% ) [71.36%]
   <not supported>       LLC-load-misses:HG

       2.394456699 seconds time elapsed                                          ( +-  0.64% )

 Performance counter stats for './same-function-f64 0' (10 runs):

       1998.936383       task-clock:HG (msec)        #    1.001 CPUs utilized            ( +-  0.61% )
                60       context-switches:HG         #    0.030 K/sec                    ( +- 17.77% )
                 1       cpu-migrations:HG           #    0.001 K/sec                    ( +- 47.86% )
                46       page-faults:HG              #    0.023 K/sec                    ( +-  0.68% )
     6,107,877,836       cycles:HG                   #    3.056 GHz                      ( +-  0.34% ) [85.63%]
     4,112,602,649       stalled-cycles-frontend:HG  #   67.33% frontend cycles idle     ( +-  0.52% ) [71.41%]
   <not supported>       stalled-cycles-backend:HG
     5,000,910,172       instructions:HG             #    0.82  insns per cycle
                                                     #    0.82  stalled cycles per insn  ( +-  0.01% ) [85.72%]
     1,000,423,026       branches:HG                 #  500.478 M/sec                    ( +-  0.02% ) [85.77%]
            10,660       branch-misses:HG            #    0.00% of all branches          ( +- 13.23% ) [85.80%]
   <not supported>       L1-dcache-loads:HG
            47,492       L1-dcache-load-misses:HG    #    0.00% of all L1-dcache hits    ( +- 14.82% ) [85.80%]
            11,719       LLC-loads:HG                #    0.006 M/sec                    ( +- 42.44% ) [71.28%]
   <not supported>       LLC-load-misses:HG

       1.997319759 seconds time elapsed                                          ( +-  0.62% )

Both executables have a very low instructions-per-cycle count, probably due to the minimalistic nature of the loop and the memory pressure it is inflicting, but I have no idea why one is worse than the other.
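(A sanity check on these numbers, my arithmetic rather than perf's: the loop body is 5 instructions executed 10^9 times, which matches the ~5.0e9 instructions counted in both runs. At ~6.1e9 cycles, the faster version spends about 6 cycles per iteration, roughly what one would expect if each iteration has to wait for the previous store to forward to the next load. So the loop is presumably bound by counter's store-to-load dependency chain rather than by instruction throughput.)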

I have also tried something along the lines of

$ for prog in same-function{-l64,-f64}; do sudo perf stat -eL1-{d,i}cache-load-misses,L1-dcache-store-misses,cs,cycles,instructions -r10 ./$prog 0; done 

and

$ sudo perf record -F25000 -e'{cycles:pp,stalled-cycles-frontend}' ./same-function-l64 0
[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 6.771 MB perf.data (~295841 samples) ]
$ sudo perf report --group -Sloop0 -n --show-total-period --stdio
$ sudo perf annotate --group -sloop0 --stdio

But I had no success in finding the culprit. Still, I felt it might be helpful to jot it down here for you anyway...

Edit 2: Here is my patch to same-function.c:

$ git diff -u -U0
diff --git a/same-function.c b/same-function.c
index f78449e..78a5772 100644
--- a/same-function.c
+++ b/same-function.c
@@ -20 +20 @@ done
-volatile uint64_t counter = 0;
+//volatile uint64_t counter = 0;
@@ -22,0 +23 @@ COMPILER_NO_INLINE void loop0(void) {
+volatile uint64_t counter = 0;
@@ -31,0 +33 @@ COMPILER_NO_INLINE void loop1(void) {
+volatile uint64_t counter = 0;
@@ -40,0 +43 @@ COMPILER_NO_INLINE void loop2(void) {
+volatile uint64_t counter = 0;
@@ -49,0 +53 @@ COMPILER_NO_INLINE void loop3(void) {
+volatile uint64_t counter = 0;
@@ -58,0 +63 @@ COMPILER_NO_INLINE void loop4(void) {
+volatile uint64_t counter = 0;
@@ -67,0 +73 @@ COMPILER_NO_INLINE void loop5(void) {
+volatile uint64_t counter = 0;
@@ -76,0 +83 @@ COMPILER_NO_INLINE void loop6(void) {
+volatile uint64_t counter = 0;
@@ -85,0 +93 @@ COMPILER_NO_INLINE void loop7(void) {
+volatile uint64_t counter = 0;
@@ -94,0 +103 @@ COMPILER_NO_INLINE void loop8(void) {
+volatile uint64_t counter = 0;
@@ -103,0 +113 @@ COMPILER_NO_INLINE void loop9(void) {
+volatile uint64_t counter = 0;
@@ -135 +145 @@ int main(int argc, char** argv) {
-}
\ No newline at end of file
+}

Edit 3: The same thing with an even more minimal example:

; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; Runtime varies ~10% depending on whether the
; align directives below are commented out

section .text
global _start

_start:
    push qword 0h        ; put counter variable on stack
    jmp loop             ; jump to function
    ;align 64            ; function alignment. Try commenting this
loop:
    mov rcx, 1000000000
    ;align 64            ; loop alignment. Try commenting this
l:
    mov rax, [rsp]
    add rax, 1h
    mov [rsp], rax
    sub rcx, 1h
    jne l
fin:                     ; End of program. Exit with code 0
    mov eax, 60
    xor edi, edi
    syscall

Same effect here. Interesting.

Cheers, Benjamin


13 Comments

Modern processors are complex, non-deterministic beasts. The one you are testing has L2 and L3 cache, 4 cores, 8 threads, branch prediction and out-of-order execution. Stalls come from the myriad interactions between all these aspects of the CPU. A cache miss leads to a memory access, which leads to a possible thread switch, which leads to a flush of the branch predictor, and so on. When you look at the detailed schematic of the CPU, it's a wonder that the thing even works and gives you consistent answers most of the time. You can't expect it to time them all the same as well.
@Nathan, I took the compilation instructions from the comment on top of same-function.c, failed to note that it did not contain the -falign-loops=64 you suggested, and then checked the alignment with "eu-readelf -s same-function | grep loop" and "objdump -d same-function". I noticed that the code wasn't aligned on 64-byte boundaries, wondered whether instruction cache misses might be an issue, and so tried -falign-functions. I will try to answer your other questions in the evening, since I'm away from my machine right now. Cheers!
@Nathan, also, I was using a different architecture (Sandy Bridge), but I could reproduce your initial findings there as well.
Per Intel Optimization 3.6.5.2, putting a load too soon after a store to the same address "can lead to sizable but non-deterministic delays". It even specifically mentions volatile variables. In this case, it appears that "too soon" means "sent to the RAT in the same cycle". This causes the slowdown. Symptomatically, we can measure more store and load instructions than we expect being executed, but only the expected number being retired. I still don't understand the details of which alignment causes which speed, but I presume it must affect the way uops are being issued. Mindbending.
I don't know yet how inc with a memory operand works. I'll try to explore. But yes, this does have practical use! I've come across this problem a couple of times now, and this is the first time I really tried to figure it out. The latest application is well described here: fastcompression.blogspot.com/2014/09/…. Yann was writing a histogram to count the number of occurrences of each byte as an input to a compression algorithm. But it performed worse as the data got more predictable. The predictable case ends up behaving a lot like this loop.

A few years ago, I would have told you to check for any difference in the internal state of the CPU when it arrives at each of those loops, as this was known to have a profound effect on the out-of-order execution machinery (or something like that). For example, the performance of the very same loop could change by up to 15-20% depending on what the CPU was doing just before entering the loop, and merely jumping to it from two different points could be sufficient to change the execution speed.

In your case, it's quite easy to test. All you have to do is change the order of the calls in the switch block; for example, replacing the following:

switch (firstLetter) {
    case '0': loop0(); break;
    case '1': loop1(); break;
    case '2': loop2(); break;
    case '3': loop3(); break;
    case '4': loop4(); break;
    case '5': loop5(); break;
    case '6': loop6(); break;
    case '7': loop7(); break;
    case '8': loop8(); break;
    case '9': loop9(); break;
    default: goto die_usage;
}

with:

switch (firstLetter) {
    case '9': loop9(); break;
    case '8': loop8(); break;
    case '7': loop7(); break;
    case '6': loop6(); break;
    case '5': loop5(); break;
    case '4': loop4(); break;
    case '3': loop3(); break;
    case '2': loop2(); break;
    case '1': loop1(); break;
    case '0': loop0(); break;
    default: goto die_usage;
}

or with any random order. Of course, you should check the generated assembly code to make sure that the calls have not been reordered by the compiler.

Also, as your loops are inside individual functions, you should make sure that these functions are themselves aligned on 64-byte boundaries.
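(A minimal sketch of one way to do this per function rather than globally, assuming GCC, whose aligned attribute sets a minimum alignment for a function:)

/* Sketch (GCC): request 64-byte alignment for a single function,
   as an alternative to passing -falign-functions=64 globally. */
__attribute__((noinline, aligned(64)))
void loop0(void)
{
    /* loop body as in the question */
}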

3 Comments

I'm not sure that I understand your answer. If you are suggesting that the location from which the call occurs determines the speed of the function, I don't know what the mechanism for this might be. But I changed the order to the one you proposed, and found that the position of the function within the switch statement does not affect the speed with which it executes. That is, the speed appears to be a property of the function. To reconfirm, I also edited the generated assembly to swap the calls to the 'fastest' and 'slowest' functions from the original, and found that they remained such.
That was my idea. A few years ago, the OoO mechanism was very unpredictable. In your case, another possibility would be the difference in alignment of each function. You could also check by replacing the functions with goto statements.
"You could also check by replacing the functions with goto statements." I get similar timing differences when I allow the functions to be inlined, which I think should be the same as (or more definitive than) switching to a goto. It changes which 'function' is fast, but doesn't remove the issue. I didn't do this in the example because I thought that having everything in main() complicated things.
