I hope that I have reduced my question to a simple and reproducible test case. The source (which is here) contains 10 copies of an identical simple loop. Each loop is of the form:
```c
#define COUNT (1000 * 1000 * 1000)

volatile uint64_t counter = 0;

void loopN(void)
{
    for (int j = COUNT; j != 0; j--) {
        uint64_t val = counter;
        val = val + 1;
        counter = val;
    }
    return;
}
```

The `volatile` qualifier on the variable is important, as it forces the value to be read from and written to memory on each iteration. Each loop is aligned to 64 bytes using `-falign-loops=64` and produces identical assembly except for the relative offset to the global:
```
400880: 48 8b 15 c1 07 20 00    mov    0x2007c1(%rip),%rdx   # 601048 <counter>
400887: 48 83 c2 01             add    $0x1,%rdx
40088b: 83 e8 01                sub    $0x1,%eax
40088e: 48 89 15 b3 07 20 00    mov    %rdx,0x2007b3(%rip)   # 601048 <counter>
400895: 75 e9                   jne    400880 <loop8+0x20>
```

I'm running Linux 3.11 on an Intel Haswell i7-4470. I'm compiling the program with GCC 4.8.1 and this command line:
```
gcc -std=gnu99 -O3 -falign-loops=64 -Wall -Wextra same-function.c -o same-function
```

I'm also using `__attribute__((noinline))` within the source to make the assembly clearer, but this is not necessary to observe the issue. I find the fastest and slowest functions with a shell loop:
```
for n in 0 1 2 3 4 5 6 7 8 9; do
    echo same-function ${n}:
    /usr/bin/time -f "%e seconds" same-function ${n}
    /usr/bin/time -f "%e seconds" same-function ${n}
    /usr/bin/time -f "%e seconds" same-function ${n}
done
```
It produces results that are consistent to about 1% from run to run, with the exact identity of the fastest and slowest functions varying depending on the exact binary layout:

```
same-function 0: 2.08 seconds  2.04 seconds  2.06 seconds
same-function 1: 2.12 seconds  2.12 seconds  2.12 seconds
same-function 2: 2.10 seconds  2.14 seconds  2.11 seconds
same-function 3: 2.04 seconds  2.04 seconds  2.05 seconds
same-function 4: 2.05 seconds  2.00 seconds  2.03 seconds
same-function 5: 2.07 seconds  2.07 seconds  1.98 seconds
same-function 6: 1.83 seconds  1.83 seconds  1.83 seconds
same-function 7: 1.95 seconds  1.98 seconds  1.95 seconds
same-function 8: 1.86 seconds  1.88 seconds  1.86 seconds
same-function 9: 2.04 seconds  2.04 seconds  2.02 seconds
```

In this case, we see that loop2() is one of the slowest to execute and loop6() is one of the fastest, with a difference of just over 10%. We reconfirm this by repeatedly testing just these two cases with a different method:
```
nate@haswell$ N=2; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
    7,180,104,866 cycles  # 3.391 GHz
    7,169,930,711 cycles  # 3.391 GHz
    7,150,190,394 cycles  # 3.391 GHz
    7,188,959,096 cycles  # 3.391 GHz
    7,177,272,608 cycles  # 3.391 GHz
    7,093,246,955 cycles  # 3.391 GHz
    7,210,636,865 cycles  # 3.391 GHz
    7,239,838,211 cycles  # 3.391 GHz
    7,172,716,779 cycles  # 3.391 GHz
    7,223,252,964 cycles  # 3.391 GHz

nate@haswell$ N=6; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
    6,234,770,361 cycles  # 3.391 GHz
    6,199,096,296 cycles  # 3.391 GHz
    6,213,348,126 cycles  # 3.391 GHz
    6,217,971,263 cycles  # 3.391 GHz
    6,224,779,686 cycles  # 3.391 GHz
    6,194,117,897 cycles  # 3.391 GHz
    6,225,259,274 cycles  # 3.391 GHz
    6,244,391,509 cycles  # 3.391 GHz
    6,189,972,381 cycles  # 3.391 GHz
    6,205,556,306 cycles  # 3.391 GHz
```

Considering this confirmed, we re-read every word in every Intel architectural manual ever written, sift through every page on the entire web that mentions the words 'computer' or 'programming', and meditate in isolation on a mountaintop for 6 years. Failing to achieve any sort of enlightenment, we come down to civilization, shave our beard, take a bath, and ask the experts of StackOverflow:
What can possibly be happening here?
Edit: With Benjamin's help (see his answer below) I've come up with an even more succinct test case. It's a standalone 20 lines of assembly. Changing SUB to SBB causes a 15% difference in performance even though the result remains the same and the same number of instructions is executed. Explanations? I think I'm getting closer to one.
```
; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; On Haswell and Sandy Bridge, observed runtime varies
; ~15% depending on whether sub or sbb is used in the loop

section .text
global _start
_start:
    push qword 0h       ; put counter variable on stack
    jmp loop            ; jump to function
align 64                ; function alignment.
loop:
    mov rcx, 1000000000
align 64                ; loop alignment.
l:
    mov rax, [rsp]
    add rax, 1h
    mov [rsp], rax
    ; sbb rcx, 1h       ; which is faster: sbb or sub?
    sub rcx, 1h         ; switch, time it, and find out
    jne l               ; (rot13 spoiler: foo vf snfgre ol 15%)
fin:                    ; If that was too easy, explain why.
    mov eax, 60
    xor edi, edi        ; End of program. Exit with code 0
    syscall
```
loop6 and loop1 - are they both 32-byte aligned? Do they have the same overall alignment?