I'm compiling this C code:
```c
int mode;     // use aa if true, else bb
int aa[2];
int bb[2];

inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }

int slow() { return auto1() - auto0(); }

int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
```

Both slow() and fast() are meant to do the same thing, though fast() does it with one branch instead of two. I wanted to check whether GCC would collapse the two branches into one. I've tried GCC 4.4 and 4.7 at various optimization levels (-O2, -O3, -Os, -Ofast), and they always give the same strange results:
slow():
```
movl    mode(%rip), %ecx
testl   %ecx, %ecx
je      .L10
movl    aa+4(%rip), %eax
movl    aa(%rip), %edx
subl    %edx, %eax
ret
.L10:
movl    bb+4(%rip), %eax
movl    bb(%rip), %edx
subl    %edx, %eax
ret
```

fast():
```
movl    mode(%rip), %esi
testl   %esi, %esi
jne     .L18
movl    bb+4(%rip), %eax
subl    bb(%rip), %eax
ret
.L18:
movl    aa+4(%rip), %eax
subl    aa(%rip), %eax
ret
```

Indeed, only one branch is generated in each function. However, slow() is inferior in a surprising way: it issues one extra load in each branch, for aa[0] and bb[0]. The fast() code takes those operands straight from memory in the subl instructions, without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
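One candidate rewrite I can think of (a sketch only; `tweaked` is a hypothetical name, and I haven't benchmarked it here) is to hoist the array selection into a single pointer choice, which hands the compiler one conditional and one shared subtraction explicitly:

```c
int mode;     /* use aa if true, else bb */
int aa[2];
int bb[2];

/* Hypothetical variant: pick the array base once, then index it.
   The ?: selects a base pointer (branch or cmov), and the
   subtraction is written exactly once. */
int tweaked(void) {
    const int *xx = mode ? aa : bb;
    return xx[1] - xx[0];
}
```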
Edit: here are the results with clang 4.2 on Mac OS:
slow():
```
movq    _aa@GOTPCREL(%rip), %rax   ; rax = aa (both ints at once)
movq    _bb@GOTPCREL(%rip), %rcx   ; rcx = bb
movq    _mode@GOTPCREL(%rip), %rdx ; rdx = mode
cmpl    $0, (%rdx)                 ; mode == 0 ?
leaq    4(%rcx), %rdx              ; rdx = bb[1]
cmovneq %rax, %rcx                 ; if (mode != 0) rcx = aa
leaq    4(%rax), %rax              ; rax = aa[1]
cmoveq  %rdx, %rax                 ; if (mode == 0) rax = bb[1]
movl    (%rax), %eax               ; eax = xx[1]
subl    (%rcx), %eax               ; eax -= xx[0]
```

fast():
```
movq    _mode@GOTPCREL(%rip), %rax ; rax = mode
cmpl    $0, (%rax)                 ; mode == 0 ?
je      LBB1_2                     ; if (mode != 0) {
movq    _aa@GOTPCREL(%rip), %rcx   ;   rcx = aa
jmp     LBB1_3                     ; } else {
LBB1_2:                            ;   // (mode == 0)
movq    _bb@GOTPCREL(%rip), %rcx   ;   rcx = bb
LBB1_3:                            ; }
movl    4(%rcx), %eax              ; eax = xx[1]
subl    (%rcx), %eax               ; eax -= xx[0]
```

Interesting: clang generates branchless conditionals for slow() but a branch for fast()! On the other hand, slow() does three loads (two of which are speculative, and one of which will turn out to be unnecessary) vs. two for fast(). The fast() implementation is the more "obvious" one, and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.
I also tried `-march=` to target older cores (a certain base level of functionality is assumed in 64-bit mode).