I'm compiling this C code:
```c
int mode;     // use aa if true, else bb
int aa[2];
int bb[2];

inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }

int slow() { return auto1() - auto0(); }

int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
```

Both slow() and fast() are meant to do the same thing, though fast() does it with one branch instead of two. I wanted to check whether GCC would collapse the two branches into one. I've tried GCC 4.4 and 4.7 at various optimization levels (-O2, -O3, -Os, -Ofast), and they always give the same strange results:
slow():
```
movl    mode(%rip), %ecx
testl   %ecx, %ecx
je      .L10
movl    aa+4(%rip), %eax
movl    aa(%rip), %edx
subl    %edx, %eax
ret
.L10:
movl    bb+4(%rip), %eax
movl    bb(%rip), %edx
subl    %edx, %eax
ret
```

fast():
```
movl    mode(%rip), %esi
testl   %esi, %esi
jne     .L18
movl    bb+4(%rip), %eax
subl    bb(%rip), %eax
ret
.L18:
movl    aa+4(%rip), %eax
subl    aa(%rip), %eax
ret
```

Indeed, only one branch is generated in each function. However, slow() is inferior in a surprising way: it issues one extra load in each branch, for aa[0] and bb[0]. The fast() code takes those operands straight from memory in the subl instructions, without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
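One candidate rewrite I can think of (a sketch only; `tweaked` is a hypothetical name, and I haven't benchmarked it here) is to hoist the array selection into a single pointer choice, which hands the compiler one conditional and one shared subtraction explicitly:

```c
int mode;     /* use aa if true, else bb */
int aa[2];
int bb[2];

/* Hypothetical variant: pick the array base once, then index it.
   The ?: selects a base pointer (branch or cmov), and the
   subtraction is written exactly once. */
int tweaked(void) {
    const int *xx = mode ? aa : bb;
    return xx[1] - xx[0];
}
```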
Edit: here are the results with clang 4.2 on Mac OS:
slow():
```
movq    _aa@GOTPCREL(%rip), %rax   ; rax = aa (both ints at once)
movq    _bb@GOTPCREL(%rip), %rcx   ; rcx = bb
movq    _mode@GOTPCREL(%rip), %rdx ; rdx = mode
cmpl    $0, (%rdx)                 ; mode == 0 ?
leaq    4(%rcx), %rdx              ; rdx = bb[1]
cmovneq %rax, %rcx                 ; if (mode != 0) rcx = aa
leaq    4(%rax), %rax              ; rax = aa[1]
cmoveq  %rdx, %rax                 ; if (mode == 0) rax = bb[1]
movl    (%rax), %eax               ; eax = xx[1]
subl    (%rcx), %eax               ; eax -= xx[0]
```

fast():
```
movq    _mode@GOTPCREL(%rip), %rax ; rax = mode
cmpl    $0, (%rax)                 ; mode == 0 ?
je      LBB1_2                     ; if (mode != 0) {
movq    _aa@GOTPCREL(%rip), %rcx   ;   rcx = aa
jmp     LBB1_3                     ; } else {
LBB1_2:                            ;   // (mode == 0)
movq    _bb@GOTPCREL(%rip), %rcx   ;   rcx = bb
LBB1_3:                            ; }
movl    4(%rcx), %eax              ; eax = xx[1]
subl    (%rcx), %eax               ; eax -= xx[0]
```

Interesting: clang generates branchless conditionals for slow() but a branch for fast()! On the other hand, slow() does three loads (two of which are speculative, and one of which will turn out to be unnecessary) vs. two for fast(). The fast() implementation is the more "obvious" one, and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.
I also tried `-march=` to target older cores (a certain base level of functionality is assumed in 64-bit mode).