
I found the topic Why is it faster to process a sorted array than an unsorted array? and tried to run its code, and I noticed some strange behavior. If I compile the code with the -O3 optimization flag it takes 2.98605 sec to run; if I compile with -O2 it takes 1.98093 sec. I ran it several times (5 or 6) on the same machine in the same environment, with all other software (Chrome, Skype, etc.) closed.

    gcc --version
    gcc (Ubuntu 4.9.2-0ubuntu1~14.04) 4.9.2
    Copyright (C) 2014 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So can you please explain why this happens? I read the gcc manual and I see that -O3 includes -O2. Thanks for any help.

P.S. Added the code:

    #include <algorithm>
    #include <cstdlib>   // std::rand
    #include <ctime>
    #include <iostream>

    int main()
    {
        // Generate data
        const unsigned arraySize = 32768;
        int data[arraySize];
        for (unsigned c = 0; c < arraySize; ++c)
            data[c] = std::rand() % 256;

        // !!! With this, the next loop runs faster
        std::sort(data, data + arraySize);

        // Test
        clock_t start = clock();
        long long sum = 0;
        for (unsigned i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (unsigned c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
        std::cout << elapsedTime << std::endl;
        std::cout << "sum = " << sum << std::endl;
    }
  • Did you run each program once? You should try a few times. Also make sure nothing else is running on the machine you use for benchmarking. Commented Mar 5, 2015 at 10:21
  • @BasileStarynkevitch I added the code. I tried several times and got the same results. I also tried compiling with -mtune=native: same result as without that flag. Processor: Intel Core i5-2400. Commented Mar 5, 2015 at 10:24
  • I just experimented a bit and added to -O2, one at a time, the additional optimizations that -O3 performs. The additional optimization flags -O3 adds for me are: -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops. I found that adding -ftree-vectorize to -O2 is the one that has the negative impact. I'm on Windows 7 with mingw-gcc 4.7.2. Commented Mar 5, 2015 at 10:45
  • @doctorlove I can't explain why it is slower with auto-vectorization of the loops, so I thought it was too little information for an answer :) Commented Mar 5, 2015 at 11:11
  • Changing the variable sum from a local to a global or static one makes the difference between -O2 and -O3 vanish. The problem seems to be related to lots of stack operations to store and retrieve sum inside the loop when it's local. My knowledge of assembly is too limited to fully understand the code gcc generates :) Commented Mar 5, 2015 at 13:18

2 Answers


gcc -O3 uses a cmov for the conditional, so it lengthens the loop-carried dependency chain to include a cmov (which is 2 uops and 2 cycles of latency on your Intel Sandybridge CPU, according to Agner Fog's instruction tables. See also the tag wiki). This is one of the cases where cmov sucks.
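
In C++ terms, the -O3 code is effectively applying the select to the accumulator itself, so each iteration's result depends on the previous one. A scalar model of that dependency (my paraphrase, not compiler output):

    // Each iteration computes sum+data[c], then conditionally keeps it.
    // The cmov that picks between "old sum" and "sum+data[c]" is part of
    // the chain that produces the next iteration's sum.
    for (unsigned c = 0; c < arraySize; ++c)
        sum = (data[c] >= 128) ? sum + data[c] : sum;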

If the data was even moderately unpredictable, cmov would probably be a win, so this is a fairly sensible choice for a compiler to make. (However, compilers may sometimes use branchless code too much.)

I put your code on the Godbolt compiler explorer to see the asm (with nice highlighting and filtering out irrelevant lines. You still have to scroll down past all the sort code to get to main(), though).

    .L82:  # the inner loop from gcc -O3
        movsx   rcx, DWORD PTR [rdx]  # sign-extending load of data[c]
        mov     rsi, rcx
        add     rcx, rbx              # rcx = sum+data[c]
        cmp     esi, 127
        cmovg   rbx, rcx              # sum = data[c]>127 ? rcx : sum
        add     rdx, 4                # pointer-increment
        cmp     r12, rdx
        jne     .L82

gcc could have saved the MOV by using LEA instead of ADD.

The loop bottlenecks on the latency of ADD->CMOV (3 cycles), since one iteration of the loop writes rbx with CMOV, and the next iteration reads rbx with ADD.

The loop only contains 8 fused-domain uops, so it can issue at one iteration per 2 cycles (the front-end issues at most 4 fused-domain uops per cycle). Execution-port pressure is also not as bad a bottleneck as the latency of the sum dep chain, but it's close (Sandybridge only has 3 ALU ports, unlike Haswell's 4).

BTW, writing it as sum += (data[c] >= 128 ? data[c] : 0); to take the cmov out of the loop-carried dep chain is potentially useful. Still lots of instructions, but the cmov in each iteration is independent. This compiles as expected in gcc6.3 -O2 and earlier, but gcc7 de-optimizes into a cmov on the critical path (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666). (It also auto-vectorizes with earlier gcc versions than the if() way of writing it.)
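
Spelled out as a loop, that rewrite looks like this (same variables as the code in the question):

    // The cmov now selects between data[c] and 0, independently each
    // iteration; only the final add is on the sum dependency chain.
    for (unsigned c = 0; c < arraySize; ++c)
        sum += (data[c] >= 128 ? data[c] : 0);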

Clang takes the cmov off the critical path even with the original source.


gcc -O2 uses a branch (for gcc5.x and older), which predicts well because your data is sorted. Since modern CPUs use branch-prediction to handle control dependencies, the loop-carried dependency chain is shorter: just an add (1 cycle latency).

The compare-and-branch in every iteration is independent, thanks to branch-prediction + speculative execution, which lets execution continue before the branch direction is known for sure.

    .L83:  # The inner loop from gcc -O2
        movsx   rcx, DWORD PTR [rdx]  # load with sign-extension from int32 to int64
        cmp     ecx, 127
        jle     .L82                  # conditional-jump over the next instruction
        add     rbp, rcx              # sum+=data[c]
    .L82:
        add     rdx, 4
        cmp     rbx, rdx
        jne     .L83

There are two loop-carried dependency chains: sum and the loop-counter. sum is 0 or 1 cycle long, and the loop-counter is always 1 cycle long. However, the loop is 5 fused-domain uops on Sandybridge, so it can't execute at 1c per iteration anyway, so latency isn't a bottleneck.

It probably runs at about one iteration per 2 cycles (bottlenecked on branch instruction throughput), vs. one per 3 cycles for the -O3 loop. The next bottleneck would be ALU uop throughput: 4 ALU uops (in the not-taken case) but only 3 ALU ports. (ADD can run on any port).

This pipeline-analysis prediction matches pretty much exactly with your timings of ~3 sec for -O3 vs. ~2 sec for -O2.
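
As a rough sanity check, assuming the i5-2400 was running near its ~3.3GHz turbo clock (an assumption about the machine, not something stated in the question):

    32768 elements * 100000 outer iterations = 3.2768e9 inner iterations
    -O3 (cmov):   3 cycles/iter * 3.2768e9 iters / 3.3 GHz  ~= 3.0 sec
    -O2 (branch): 2 cycles/iter * 3.2768e9 iters / 3.3 GHz  ~= 2.0 sec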


Haswell/Skylake could run the not-taken case at one per 1.25 cycles, since it can execute a not-taken branch in the same cycle as a taken branch and has 4 ALU ports. (Or slightly less since a 5 uop loop doesn't quite issue at 4 uops every cycle).

(Just tested: Skylake @ 3.9GHz runs the branchy version of the whole program in 1.45s, or the branchless version in 1.68s. So the difference is much smaller there.)


g++6.3.1 uses cmov even at -O2, but g++5.4 still behaves like 4.9.2.

With both g++6.3.1 and g++5.4, using -fprofile-generate / -fprofile-use produces the branchy version even at -O3 (with -fno-tree-vectorize).
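
For anyone who wants to reproduce that, a sketch of the usual two-step PGO build (the file name here is just a placeholder):

    # build an instrumented binary and run it once to collect a profile
    g++ -O3 -fno-tree-vectorize -fprofile-generate test.cpp -o test
    ./test
    # rebuild using the collected profile; gcc then emits the branchy version
    g++ -O3 -fno-tree-vectorize -fprofile-use test.cpp -o test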

The CMOV version of the loop from newer gcc uses add ecx,-128 / cmovge rbx,rdx instead of CMP/CMOV. That's kinda weird, but probably doesn't slow it down. ADD writes an output register as well as flags, so creates more pressure on the number of physical registers. But as long as that's not a bottleneck, it should be about equal.


Newer gcc auto-vectorizes the loop with -O3, which is a significant speedup even with just SSE2. (e.g. my i7-6700k Skylake runs the vectorized version in 0.74s, so about twice as fast as scalar. Or -O3 -march=native in 0.35s, using AVX2 256b vectors).

The vectorized version looks like a lot of instructions, but it's not too bad, and most of them aren't part of a loop-carried dep chain. It only has to unpack to 64-bit elements near the end. It does pcmpgtd twice, though, because it doesn't realize it could just zero-extend instead of sign-extend when the condition has already zeroed all negative integers.
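
Conceptually the vectorized code replaces the per-element select with a compare that builds an all-ones / all-zeros mask and an AND, something like this scalar sketch (my paraphrase of the SIMD logic, not the actual compiler output):

    for (unsigned c = 0; c < arraySize; ++c) {
        // pcmpgtd: all-ones where data[c] > 127, otherwise zero
        int mask = (data[c] > 127) ? -1 : 0;
        // zero out the rejected elements, then widen and accumulate
        sum += static_cast<long long>(data[c] & mask);
    }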


7 Comments

BTW, I saw this question ages ago, probably when it was first posted, but I guess I got side-tracked and didn't answer it until now (when I was reminded of it).
Do -fprofile-generate and -fprofile-use help in this case?
@MarcGlisse: Just tested: yes, g++5.4 and g++6.3.1 make the same branchy code with -O3 -fno-tree-vectorize -fprofile-use. (Even though without PGO, g++6.3.1 uses CMOV even at -O2). On 3.9GHz Skylake, the CMOV version runs in 1.68s, while the branchy version runs in 1.45s, so the difference is much smaller with efficient CMOV.
@MarcGlisse: updated the answer with more stuff. Why is newer gcc using add ecx, -128 instead of a CMP? Is that just for code-size reasons (since -128 fits in a sign-extended imm8)? I guess it's probably worth it even though it writes ecx for no reason, since ecx is dead at that point and OoO execution can free the register soon. I'm surprised it still doesn't use LEA to compute sum+data[c] in a different register to avoid the MOV, though.
A lot of it seems to be tuning choices, playing with -mtune=... changes add to cmp. No idea about lea. On a skylake laptop, -O3 code is significantly faster than -O2 code.

A more general answer regarding compiler optimization would be:

Unless the compiler knows exactly which CPU you are compiling for, it can only make assumptions about which code is most likely to run faster.

And by "exactly which CPU" I don't mean x86_64, which is an architecture; I mean exactly which x86 CPU model, because the same code can have very different runtime performance on an Intel i3 and an Intel i5, on two different generations of an Intel i5, or on an Intel and an AMD CPU. The compiler tries to produce code that runs well on every CPU that supports the same architecture and instruction set, but that code will run better on some CPU models and worse on others.
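
In gcc terms, that is what -march=native / -mtune=native (mentioned elsewhere on this page) are for, at the cost of the binary being tuned to the build machine:

    # generate code (and tuning) for exactly the CPU the compiler runs on
    g++ -O3 -march=native test.cpp -o test
    # or keep the generic instruction set but tune instruction selection
    # and scheduling for the local CPU
    g++ -O3 -mtune=native test.cpp -o test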

Sometimes an optimization that makes sense, and will probably result in faster execution on most CPUs, unfortunately results in slower execution on some CPUs.

Code usually processes data, and the compiler cannot know what that data will look like at runtime. The data itself, however, affects the runtime behavior of the CPU through branch prediction, data caching, memory access patterns, and so on. So code that speeds up processing of one type of data may actually slow down processing of another type of data.

Finally, modern CPUs often don't actually execute the instructions you feed them as-is. They translate the instructions into micro-operations, which the CPU can then optimize internally before execution. The compiler has no way of knowing what is going on inside the CPU, so an optimization that sounds plausible may actually slow things down by making those internal CPU optimizations difficult or impossible. More optimization is not always better in the end. Too many cooks spoil the broth.

In your specific case, the compiler thought it was a good idea to reduce branching, and usually that is a good idea. However, if the data is sorted and branch prediction works very well, then code with branches can be faster than branchless code. If you use -O2 instead of -O3, the compiler keeps the branches, which is a win in this very specific case.

