Larger Code is Still Faster

Question

When compiling C code with gcc, there are compiler optimizations, some that limit code size and others that create fast code.

Using the -S flag, I can see that -O2/-O3 generates more assembly than -Os. How can more assembly still be faster than less assembly?

This is often true for languages other than C as well, such as bytecode interpreters: it is faster to execute instructions moving forward than to jump back and forth to reuse code. So faster code tends to be larger.
– Reactgular, Commented Sep 30, 2013 at 1:09
With modern CPUs, the raw number of assembly instructions isn't a good indicator of speed. Other factors, like "will these instructions and data be cached by the CPU", are much more important.
– user16764, Commented Sep 30, 2013 at 1:28
Not always. Compare the output for int main (int argc, char** argv) { int i = 0; int sum = 0; for (i = 0; i < 10; i++) { sum += i + 1; } printf("%d\n",sum); } -- with no optimizations it's 31 instructions; with -O2, it's 8 instructions. Granted, not the best example, but still: not always true.
– user40980, Commented Sep 30, 2013 at 2:53
Having fewer assembly instructions in the output file isn't the same as executing fewer instructions at run time.
– user53141, Commented Sep 30, 2013 at 3:30
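The loop from user40980's comment, rewritten as a self-contained function so it can be compiled and inspected with -S (the name sum_loop is a hypothetical choice for illustration). It is a minimal sketch: with -O2, gcc can typically evaluate the whole loop at compile time and return the constant directly, which is why the optimized output in that comment is shorter.

```c
/* Sum of (i + 1) for i = 0..9, i.e. 1 + 2 + ... + 10.
   An optimizing compiler can fold this entire loop into a constant. */
int sum_loop(void)
{
    int sum = 0;
    for (int i = 0; i < 10; i++) {
        sum += i + 1;
    }
    return sum;
}
```

Calling printf("%d\n", sum_loop()) from a main function would print 55.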

Bart van Ingen Schenau · Accepted Answer · 2013-09-30 06:37:20Z

On a modern processor, there are usually several ways to achieve the result specified in a higher level language (such as C). These solutions can have different trade-offs between code size and speed due to several factors.

Not all assembly instructions take the same amount of time to execute. For example, a particular result might be achievable with 2 instructions that take 10 clock cycles each, or with 6 instructions that take 3 clock cycles each. The difference can arise because the two long instructions duplicate some of the work that the compiler avoided by using the 6 short ones.
On a modern processor, it makes a huge difference in execution speed if the next instruction is already present in the cache or if it has to come from main memory. This effect is most visible with branching instructions, because they make it harder to tell what the next instruction will be. Often, compilers will try to offset these effects by unrolling (part of) a loop into a repeated block of instructions to reduce the costs of branching/jumping.
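One concrete case of the first point above (an illustration we chose, not one given in the answer) is unsigned division by a constant. A hardware divide is a single instruction but is comparatively slow on most CPUs, so optimizing compilers commonly replace x / 10 with a multiply by a fixed-point reciprocal followed by a shift: more instructions, fewer cycles. The function names below are hypothetical, and the exact code gcc emits at -O2 versus -Os depends on the target and compiler version.

```c
#include <stdint.h>

/* Straightforward version: compiles to a single divide instruction,
   which typically has much higher latency than a multiply. */
uint32_t div10_plain(uint32_t x)
{
    return x / 10;
}

/* Equivalent result from cheaper operations: multiply by
   0xCCCCCCCD (which is ceil(2^35 / 10)) in 64-bit arithmetic,
   then shift right by 35. This is exact for every 32-bit x. */
uint32_t div10_mul(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Compiling a plain x / 10 with gcc -O2 -S and again with -Os is an easy way to see which of the two sequences a given compiler prefers at each setting.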

Jan Hudec · Accepted Answer · 2013-09-30 07:56:57Z

Well, most of the time the compiler generates more instructions so that fewer of them are executed in a given run, usually by generating specialized code for different cases:

Loop unrolling. The backward jump is only taken once every n (usually 8) iterations.
Function inlining. It saves the call and return, the copying of arguments, and the stack manipulation.
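Loop unrolling from the list above can be sketched by hand. This is a minimal illustration with hypothetical function names, unrolled by 8 to match the answer; in practice gcc performs this transformation itself (for example when asked with -funroll-loops), and the factor it picks may differ. The unrolled version is clearly larger, including a remainder loop, but takes its loop-back branch only once per 8 elements.

```c
#include <stddef.h>

/* Plain loop: one compare-and-branch per element. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same computation unrolled by 8: the main loop branches back once
   per 8 elements; a second loop handles the last n % 8 elements. */
long sum_unrolled8(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        /* The (long) cast keeps the partial sums in long arithmetic. */
        s += (long)a[i]     + a[i + 1] + a[i + 2] + a[i + 3];
        s += (long)a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
    }
    for (; i < n; i++)  /* leftover elements */
        s += a[i];
    return s;
}
```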

Another factor is that some instructions take more time than others. Conditional branches that are difficult to predict can be especially slow. However, both calls and loops are common, and the branch predictor handles them well.
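To illustrate the point about hard-to-predict conditions, here is a sketch (our own example; function names are hypothetical) of the same computation written with a data-dependent branch and without one. The branch-free version executes more instructions per element, but there is nothing for the predictor to get wrong, which tends to pay off when the data is random. Whether a compiler emits a real branch, a conditional move, or vector code for either version depends on the compiler and flags.

```c
#include <stddef.h>

/* Branchy version: whether a[i] is added depends on a conditional
   branch that is hard to predict if the data looks random. */
long sum_above_branchy(const int *a, size_t n, int threshold)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold)
            s += a[i];
    }
    return s;
}

/* Branch-free version: turn the comparison (0 or 1) into a mask of
   all zero bits or all one bits and AND it with the element. */
long sum_above_branchless(const int *a, size_t n, int threshold)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        int mask = -(a[i] > threshold);  /* 0 or -1 (all bits set) */
        s += a[i] & mask;
    }
    return s;
}
```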

Then there is the memory cache, but here things are less clear-cut. Caches work better when code is read linearly (as when functions are inlined), but they also have limited size, so a larger portion of small code will fit in the cache.
