As far as my (limited) knowledge of optimization goes, I've always thought that compiler optimizations don't matter that much. I mean, we can surely gain some time (maybe 0...100% as a very rough first approximation) by keeping values in registers instead of memory and by unrolling loops, but the fundamental factor constraining the performance of code is really the choice of algorithms.
I've recently started a pet project: a small interpreted scripting language. It compiles to bytecode and executes it on a virtual machine. For the sake of easy debugging, profiling and testing, I naturally compiled the code with gcc's (and clang's) `-O0 -g` flags first, then with `-O2`. I then `time`'d a small program running on the VM, which basically does this (pseudo-code; I'm not showing the actual syntax until the project goes public):
```
i = 1
sum = 0
while (i < 10000000) {
    sum = (sum + i * i) mod 1000000
    i++
}
print sum
```

This translates roughly to the following pseudo-assembly:
```
load r0, 0          # sum = 0
load r1, 1          # i = 1
load r2, 1000000
load r3, 10000000
loop:
    mul r4, r1, r1  # i * i
    add r0, r0, r4  # sum += i * i
    mod r0, r0, r2  # sum %= 1000000
    inc r1
    gt  r5, r1, r3  # r5 = (i > 10000000)
    jz  r5, loop    # if not, goto loop
ret r0
```

So basically, this is a tight loop with 10000000 iterations. `time` reports that it runs in 0.47...0.52 seconds when compiled with `-O2`, in 1.51...1.74 seconds when compiled with `-O0 -g`, and in 3.16...3.47 seconds when profiling (`-pg`) is enabled as well.
As you can see, there's a 7-fold difference between the fastest and slowest execution time.
This in itself is not that surprising, since I knew that additional debugging information and the lack of small optimizations does make code run slower. But now comes the interesting part. To get a more realistic picture of what actually happens, I played the same game with Lua 5.2. After fiddling with the Makefile, I found that the very same Lua program:
```lua
local sum = 0
for i = 1, 10000000 do
    sum = (sum + i * i) % 1000000
end
print(sum)
```

runs in about 0.8...0.87 seconds when Lua is compiled with `-O0 -g -pg`, and in 0.39...0.43 seconds when only `-O2` is enabled.
So my code gets a roughly 7x speed boost when the optimizer performs its tricky fix-ups on it, whereas the Lua reference implementation benefits from the same improvements much less (about 2x).
Now my questions are:
Any idea why this happens? I suspect that the primary reason is that the creators of Lua are smarter than me and know better what the compiler and their VM are doing.
I've also caught myself thinking "well, this must be premature optimization; let me just pass it to the optimizer, it will do the job anyway" several times. Examples include one-line static functions called in the implementation of almost every VM instruction (I assumed they would be inlined when necessary anyway), and the use of various complex expressions (sometimes with not-so-easy-to-predict side effects) that could be simplified. Does this attitude of mine count too? (And is it the right attitude, by the way?)
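To make the inlining point concrete, here is a hypothetical example of the kind of one-liner I mean (not my actual code): at `-O0` every use is a genuine function call, while at `-O2` the compiler normally inlines it away entirely.

```c
/* Hypothetical helper in the style described: called from nearly
 * every VM instruction handler. At -O0 each use pays call overhead;
 * at -O2 it is typically inlined. */
static long long vm_add(long long a, long long b)
{
    return a + b;
}

/* Toy hot loop exercising the helper. */
long long sum_to(long long n)
{
    long long sum = 0;
    for (long long i = 1; i <= n; i++)
        sum = vm_add(sum, i);
    return sum;
}
```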
And anyway, should I worry about this phenomenon at all? My code runs only 1.5 times slower than Lua, after all, and that's pretty good (at least it's good enough for me). Should I try to improve the performance of the debug build just because not doing so indicates that I lack intimate knowledge of my own code? Or may I just forget about it completely, as long as the release (optimized) build is fast enough?
(If this question would be a better fit for Prog.SE or CodeReview, let me know and I'll migrate it.)