As far as my (limited) knowledge of optimization goes, I've always thought that compiler optimizations don't matter that much. I mean, we can surely gain some time (maybe 0...100% as a very rough first approximation) by keeping values in registers instead of memory and by unrolling loops, but the fundamental factor constraining the performance of code is really the choice of algorithms.
I've recently started a pet project: a small interpreted scripting language. It compiles to bytecode and executes it on a virtual machine. For the sake of easy debugging, profiling and testing, I naturally compiled the code with gcc's (and clang's) `-O0 -g` flags first, then with `-O2`. I then `time`'d a small program running on the VM, which basically does this (pseudo-code; I'm not showing the actual syntax until the project goes public):
```
i = 1
sum = 0
while (i < 10000000) {
    sum = (sum + i * i) mod 1000000
    i++
}
print sum
```

This translates roughly to the following pseudo-assembly:
```
load r0, 0          # sum = 0
load r1, 1          # i = 1
load r2, 1000000
load r3, 10000000
loop:
    mul r4, r1, r1  # i * i
    add r0, r0, r4  # sum += i * i
    mod r0, r0, r2  # sum %= 1000000
    inc r1
    gt  r5, r1, r3  # r5 = (i > 10000000)
    jz  r5, loop    # if not, goto loop
ret r0
```

So basically, this is a tight loop with 10000000 iterations. `time` reports that it runs in 0.47...0.52 seconds when compiled with `-O2`, in 1.51...1.74 seconds when compiled with `-O0 -g`, and in 3.16...3.47 seconds when profiling (`-pg`) is enabled as well.
As you can see, there's a 7-fold difference between the fastest and slowest execution time.
This in itself is not that surprising, since I knew that additional debugging information and the lack of small optimizations does make code run slower. But now comes the interesting part. To get a more realistic picture of what actually happens, I played the same game with Lua 5.2. After fiddling with the Makefile, I found that the very same Lua program:
```lua
local sum = 0
for i = 1, 10000000 do
    sum = (sum + i * i) % 1000000
end
print(sum)
```

runs in about 0.8...0.87 seconds when Lua is compiled with `-O0 -g -pg`, and in 0.39...0.43 seconds when only `-O2` is enabled.
So my code gets a roughly 7x speed boost when the optimizer performs its tricky fix-ups on it, whereas the Lua reference implementation benefits from the same improvements much less (about 2x).
Now my questions are:
Any idea why this happens? I suspect that the primary reason is that the creators of Lua are smarter than me and know better what the compiler and their VM are doing.
I've also caught myself thinking "well, this must be premature optimization; let me just pass it to the optimizer, it will do the job anyway" several times. Examples include one-line static functions called in the implementation of almost every VM instruction (I assumed they would be inlined when necessary anyway), and the use of various complex expressions (sometimes with not-so-easy-to-predict side effects) that could be simplified. Does this attitude of mine count too? (And is it the right attitude, by the way?)
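To make the inlining point concrete, here is a hypothetical example of the kind of one-liner I mean (not my actual code): at `-O0` every use is a genuine function call, while at `-O2` the compiler normally inlines it away entirely.

```c
/* Hypothetical helper in the style described: called from nearly
 * every VM instruction handler. At -O0 each use pays call overhead;
 * at -O2 it is typically inlined. */
static long long vm_add(long long a, long long b)
{
    return a + b;
}

/* Toy hot loop exercising the helper. */
long long sum_to(long long n)
{
    long long sum = 0;
    for (long long i = 1; i <= n; i++)
        sum = vm_add(sum, i);
    return sum;
}
```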
And anyway, should I worry about this phenomenon at all? My code runs only 1.5 times slower than Lua, after all, and that's pretty good (at least it's good enough for me). Should I try to improve the performance of the debug build just because not doing so indicates that I lack intimate knowledge of my own code? Or may I just forget about it completely, as long as the release (optimized) build is fast enough?
(If this question would be a better fit for Prog.SE or CodeReview, let me know and I'll migrate it.)