Well, most of the time the compiler generates more instructions so that fewer of them are executed in a given run, usually by generating specialized code for different cases:
- Loop unrolling. The backward jump is taken only once every n iterations (n is often 8) instead of on every iteration.
- Function inlining. It saves the call, the return, the copying of arguments and the stack manipulation.
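To make the unrolling case concrete, here is a rough hand-written sketch of the kind of transformation a compiler might apply. The function names `sum_simple` and `sum_unrolled` and the unroll factor of 4 are my own choices for illustration; a real compiler picks its own factor and may vectorize the loop as well.

```c
#include <stddef.h>

/* Straightforward loop: one compare-and-branch per element. */
long sum_simple(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Hand-unrolled by 4 (an assumed factor, for illustration only):
 * more instructions in the binary, but only one compare-and-branch
 * per four elements in the main loop. */
long sum_unrolled(const int *a, size_t n) {
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

Both functions return the same result; the second is simply longer code that executes fewer branch instructions per element.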
Another factor is that some instructions take more time than others. In particular, conditional branches that are difficult to predict can be significantly slower. However, both calls and loops are common, and the branch predictor handles them well.
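As a small sketch of what a hard-to-predict condition looks like, consider counting the elements above a threshold. The names `count_above_branchy` and `count_above_branchless` are made up for this example, and whether the two versions actually compile to different machine code depends on the compiler and the optimization level.

```c
#include <stddef.h>

/* The if inside the loop depends on the data. On random input the
 * predictor may guess wrong often; on sorted input it rarely does. */
size_t count_above_branchy(const int *a, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold) {
            count++;
        }
    }
    return count;
}

/* Same result without a data-dependent branch in the source: the
 * comparison yields 0 or 1, which is added unconditionally. The loop
 * branch itself remains, but it is the easy-to-predict kind. */
size_t count_above_branchless(const int *a, size_t n, int threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(a[i] > threshold);
    }
    return count;
}
```

An optimizing compiler may well turn the first version into the second on its own, so treat this as an illustration of the idea rather than a guaranteed speedup.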
Another factor is the memory cache, but here things are not so clear-cut. Caches work better when the code is read linearly (as it is when functions are inlined), but a cache also has limited size, so a larger portion of the program fits in it when the code is small.
`int main (int argc, char** argv) { int i = 0; int sum = 0; for (i = 0; i < 10; i++) { sum += i + 1; } printf("%d\n",sum); }` -- with no optimizations this is 31 instructions; with -O2, it's 8 instructions. Granted, not the best example, but still: not always true.