The problem is that the function address (what is actually stored in do_true and do_false) is not resolved until link time, where there are not many opportunities for optimization.
If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can make expensive_loop a C++ template, so that the compiler knows at compile time exactly which functions you want to call.

    struct function_one {
        void operator()( int element ) { }
    };

    extern int elements[];
    extern bool condition();

    template < typename DoTrue, typename DoFalse >
    void expensive_loop(){
        DoTrue do_true;
        DoFalse do_false;
        for(int i=0; i<50; i++){
            int element=elements[i];
            // long computation that produces a boolean condition
            if (condition()){
                do_true(element);  // call DoTrue's operator()
            }else{
                do_false(element); // call DoFalse's operator()
            }
        }
    }

    int main( int argc, char* argv[] ) {
        expensive_loop<function_one,function_one>();
        return 0;
    }

The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.
In the example above, note that function_one's operator() is empty. The compiler simply strips away the function call and leaves the loop:

    main:
            push    rbx
            mov     ebx, 50
    .L2:
            call    condition()
            sub     ebx, 1
            jne     .L2
            xor     eax, eax
            pop     rbx
            ret

See example in https://godbolt.org/g/hV52Nn
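To illustrate the point about one instantiation per combination of types, here is a minimal sketch that uses the same template pattern with two different combinations. All names in it (sample_elements, is_even, count_even, count_odd, do_nothing, expensive_loop_sketch, run_template_example) are made up for illustration, and the data and predicate are stand-ins, since the original leaves elements and condition() external.

```cpp
#include <cassert>

// Illustrative stand-ins (assumptions): the original declares its data and
// condition() as extern, so this sketch supplies its own.
static int sample_elements[50];
static int even_count = 0;
static int odd_count = 0;

static bool is_even(int element) { return element % 2 == 0; }

// Hypothetical functors. Each distinct type used as a template argument
// leads to a separate instantiation of the loop.
struct count_even { void operator()(int) const { ++even_count; } };
struct count_odd  { void operator()(int) const { ++odd_count; } };
struct do_nothing { void operator()(int) const {} };

template <typename DoTrue, typename DoFalse>
void expensive_loop_sketch() {
    DoTrue do_true;
    DoFalse do_false;
    for (int i = 0; i < 50; i++) {
        int element = sample_elements[i];
        if (is_even(element)) {
            do_true(element);   // target known at compile time
        } else {
            do_false(element);  // target known at compile time
        }
    }
}

// Runs two different combinations and packs the counters into one value.
int run_template_example() {
    for (int i = 0; i < 50; i++) sample_elements[i] = i;
    even_count = 0;
    odd_count = 0;

    expensive_loop_sketch<count_even, count_odd>();   // counts both branches
    assert(even_count + odd_count == 50);

    expensive_loop_sketch<count_even, do_nothing>();  // ignores the odd branch
    return even_count * 100 + odd_count;
}
```

Because the functor bodies are visible where the template is instantiated, the optimizer is free to inline them, although whether it actually does so depends on the compiler and its flags.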
With function pointers, as in your example, the compiler may not be able to inline the function calls. This is the disassembly of expensive_loop from a program where expensive_loop

    // File A.cpp
    void foo( int arg );
    void bar( int arg );

    extern bool condition();
    extern int elements[];

    void expensive_loop( void (*do_true)(int), void (*do_false)(int)){
        for(int i=0; i<50; i++){
            int element=elements[i];
            // long computation that produces a boolean condition
            if (condition()){
                do_true(element);
            }else{
                do_false(element);
            }
        }
    }

    int main( int argc, char* argv[] ) {
        expensive_loop( foo, bar );
        return 0;
    }

and the functions passed as arguments
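
    // File B.cpp
    #include <math.h>

    int elements[50];
    int result;  // sink for the results, so foo and bar can return void

    bool condition() { return elements[0] == 1; }

    // Non-inline, with the void(int) signature that A.cpp declares,
    // so that the symbols are emitted and the two files link.
    void foo( int arg ) { result = arg%3; }
    void bar( int arg ) { result = 1234%arg; }
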
are defined in different translation units.

    0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
      400620: 41 55                   push   %r13
      400622: 49 89 fd                mov    %rdi,%r13
      400625: 41 54                   push   %r12
      400627: 49 89 f4                mov    %rsi,%r12
      40062a: 55                      push   %rbp
      40062b: 53                      push   %rbx
      40062c: bb 60 10 60 00          mov    $0x601060,%ebx
      400631: 48 83 ec 08             sub    $0x8,%rsp
      400635: eb 19                   jmp    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
      400637: 66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
      40063e: 00 00
      400640: 48 83 c3 04             add    $0x4,%rbx
      400644: 41 ff d5                callq  *%r13
      400647: 48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
      40064e: 74 1d                   je     40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
      400650: 8b 2b                   mov    (%rbx),%ebp
      400652: e8 79 ff ff ff          callq  4005d0 <condition()>
      400657: 84 c0                   test   %al,%al
      400659: 89 ef                   mov    %ebp,%edi
      40065b: 75 e3                   jne    400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
      40065d: 48 83 c3 04             add    $0x4,%rbx
      400661: 41 ff d4                callq  *%r12
      400664: 48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
      40066b: 75 e3                   jne    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
      40066d: 48 83 c4 08             add    $0x8,%rsp
      400671: 5b                      pop    %rbx
      400672: 5d                      pop    %rbp
      400673: 41 5c                   pop    %r12
      400675: 41 5d                   pop    %r13
      400677: c3                      retq
      400678: 0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
      40067f: 00

You can see how the calls are still performed even when using -O3 optimization level:

      400644: 41 ff d5                callq  *%r13

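If the callbacks have to stay plain functions rather than functor types, one possible variation on the template approach is to pass the function pointers as non-type template parameters, so that the call targets are compile-time constants instead of run-time values held in registers. The following is only a sketch under assumptions: every name in it (fp_elements, is_positive, count_positive, count_non_positive, expensive_loop_fp, run_pointer_example) is hypothetical, and it assumes the function definitions are visible in the same translation unit, which is what gives the optimizer a chance to inline them.

```cpp
#include <cassert>

// Illustrative stand-ins (assumptions), not taken from the original program.
static int fp_elements[50];
static int positive_count = 0;
static int non_positive_count = 0;

static bool is_positive(int element) { return element > 0; }

// Plain functions with the void(int) signature used in the question.
static void count_positive(int) { ++positive_count; }
static void count_non_positive(int) { ++non_positive_count; }

// The function pointers are template arguments, so each instantiation
// calls fixed, known targets rather than going through a pointer variable.
template <void (*DoTrue)(int), void (*DoFalse)(int)>
void expensive_loop_fp() {
    for (int i = 0; i < 50; i++) {
        int element = fp_elements[i];
        if (is_positive(element)) {
            DoTrue(element);
        } else {
            DoFalse(element);
        }
    }
}

// Fills the data with -10..39 and packs the two counters into one value.
int run_pointer_example() {
    for (int i = 0; i < 50; i++) fp_elements[i] = i - 10;
    positive_count = 0;
    non_positive_count = 0;

    expensive_loop_fp<count_positive, count_non_positive>();
    assert(positive_count + non_positive_count == 50);
    return positive_count * 100 + non_positive_count;
}
```

Whether the calls are actually inlined still depends on the compiler and optimization level, so it is worth checking the generated assembly as done above.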