0

I have a very long (in number of iterations) for loop, and I like to make it possible to personalize some of its parts. The code looks as following:

function expensive_loop( void (*do_true)(int), void (*do_false)(int)){ for(i=0; i<VeryLargeN; i++){ element=elements[i] // long computation that produce a boolean condition if (condition){ do_true(element); }else{ do_false(element); } } } 

Now, the problem is that every time do_true and do_false are called, there is an overhead due to the push/pop of the stack that ruins the high performance of the code.

To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. This will make impossible the code to mantain.

So, how does someone make the internal part of an iteration so it can be personalized, and still mantain high performance?

9
  • 5
    There is no language C/C++. Do you use C or C++? And your question is not clear. This could be an XY problem, at least in C++ where a template might help. Commented Nov 29, 2016 at 18:14
  • 2
    "Executed inline in C/C++" is the wrong terminology. Both C and C++ are usually compiled; inlining happens at the discretion of the compiler of choice, whether you specify it or not. Compile to assembler source code and you can find out. Commented Nov 29, 2016 at 18:14
  • C != C++ and the "native" solution may vary. Also, did you test your code to see what parts were slow? Commented Nov 29, 2016 at 18:14
  • 2
    The C++ language has lambdas whereas the C language doesn't. The C++ language allows for function objects, whereas the C language doesn't. Please edit your question clarifying whether you are using lambdas, function pointers, or function objects. Commented Nov 29, 2016 at 18:28
  • 2
    Have you profiled your function to determine where the execution time is spent? I would believe that dereferencing function pointers would not be where most of the time is spent (unless the passed functions are gobbling up execution time). Commented Nov 29, 2016 at 18:32

2 Answers 2

3

Note that the function accepts pointers to functions, so those get called through a pointer. The optimizer may inline those calls through the function pointers if the definitions of expensive_loop and those functions are available and the compiler inlining limits have not been breached.

Another option is to make this algorithm a function template that accepts callable objects (function pointers, objects with a call operator, lambdas), just like standard algorithms do. This way the compiler may have more optimization opportunities. E.g.:

template<class DoTrue, class DoFalse> void expensive_loop(DoTrue do_true, DoFalse do_false) { // Original function body here. } 

There is -Winline compiler switch for g++:

-Winline

Warn if a function can not be inlined and it was declared as inline. Even with this option, the compiler will not warn about failures to inline functions declared in system headers.

The compiler uses a variety of heuristics to determine whether or not to inline a function. For example, the compiler takes into account the size of the function being inlined and the the amount of inlining that has already been done in the current function. Therefore, seemingly insignificant changes in the source program can cause the warnings produced by -Winline to appear or disappear.

It probably does not warn about a function not being inlined when it is called through a pointer.

Sign up to request clarification or add additional context in comments.

14 Comments

The definition need to be available, and typically expensive_loop will need to be inlined for do_true etc. to be too. If it's anything like std::sort, you likely won't have inlining, unlike the function object case.
@T.C. Yep, there are compiler inlining limits.
My understanding is that in order to pass a function by pointer (address), it must exist somewhere (and not inlined). So how can a compiler inline the function pointed to by a function pointer?
@ThomasMatthews It uses constant propagation for that. In f(g) the compiler knows that it is the address of g and it can just call/inline g directly when inlining f.
If the functions do_true and do_false are declared on a different translation unit, they are not inlined into expensive_loop. -Winline does not warn about this inlining not being performed.
|
2

The problem is that the function address (what actually is set in do_true and do_false is not resolved until link time, where there are not many opportunities for optimization.

If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at that time.

struct function_one { void operator()( int element ) { } }; extern int elements[]; extern bool condition(); template < typename DoTrue, typename DoFalse > void expensive_loop(){ DoTrue do_true; DoFalse do_false; for(int i=0; i<50; i++){ int element=elements[i]; // long computation that produce a boolean condition if (condition()){ do_true(element); // call DoTrue's operator() }else{ do_false(element); // call DoFalse's operator() } } } int main( int argc, char* argv[] ) { expensive_loop<function_one,function_one>(); return 0; } 

The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.

For the example I shown, note how the function is empty. The compiler just strips away the function call and leaves the loop:

main: push rbx mov ebx, 50 .L2: call condition() sub ebx, 1 jne .L2 xor eax, eax pop rbx ret 

See example in https://godbolt.org/g/hV52Nn

Using function pointers as in your example, may not inline the function calls. This is the produced assembler for main and expensive_loop in a program where expensive_loop

// File A.cpp void foo( int arg ); void bar( int arg ); extern bool condition(); extern int elements[]; void expensive_loop( void (*do_true)(int), void (*do_false)(int)){ for(int i=0; i<50; i++){ int element=elements[i]; // long computation that produce a boolean condition if (condition()){ do_true(element); }else{ do_false(element); } } } int main( int argc, char* argv[] ) { expensive_loop( foo, bar ); return 0; } 

and the functions passed by argument

// File B.cpp #include <math.h> int elements[50]; bool condition() { return elements[0] == 1; } inline int foo( int arg ) { return arg%3; } inline int bar( int arg ) { return 1234%arg; } 

are defined in different translation units.

0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>: 400620: 41 55 push %r13 400622: 49 89 fd mov %rdi,%r13 400625: 41 54 push %r12 400627: 49 89 f4 mov %rsi,%r12 40062a: 55 push %rbp 40062b: 53 push %rbx 40062c: bb 60 10 60 00 mov $0x601060,%ebx 400631: 48 83 ec 08 sub $0x8,%rsp 400635: eb 19 jmp 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30> 400637: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 40063e: 00 00 400640: 48 83 c3 04 add $0x4,%rbx 400644: 41 ff d5 callq *%r13 400647: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx 40064e: 74 1d je 40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d> 400650: 8b 2b mov (%rbx),%ebp 400652: e8 79 ff ff ff callq 4005d0 <condition()> 400657: 84 c0 test %al,%al 400659: 89 ef mov %ebp,%edi 40065b: 75 e3 jne 400640 <expensive_loop(void (*)(int), void (*)(int))+0x20> 40065d: 48 83 c3 04 add $0x4,%rbx 400661: 41 ff d4 callq *%r12 400664: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx 40066b: 75 e3 jne 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30> 40066d: 48 83 c4 08 add $0x8,%rsp 400671: 5b pop %rbx 400672: 5d pop %rbp 400673: 41 5c pop %r12 400675: 41 5d pop %r13 400677: c3 retq 400678: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 40067f: 00 

You can see how the calls are still performed even when using -O3 optimization level:

400644: 41 ff d5 callq *%r13 

2 Comments

You may be surprised to learn that modern compilers do optimize calls through function pointers and member function pointers.
They may do, but this way you make sure it does what you expect (for example, it makes sure that the definition is available in the compilation unit of expensive_loop definition while, on the other case it is only required at the time the latter function is called.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.