Neither GCC nor Clang will inline calls through an array of function pointers known at compile time — why?

Question

Sample code on Compiler Explorer: https://godbolt.org/g/fPfw4k

I was attempting to use an array of function pointers as a jump table instead of switches as I found it to be cleaner. However, to my surprise, neither GCC nor Clang compiler seems capable of inlining this.

Is there a specific reason why?

Example code incase of dead link:

namespace{ template<int N> int bar(){ return N; } int foo1(int n){ if(n < 0 || n > 5){ __builtin_unreachable(); } #if __clang__ __builtin_assume(n >= 0 && n <= 5); #endif static int (* const fns[])() = { bar<0>, bar<1>, bar<2>, bar<3>, bar<4>, bar<5> }; return fns[n](); } int foo2(int n){ #if __clang__ __builtin_assume(n >= 0 && n <= 5); #endif switch(n){ case 0: return bar<0>(); case 1: return bar<1>(); case 2: return bar<2>(); case 3: return bar<3>(); case 4: return bar<4>(); case 5: return bar<5>(); default: __builtin_unreachable(); } } } int main(int argc, char** argv){ volatile int n = foo1(argc); volatile int p = foo2(argc); }

Using the language extension attribute always_inline provided by GCC & Clang makes no difference either.

A lookup table is on average more efficient that 5 test and branch operations on a processor that supports pipelining perhaps? — Richard Hodges
– Richard Hodges, Commented Jun 18, 2017 at 11:09
It could be my lack of knowledge on this subject, but where would you want the functions to be inlined? You specifically set function pointers to functions which you want to be resolved in runtime. In the switch case, the switch resolves the wanted action and the function is inlined in the case code block. — stefaanv
– stefaanv, Commented Jun 18, 2017 at 11:20
@dani no, you'd be trading constant time lookup for logarithmic. You'd be trading down. Lookup tables cost exactly one memory fetch. They're extremely cheap, particularly if the destination code is already in cache. — Richard Hodges
– Richard Hodges, Commented Jun 18, 2017 at 11:24
@dani both approaches above are equivalent to using a vector. If you look at the optimised assembler output, you'll see this. — Richard Hodges
– Richard Hodges, Commented Jun 18, 2017 at 11:46
If you remove the foo1 code, it does become a lookup table. If we would calculate the value of n by adding sufficient constexpr, it gets optimized. So, I suspect there are simply some missing optimization passes or some unexpected ordering. — JVApen
– JVApen, Commented Jun 18, 2017 at 13:49

PaulR · Accepted Answer · 2017-07-11 14:06:01Z

The compiler cannot inline the call in foo1 because the call does not use a compile-time constant callee. If it knows that a constant argument was passed to foo1 at compile time by inlining it, it will inline the correct function.

Consider this example:

namespace{ template<int N> int bar(){ return N; } int foo1(int n){ if(n < 0 || n > 5){ __builtin_unreachable(); } #if __clang__ __builtin_assume(n >= 0 && n <= 5); #endif static int (* const fns[])() = { bar<0>, bar<1>, bar<2>, bar<3>, bar<4>, bar<5> }; return fns[n](); } } int main(int argc, char** argv){ int n = foo1(3); return n; }

It is compiled to the following code by both compilers:

main: mov eax, 3 ret

In the case of foo2, the compiler starts out with 5 different calls with constant callees, all of which it inlines. Then it optimizes the resulting code further, generating its own jump table if it considers it profitable.

I guess the compiler could try to extract a switch from your jump table and then inline everything, but this would be quite complex and very unlikely to yield a performance improvement in the general case, so neither gcc nor clang seems to do this.

Collectives™ on Stack Overflow

Neither GCC nor Clang will inline calls through an array of function pointers known at compile time — why?

1 Answer 1

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Linked

Related