7

Sample code on Compiler Explorer: https://godbolt.org/g/fPfw4k

I was attempting to use an array of function pointers as a jump table instead of switches as I found it to be cleaner. However, to my surprise, neither GCC nor Clang compiler seems capable of inlining this.

Is there a specific reason why?

Example code incase of dead link:

namespace{ template<int N> int bar(){ return N; } int foo1(int n){ if(n < 0 || n > 5){ __builtin_unreachable(); } #if __clang__ __builtin_assume(n >= 0 && n <= 5); #endif static int (* const fns[])() = { bar<0>, bar<1>, bar<2>, bar<3>, bar<4>, bar<5> }; return fns[n](); } int foo2(int n){ #if __clang__ __builtin_assume(n >= 0 && n <= 5); #endif switch(n){ case 0: return bar<0>(); case 1: return bar<1>(); case 2: return bar<2>(); case 3: return bar<3>(); case 4: return bar<4>(); case 5: return bar<5>(); default: __builtin_unreachable(); } } } int main(int argc, char** argv){ volatile int n = foo1(argc); volatile int p = foo2(argc); } 

Using the language extension attribute always_inline provided by GCC & Clang makes no difference either.

10
  • 1
    A lookup table is on average more efficient that 5 test and branch operations on a processor that supports pipelining perhaps? Commented Jun 18, 2017 at 11:09
  • 1
    It could be my lack of knowledge on this subject, but where would you want the functions to be inlined? You specifically set function pointers to functions which you want to be resolved in runtime. In the switch case, the switch resolves the wanted action and the function is inlined in the case code block. Commented Jun 18, 2017 at 11:20
  • 1
    @dani no, you'd be trading constant time lookup for logarithmic. You'd be trading down. Lookup tables cost exactly one memory fetch. They're extremely cheap, particularly if the destination code is already in cache. Commented Jun 18, 2017 at 11:24
  • 1
    @dani both approaches above are equivalent to using a vector. If you look at the optimised assembler output, you'll see this. Commented Jun 18, 2017 at 11:46
  • 1
    If you remove the foo1 code, it does become a lookup table. If we would calculate the value of n by adding sufficient constexpr, it gets optimized. So, I suspect there are simply some missing optimization passes or some unexpected ordering. Commented Jun 18, 2017 at 13:49

1 Answer 1

1

The compiler cannot inline the call in foo1 because the call does not use a compile-time constant callee. If it knows that a constant argument was passed to foo1 at compile time by inlining it, it will inline the correct function.

Consider this example:

namespace{ template<int N> int bar(){ return N; } int foo1(int n){ if(n < 0 || n > 5){ __builtin_unreachable(); } #if __clang__ __builtin_assume(n >= 0 && n <= 5); #endif static int (* const fns[])() = { bar<0>, bar<1>, bar<2>, bar<3>, bar<4>, bar<5> }; return fns[n](); } } int main(int argc, char** argv){ int n = foo1(3); return n; } 

It is compiled to the following code by both compilers:

main: mov eax, 3 ret 

In the case of foo2, the compiler starts out with 5 different calls with constant callees, all of which it inlines. Then it optimizes the resulting code further, generating its own jump table if it considers it profitable.

I guess the compiler could try to extract a switch from your jump table and then inline everything, but this would be quite complex and very unlikely to yield a performance improvement in the general case, so neither gcc nor clang seems to do this.

Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.