
Testing with a toy program that determines the result of a tic-tac-toe board, I got the following timings. What's causing this big difference? I'd suspect that calls to rand are faster with a statically linked libc, but I'm still surprised by the size of the gap.

~$ gcc a.c -std=c11 -O3
~$ time ./a.out
32614644

real    0m9.396s
user    0m9.388s
sys     0m0.004s
~$ gcc a.c -std=c11 -O3 -static
~$ time ./a.out
32614644

real    0m6.891s
user    0m6.884s
sys     0m0.000s

#include <stdio.h>
#include <stdlib.h>

#define SIZE 3
#define SIZE_2 (SIZE * SIZE)

static int determineResult(int board[static SIZE_2])
{
    for (int i = 0; i < SIZE_2; i += SIZE) {
        if (!board[i]) {
            continue;
        }
        for (int j = i + 1; j < i + SIZE; ++j) {
            if (board[i] != board[j]) {
                goto next;
            }
        }
        return board[i];
    next:;
    }
    for (int i = 0; i < SIZE; ++i) {
        if (!board[i]) {
            continue;
        }
        for (int j = i + SIZE; j < i + SIZE_2; j += SIZE) {
            if (board[i] != board[j]) {
                goto next2;
            }
        }
        return board[i];
    next2:;
    }
    for (int i = SIZE + 1; i < SIZE_2; i += SIZE + 1) {
        if (board[i] != *board) {
            goto next3;
        }
    }
    return *board;
next3:
    for (int i = SIZE * 2 - 2; i <= SIZE_2 - SIZE; i += SIZE - 1) {
        if (board[i] != board[SIZE - 1]) {
            return 0;
        }
    }
    return board[SIZE - 1];
}

#define N 50000000

int main(void)
{
    srand(0);
    size_t n = 0;
    for (int i = 0; i < N; ++i) {
        int board[SIZE_2];
        for (int i = 0; i < SIZE_2; ++i) {
            board[i] = rand() % 3;
        }
        n += determineResult(board);
    }
    printf("%zu\n", n);
    return EXIT_SUCCESS;
}
  • You are benchmarking the load time, which includes loading the dynamic libraries. Once loaded, there is no difference. Just add a loop in main which executes the relevant code a significant number of times. Commented Sep 13, 2015 at 0:02
  • @Olaf But does the load time take more than 2 seconds? Commented Sep 13, 2015 at 0:03
  • Profile the two executables and find out. Commented Sep 13, 2015 at 0:03
  • @AndrewHenle Profiling doesn't really help; it just tells me that rand, or more precisely random_r, is consuming most of the time. Commented Sep 13, 2015 at 0:06
  • Sorry, I'm no clairvoyant - working on it. Why don't you just try? Profile inside your program. Commented Sep 13, 2015 at 0:08

2 Answers


I can't be sure this is the reason without knowing the particular ABI (which depends on the OS and CPU architecture) your system is using, but the following is the most likely explanation.

On most implementations, code in shared libraries (including shared libc.so) has to be position-independent code. This means it can be loaded at any address (rather than assigned a fixed run-time address by the linker) and thus cannot use hard-coded absolute data addresses in the machine code. Instead, it has to access global data via either instruction-pointer-relative addressing or a global offset table (GOT) whose address is either kept in a register or computed relative to the instruction pointer. These addressing modes are efficient mainly on well-designed modern instruction set architectures like x86_64, AArch64, RISC-V, etc. On most other architectures, including 32-bit x86, they're quite inefficient. For example, the following function:

int x;
int get_x() { return x; }

will balloon into something like the following on x86:

get_x:
    push %ebp
    mov %esp, %ebp
    push %ebx
    sub $4, %esp
    call __x86.get_pc_thunk_bx
    add $_GLOBAL_OFFSET_TABLE_, %ebx
    mov x@GOT(%ebx), %eax
    mov (%eax), %eax
    add $4, %esp
    pop %ebx
    pop %ebp
    ret

whereas you would expect (for non-position-independent-code) to see:

get_x:
    mov x, %eax
    ret

Because random number generators have internal (global) state, they're stuck doing this expensive dance when built as position-independent code. And since the actual computation they do is likely very short and fast, the PIC overhead is probably a significant part of their run time.

One way to confirm this theory would be to try using rand_r or random_r instead. These functions use caller-provided state and thus can (at least in theory) avoid any internal access to global data.


2 Comments

+1 for not only providing a plausible theory, but also providing a simple way for the OP to test that theory.
You're exactly correct. I'm using an i386 build of Ubuntu.

The problem here is that you are comparing total execution time. In a small example like this, the statically linked build will come out well ahead, since there are no runtime lookups to be done.

The big difference between static and dynamic linking is that with dynamic linking, several modules/objects are linked together at run time, whereas a statically compiled binary has everything contained within the binary itself. Some specifics can vary, of course, but that's roughly it.

So... taking the above into consideration, starting an executable, loading a few different files, executing your function, and returning is probably going to take more time than loading one file and executing your function.

2 Comments

The difference does scale. With N doubled, the execution time is now 13.7 seconds for static and 18.4 seconds for dynamic.
In that case there is probably some dynamic lookup involved at runtime for every function call.