Any performance penalty to be expected with thread_local?

Question

Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?

Compared to non-thread_local storage. I've attempted to clarify.
– Swiss Frank, Commented May 20, 2020 at 7:38
That's a bit like asking if a static member variable is slower/faster than a non-static one. They don't do the same thing. You'll need more details on your exact scenario. (And as always - you'll need to benchmark for your specific case anyway.)
– Mat, Commented May 20, 2020 at 7:58
Static and non-static performance should be the same, depending only on whether they're in L1, L2, L3 cache or main memory, and whether they're subject to false sharing, no?
– Swiss Frank, Commented May 20, 2020 at 9:04
If you're saying the answer to my question is no, @Caleth, why not just answer the question?
– Swiss Frank, Commented May 20, 2020 at 11:07

Maxim Egorushkin · Accepted Answer · 2020-05-20 11:23:09Z

On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:

    #include <benchmark/benchmark.h>

    struct A { static unsigned n; };
    unsigned A::n = 0;

    struct B { static thread_local unsigned n; };
    thread_local unsigned B::n = 0;

    template<class T>
    void bm(benchmark::State& state) {
        for(auto _ : state)
            benchmark::DoNotOptimize(++T::n);
    }

    BENCHMARK_TEMPLATE(bm, A);
    BENCHMARK_TEMPLATE(bm, B);
    BENCHMARK_MAIN();

Results:

    Run on (16 X 5000 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x8)
      L1 Instruction 32 KiB (x8)
      L2 Unified 256 KiB (x8)
      L3 Unified 16384 KiB (x1)
    Load Average: 0.59, 0.49, 0.38
    -----------------------------------------------------
    Benchmark           Time             CPU   Iterations
    -----------------------------------------------------
    bm<A>            1.09 ns         1.09 ns    642390002
    bm<B>            1.09 ns         1.09 ns    633963210

On x86_64, thread_local variables are accessed relative to the fs register. Instructions using this addressing mode are often 2 bytes longer, so in theory they can take more time.

On other platforms it depends on how access to thread_local variables is implemented. See ELF Handling For Thread-Local Storage for more details.

Thx Maxim! So it seems the benchmarks are within a percent, and the extra two bytes have no effect with such a small amount of code? I would have expected more effect from them.
Any need for the thread_local before unsigned within struct B?
@SwissFrank thread_local is required at both declaration and definition.
OK, I see, in B n is a static thread_local member, which you then init to 0. I read too fast.
