Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?

  • Performance penalty compared to what? Commented May 20, 2020 at 7:27
  • Compared to non-thread_local storage. I've attempted to clarify. Commented May 20, 2020 at 7:38
  • That's a bit like asking if a static member variable is slower/faster than a non-static one. They don't do the same thing. You'll need more details on your exact scenario. (And as always - you'll need to benchmark for your specific case anyway.) Commented May 20, 2020 at 7:58
  • Static and non-static performance should be the same, depending only on whether they're in L1, L2, L3 cache or main memory, and whether they're subject to false sharing, no? Commented May 20, 2020 at 9:04
  • If you're saying the answer to my question is no, @Caleth, why not just answer the question? Commented May 20, 2020 at 11:07

1 Answer

On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:

    #include <benchmark/benchmark.h>

    // Ordinary static member: one instance shared by all threads.
    struct A { static unsigned n; };
    unsigned A::n = 0;

    // thread_local static member: one instance per thread.
    struct B { static thread_local unsigned n; };
    thread_local unsigned B::n = 0;

    // Time a single increment of T::n.
    template<class T>
    void bm(benchmark::State& state) {
        for(auto _ : state)
            benchmark::DoNotOptimize(++T::n);
    }

    BENCHMARK_TEMPLATE(bm, A);
    BENCHMARK_TEMPLATE(bm, B);
    BENCHMARK_MAIN();

Results:

    Run on (16 X 5000 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x8)
      L1 Instruction 32 KiB (x8)
      L2 Unified 256 KiB (x8)
      L3 Unified 16384 KiB (x1)
    Load Average: 0.59, 0.49, 0.38
    -----------------------------------------------------
    Benchmark           Time             CPU   Iterations
    -----------------------------------------------------
    bm<A>            1.09 ns         1.09 ns    642390002
    bm<B>            1.09 ns         1.09 ns    633963210

On x86_64 Linux, thread_local variables are accessed relative to the fs segment register (gs on 32-bit x86). Instructions that use this addressing mode are often 2 bytes longer, so in theory they can take more time.

On other platforms it depends on how access to thread_local variables is implemented. See ELF Handling For Thread-Local Storage for more details.
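
To make the per-thread addressing concrete, here is a minimal sketch. The names shared_counter, local_counter, Addresses and the two helper functions are hypothetical and not part of the benchmark above. It only shows that an ordinary static has one address for the whole process, while a thread_local variable lives at a different address in each thread; the compiler reaches the calling thread's copy through a per-thread base, which on x86_64 Linux is the one held in fs. Build with -pthread.

```cpp
#include <cstdint>
#include <thread>

// Hypothetical variables for illustration only.
static unsigned shared_counter = 0;              // one instance per process
static thread_local unsigned local_counter = 0;  // one instance per thread

struct Addresses {
    std::uintptr_t shared_addr;  // address of shared_counter
    std::uintptr_t local_addr;   // address of the calling thread's local_counter
};

// Report where both variables live, as seen from the calling thread.
Addresses addresses_in_this_thread() {
    return { reinterpret_cast<std::uintptr_t>(&shared_counter),
             reinterpret_cast<std::uintptr_t>(&local_counter) };
}

// Run the same query on a freshly started thread and return its answer.
Addresses addresses_in_new_thread() {
    Addresses result{};
    std::thread t([&result] { result = addresses_in_this_thread(); });
    t.join();
    return result;
}
```

Calling addresses_in_this_thread() from main and comparing it with addresses_in_new_thread() should give equal shared_addr values and different local_addr values. This says nothing about speed; it only illustrates why the access needs a thread-specific base at all.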

4 Comments

Thx Maxim! So it seems like the benchmarks are within a percent, and the extra two bytes have no effect with such a small amount of code? I would have expected more effect from them.
Any need for the thread_local before unsigned within struct B?
@SwissFrank thread_local is required at both declaration and definition.
OK, I see, in B n is a static thread_local member, which you then init to 0. I read too fast.
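
The declaration/definition point from the comments above can be sketched as follows. Counter and bump are hypothetical names chosen for illustration; the pattern itself mirrors struct B from the answer.

```cpp
struct Counter {
    // Declaration inside the class: 'static thread_local'.
    static thread_local unsigned n;
};

// Out-of-class definition: 'thread_local' is repeated here,
// while 'static' is not allowed at this point.
thread_local unsigned Counter::n = 0;

// Increment the calling thread's copy and return the new value.
unsigned bump() { return ++Counter::n; }
```

Leaving thread_local off the definition should be rejected by the compiler, since the specifier has to appear on every declaration of the variable.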