Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?
- Performance penalty compared to what? – Mat, May 20, 2020 at 7:27
- compared to non-thread_local storage. I've attempted to clarify. – Swiss Frank, May 20, 2020 at 7:38
- That's a bit like asking if a static member variable is slower/faster than a non-static one. They don't do the same thing. You'll need more details on your exact scenario. (And as always - you'll need to benchmark for your specific case anyway.) – Mat, May 20, 2020 at 7:58
- Static and non-static performance should be the same, depending only whether they're in L1, L2, L3 cache or main memory, and whether subject to false-sharing, no? – Swiss Frank, May 20, 2020 at 9:04
- If you're saying the answer to my question is no, @Caleth, why not just answer the question? – Swiss Frank, May 20, 2020 at 11:07
1 Answer
On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:
    #include <benchmark/benchmark.h>

    struct A { static unsigned n; };
    unsigned A::n = 0;

    struct B { static thread_local unsigned n; };
    thread_local unsigned B::n = 0;

    template<class T>
    void bm(benchmark::State& state) {
        for(auto _ : state)
            benchmark::DoNotOptimize(++T::n);
    }

    BENCHMARK_TEMPLATE(bm, A);
    BENCHMARK_TEMPLATE(bm, B);
    BENCHMARK_MAIN();

Results:

    Run on (16 X 5000 MHz CPU s)
    CPU Caches:
      L1 Data 32 KiB (x8)
      L1 Instruction 32 KiB (x8)
      L2 Unified 256 KiB (x8)
      L3 Unified 16384 KiB (x1)
    Load Average: 0.59, 0.49, 0.38
    -----------------------------------------------------
    Benchmark           Time             CPU   Iterations
    -----------------------------------------------------
    bm<A>            1.09 ns         1.09 ns    642390002
    bm<B>            1.09 ns         1.09 ns    633963210

On x86_64, thread_local variables are accessed relative to the fs register. Instructions with this addressing mode are often 2 bytes longer, so theoretically they can take more time.
On other platforms it depends on how access to thread_local variables is implemented. See ELF Handling For Thread-Local Storage for more details.
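To look at the fs-relative access described above outside of a benchmark, a minimal sketch like the following can be compiled with `g++ -O2 -pthread -S` and the two accessors compared in the generated assembly. The names `plain_counter`, `tls_counter`, `bump_plain`, `bump_tls` and `tls_value_seen_by_new_thread` are hypothetical and chosen only for illustration. The comments about code generation describe what gcc and clang typically emit for a variable defined in the executable itself; shared libraries built with -fPIC may go through a different TLS access model, as the ELF document explains.

```cpp
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
unsigned plain_counter = 0;             // one object shared by all threads
thread_local unsigned tls_counter = 0;  // one separate object per thread

// Typically compiled to an increment of a RIP-relative address on x86-64.
unsigned bump_plain() { return ++plain_counter; }

// Typically compiled to an increment with an fs segment override on
// x86-64 Linux when the variable lives in the executable.
unsigned bump_tls() { return ++tls_counter; }

// Starts a fresh thread and returns the value of tls_counter that the new
// thread observes. The new thread gets its own zero-initialised copy, so
// increments made by the calling thread are not visible to it.
unsigned tls_value_seen_by_new_thread() {
    unsigned seen = ~0u;
    std::thread t([&seen] { seen = tls_counter; });
    t.join();
    return seen;
}
```

Calling `bump_tls()` a few times in the main thread and then `tls_value_seen_by_new_thread()` illustrates why the two kinds of storage are not interchangeable: the second thread still sees its own untouched copy.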
4 Comments
Swiss Frank
Thx Maxim! So it seems like the benchmarks are within a percent, but the extra two bytes have no effect with such a small amount of code? I would have expected more effect from them.
Swiss Frank
Any need for the thread_local before unsigned within struct B?
Maxim Egorushkin
@SwissFrank thread_local is required at both declaration and definition.
Swiss Frank
OK, I see, in B n is a static thread_local member, which you then init to 0. I read too fast.
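As a small sketch of the declaration/definition point from this exchange: `B` repeats the answer's pattern, with thread_local on both the in-class declaration and the out-of-class definition. `C` and `bump_in_new_thread` are hypothetical additions, not part of the answer; `C` assumes a C++17 compiler, where an inline static member can be declared and defined in one place.

```cpp
#include <cassert>
#include <thread>

struct B {
    static thread_local unsigned n;  // declaration inside the class
};
thread_local unsigned B::n = 0;      // definition: thread_local is repeated

// Hypothetical C++17 alternative: an inline variable needs no separate
// out-of-class definition.
struct C {
    static inline thread_local unsigned n = 0;
};

// Increments both members in a fresh thread and returns the sum that the
// new thread observes. Each thread starts from its own zero-initialised
// copies, so the caller's values are unaffected.
unsigned bump_in_new_thread() {
    unsigned seen = 0;
    std::thread t([&seen] { seen = ++B::n + ++C::n; });
    t.join();
    return seen;
}
```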