
In the critical path of a multithreaded application I've encountered a performance bottleneck. Profiling indicates unexpected contention on cache lines despite what appears to be independent data access across different threads. Given the following code:

    #include <atomic>
    #include <chrono>
    #include <cstdint>   // uint64_t
    #include <iostream>
    #include <memory>    // std::unique_ptr, std::make_unique
    #include <thread>
    #include <vector>

    struct Base {
        virtual ~Base() = default;
    };

    struct DerivedA : public Base {
        alignas(64) std::atomic<uint64_t> counter_a{0};
    };

    struct DerivedB : public Base {
        alignas(64) std::atomic<uint64_t> counter_b{0};
    };

    void increment_a(DerivedA* da, int iterations) {
        for (int i = 0; i < iterations; ++i) {
            da->counter_a.fetch_add(1, std::memory_order_relaxed);
        }
    }

    void increment_b(DerivedB* db, int iterations) {
        for (int i = 0; i < iterations; ++i) {
            db->counter_b.fetch_add(1, std::memory_order_relaxed);
        }
    }

    int main() {
        // Two separately heap-allocated objects, owned through Base pointers.
        std::vector<std::unique_ptr<Base>> objects;
        objects.emplace_back(std::make_unique<DerivedA>());
        objects.emplace_back(std::make_unique<DerivedB>());

        auto* derived_a = static_cast<DerivedA*>(objects[0].get());
        auto* derived_b = static_cast<DerivedB*>(objects[1].get());

        int iterations = 10'000'000;

        // Each thread touches only its own object's counter.
        std::thread t1(increment_a, derived_a, iterations);
        std::thread t2(increment_b, derived_b, iterations);

        t1.join();
        t2.join();

        std::cout << "Counter A: " << derived_a->counter_a << std::endl;
        std::cout << "Counter B: " << derived_b->counter_b << std::endl;

        return 0;
    }

Compiler explorer

DerivedA::counter_a and DerivedB::counter_b are explicitly alignas(64) to prevent false sharing. However, profiling tools (e.g., perf) still indicate a higher-than-expected level of cache line invalidations and contention between the threads incrementing these seemingly independent counters.

perf results:

    sudo perf stat -e cache-references,cache-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses ./program
    Counter A: 10000000
    Counter B: 10000000

     Performance counter stats for './program':

                347574      cache-references                                         (66.06%)
                119491      cache-misses          #   34.379 % of all cache refs     (66.06%)
              21428394      dTLB-loads                                               (66.06%)
                  3816      dTLB-load-misses      #    0.02% of all dTLB cache accesses  (66.06%)
              19899060      dTLB-stores                                              (69.41%)
                    87      dTLB-store-misses                                        (69.42%)

           0.052937856 seconds time elapsed

           0.095077000 seconds user
           0.000000000 seconds sys

Given the inheritance structure and the heap allocation pattern (which potentially places the DerivedA and DerivedB objects contiguously in memory), is it possible that the base class Base, despite being empty, causes these objects to share a cache line, so that the alignas specifier on the derived members is insufficient to prevent false sharing between counter_a and counter_b when they are accessed by different threads?

  • Did you compile with optimizations turned on? Commented Apr 19 at 2:37
  • Compiler Explorer shows the code doing what you want, both with no optimizations and with -O2: it uses the alignment-aware operator new, allocating 128 bytes for each object with explicit 64-byte alignment. However, make sure your own compiler does that too. Drill down into the generated assembly and see exactly how your DerivedA/B objects are being allocated. For example, the same GCC 14 in Compiler Explorer does not use aligned new when set to C++14 mode (it still uses 128 bytes per object, so there should be no false sharing either, but the allocation might not be aligned to the cache line). Commented Apr 19 at 2:46
  • Also note that Base is definitely not empty. The virtual destructor makes it a polymorphic class, so every object carries a pointer to the virtual table. It is 8 bytes in size on a typical 64-bit platform. Commented Apr 19 at 2:53
