In the critical path of a multithreaded application, I've encountered a performance bottleneck. Profiling indicates unexpected contention on cache lines, even though the threads appear to access independent data. Given the following code:

    #include <atomic>
    #include <chrono>
    #include <cstdint>   // uint64_t
    #include <iostream>
    #include <memory>    // std::unique_ptr, std::make_unique
    #include <thread>
    #include <vector>

    struct Base {
        virtual ~Base() = default;
    };

    struct DerivedA : public Base {
        alignas(64) std::atomic<uint64_t> counter_a{0};
    };

    struct DerivedB : public Base {
        alignas(64) std::atomic<uint64_t> counter_b{0};
    };

    // Each worker increments only its own object's counter.
    void increment_a(DerivedA* da, int iterations) {
        for (int i = 0; i < iterations; ++i) {
            da->counter_a.fetch_add(1, std::memory_order_relaxed);
        }
    }

    void increment_b(DerivedB* db, int iterations) {
        for (int i = 0; i < iterations; ++i) {
            db->counter_b.fetch_add(1, std::memory_order_relaxed);
        }
    }

    int main() {
        // Two separately heap-allocated objects, owned through the base class.
        std::vector<std::unique_ptr<Base>> objects;
        objects.emplace_back(std::make_unique<DerivedA>());
        objects.emplace_back(std::make_unique<DerivedB>());

        auto* derived_a = static_cast<DerivedA*>(objects[0].get());
        auto* derived_b = static_cast<DerivedB*>(objects[1].get());

        int iterations = 10'000'000;

        std::thread t1(increment_a, derived_a, iterations);
        std::thread t2(increment_b, derived_b, iterations);

        t1.join();
        t2.join();

        std::cout << "Counter A: " << derived_a->counter_a << std::endl;
        std::cout << "Counter B: " << derived_b->counter_b << std::endl;

        return 0;
    }

DerivedA::counter_a and DerivedB::counter_b are explicitly alignas(64) to prevent false sharing. However, profiling tools (e.g., perf) still indicate a higher-than-expected level of cache line invalidations and contention between the threads incrementing these seemingly independent counters.
perf stat results:

    sudo perf stat -e cache-references,cache-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses ./program
    Counter A: 10000000
    Counter B: 10000000

     Performance counter stats for './program':

                347574      cache-references                                          (66.06%)
                119491      cache-misses        #   34.379 % of all cache refs        (66.06%)
              21428394      dTLB-loads                                                (66.06%)
                  3816      dTLB-load-misses    #    0.02% of all dTLB cache accesses (66.06%)
              19899060      dTLB-stores                                               (69.41%)
                    87      dTLB-store-misses                                         (69.42%)

           0.052937856 seconds time elapsed

           0.095077000 seconds user
           0.000000000 seconds sys

Given the inheritance structure and the heap allocation pattern (which potentially places the DerivedA and DerivedB objects contiguously in memory), is it possible that the base class Base, although it declares no data members, causes these objects to share a cache line, so that the alignas specifier on the derived members is insufficient to prevent false sharing between counter_a and counter_b when they are accessed by different threads?