Why is C++ executable running so much faster when linked against newer libstdc++.so?

Question

I have a project (code here) in which I run benchmarks to compare the performance of different methods for computing a dot product (naive method, Eigen library, SIMD implementation, etc.). I am testing on a fresh CentOS 7.6 VM. I have noticed that when I use different versions of libstdc++.so.6, I get significantly different performance.

When I spin up a new CentOS 7.6 instance, the default C++ standard library is libstdc++.so.6.0.19. When I run my benchmark executable (linked against this version of libstdc++), the output is as follows:

    Naive Implementation, 1000000 iterations: 1448.74 ns average time
    Optimized Implementation, 1000000 iterations: 1094.2 ns average time
    AVX2 implementation, 1000000 iterations: 1069.57 ns average time
    Eigen Implementation, 1000000 iterations: 1027.21 ns average time
    AVX & FMA implementation 1, 1000000 iterations: 1028.68 ns average time
    AVX & FMA implementation 2, 1000000 iterations: 1021.26 ns average time

If I download libstdc++.so.6.0.26 and change the symbolic link libstdc++.so.6 to point to this newer library and rerun the executable (without recompiling or changing anything else), the results are as follows:

    Naive Implementation, 1000000 iterations: 297.981 ns average time
    Optimized Implementation, 1000000 iterations: 156.649 ns average time
    AVX2 implementation, 1000000 iterations: 131.577 ns average time
    Eigen Implementation, 1000000 iterations: 92.9909 ns average time
    AVX & FMA implementation 1, 1000000 iterations: 78.136 ns average time
    AVX & FMA implementation 2, 1000000 iterations: 80.0832 ns average time

Why is there such a significant improvement in speed (some implementations are 10x faster)?

Due to my use case, I may be required to link against libstdc++.so.6.0.19. Is there anything I can do in my code / on my side to see these speed improvements while using the older version of libstdc++?

Edit: I created a minimal reproducible example.

main.cpp

    #include <iostream>
    #include <vector>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <chrono>
    #include <cmath>
    #include <stdexcept>

    typedef std::chrono::high_resolution_clock Clock;

    const size_t SIZE_FLOAT = 512;

    double computeDotProductOptomized(const std::vector<uint8_t>& v1, const std::vector<uint8_t>& v2);
    void generateNormalizedData(std::vector<uint8_t>& v);

    int main() {
        // Seed for random number
        srand (time(nullptr));

        std::vector<uint8_t> v1;
        std::vector<uint8_t> v2;

        generateNormalizedData(v1);
        generateNormalizedData(v2);

        const size_t numIterations = 10000000;
        double totalTime = 0.0;

        for (size_t i = 0; i < numIterations; ++i) {
            // The clock is read twice per iteration, so its cost is part of the measurement
            auto t1 = Clock::now();
            auto similarity = computeDotProductOptomized(v1, v2);
            auto t2 = Clock::now();

            totalTime += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        }

        std::cout << "Average Time Taken: " << totalTime / numIterations << '\n';

        return 0;
    }

    double computeDotProductOptomized(const std::vector<uint8_t>& v1, const std::vector<uint8_t>& v2) {
        // The byte buffers hold SIZE_FLOAT packed floats
        const auto *x = reinterpret_cast<const float*>(v1.data());
        const auto *y = reinterpret_cast<const float*>(v2.data());

        double similarity = 0;

        for (size_t i = 0; i < SIZE_FLOAT; ++i) {
            similarity += *(x + i) * *(y + i);
        }

        return similarity;
    }

    void generateNormalizedData(std::vector<uint8_t>& v) {
        std::vector<float> vFloat(SIZE_FLOAT);
        v.resize(SIZE_FLOAT * sizeof(float));

        for (float & i : vFloat) {
            i = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
        }

        // Normalize the vector
        float mod = 0.0;

        for (float i : vFloat) {
            mod += i * i;
        }

        float mag = std::sqrt(mod);

        if (mag == 0) {
            throw std::logic_error("The input vector is a zero vector");
        }

        for (float & i : vFloat) {
            i /= mag;
        }

        memcpy(v.data(), vFloat.data(), v.size());
    }

CMakeLists.txt

    cmake_minimum_required(VERSION 3.14)
    project(dot-prod-benchmark-min-reproducible-example C CXX)

    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -Ofast -ffast-math -march=broadwell")
    set(CMAKE_BUILD_TYPE Release)
    set(CMAKE_CXX_STANDARD 14)

    add_executable(benchmark main.cpp)

Compiled on centos-release-7-6.1810.2.el7.centos.x86_64, using cmake version 3.16.2 and gcc (GCC) 7.3.1 20180303. Hardware: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, 4 vCPUs.

Using libstdc++.so.6.0.19: Average Time Taken: 1279.41
Using libstdc++.so.6.0.26: Average Time Taken: 168.219

Even the naive implementation improved significantly, can you post that code?
– gct, Commented Jan 2, 2020 at 22:35
Does your code actually work in both cases, did you check the result?
– rustyx, Commented Jan 2, 2020 at 22:51
I would time all iterations together, not each one separately. Chrono::now() can be fast or slow, depending on implementation.
– rustyx, Commented Jan 2, 2020 at 23:08
Try taking chrono::now() calls out of the loop to call them twice instead of 20M times.
– rustyx, Commented Jan 2, 2020 at 23:16
I think you are running into the issue explained here: gcc.1065356.n8.nabble.com/… Without knowing the glibc version against which libstdc++ was compiled and without the configuration flags used, it is hard to tell though. As mentioned in a comment above, try running your program with strace and watch for difference in syscalls.
– walnut, Commented Jan 2, 2020 at 23:41

cyrusbehr · Accepted Answer · 2020-01-02 23:28:44Z

rustyx was correct. The poor performance came from calling auto t1 = Clock::now(); inside the loop. Once I moved the timing outside the loop (timing the total run instead of each iteration), both versions run equally fast:

    const size_t numIterations = 10000000;

    auto t1 = Clock::now();
    for (size_t i = 0; i < numIterations; ++i) {
        auto similarity = computeDotProductOptomized(v1, v2);
    }
    auto t2 = Clock::now();

    std::cout << "Total Time Taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";

Jonathan Wakely · Accepted Answer · 2020-09-29 08:45:19Z

Your old libstdc++.so comes from GCC 4.8, and in that version the Clock::now() calls make direct system calls to the kernel to get the current time.

That is much slower than using the clock_gettime function in libc, which gets the result from the kernel's vDSO library instead of making a system call. That is what the newer libstdc++.so is doing.

Unfortunately GCC 4.8.x was released before Glibc made the clock_gettime function available without linking to librt.so and so the libstdc++.so in CentOS 7 doesn't know it could use the clock_gettime in Glibc instead of a direct system call. There's a configure option that can be used when building GCC 4.8.x that tells it to look for the function in libc.so, but the CentOS 7 compiler isn't built with that option enabled. I don't think there's any way to fix that without using a different libstdc++.so library.
