I redid the test so it is easier to reproduce, and made a few decisions to ensure the tests actually do what they claim to do even with future optimisations. I've modelled a common scenario: a function from a different compilation unit returns a new shared_ptr for you to use, and you pass it on to another function (modelled here as the constructor of a struct we store).
So the trivial copy-only model should do 3 copies and a fully move-enabled one only 1 (with 2 moves). The one in between can use only one move because of how its interface is designed: it is very common to forget to decide, or to decide on purpose, whether a function takes a new shared_ptr or consumes the caller's.
I believe this models a common, or at least valid use case.
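To check how many copies and moves each variant really performs, one option is to swap shared_ptr for an instrumented type. The sketch below is only an illustration: `Tracked`, `Counts`, `count_ops` and the `TrackedKeeper*` structs are hypothetical names that mirror the three keepers in the benchmark. Note that emplace_back forwards the temporary as an rvalue, so a by-value parameter is move-constructed rather than copied, and the tallies can come out lower than a nominal count of copies.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for std::shared_ptr<int> that tallies its own
// copies and moves (construction and assignment are counted alike).
struct Tracked {
    static inline int copies = 0;
    static inline int moves = 0;
    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
    Tracked(Tracked&&) noexcept { ++moves; }
    Tracked& operator=(const Tracked&) { ++copies; return *this; }
    Tracked& operator=(Tracked&&) noexcept { ++moves; return *this; }
};

struct Counts { int copies; int moves; };

static Tracked tracked_source;

// Mirrors get_copy(): returning a global by value always costs one copy.
static Tracked get_tracked_copy() { return tracked_source; }

// By-value parameter, copy-assigned into the member.
struct TrackedKeeperCopy {
    Tracked kept;
    TrackedKeeperCopy(Tracked t) { kept = t; }
};

// By-value parameter, move-assigned into the member.
struct TrackedKeeperMoveInternal {
    Tracked kept;
    TrackedKeeperMoveInternal(Tracked t) { kept = std::move(t); }
};

// Rvalue-reference parameter, move-assigned into the member.
struct TrackedKeeperMoveExternal {
    Tracked kept;
    TrackedKeeperMoveExternal(Tracked&& t) { kept = std::move(t); }
};

// Runs one emplace_back for the given keeper and reports what it cost.
template <typename Keeper>
Counts count_ops() {
    Tracked::copies = 0;
    Tracked::moves = 0;
    std::vector<Keeper> v;
    v.reserve(1);  // no reallocation, so no extra element moves
    v.emplace_back(get_tracked_copy());
    return {Tracked::copies, Tracked::moves};
}
```

Calling `count_ops<TrackedKeeperCopy>()`, `count_ops<TrackedKeeperMoveInternal>()` and `count_ops<TrackedKeeperMoveExternal>()` from a small main and printing the two fields gives the per-element cost of each interface. With a real shared_ptr every copy is a reference-count increment, and the by-value parameter of the copy variant adds a decrement when it is destroyed.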
    #include <chrono>
    #include <iostream>
    #include <memory>
    #include <utility>
    #include <vector>

    typedef std::chrono::steady_clock sclock;

    static std::shared_ptr<int> source;

    // noinline keeps the optimiser from seeing through these calls,
    // as if the functions lived in a different compilation unit
    __attribute__((noinline))
    static std::shared_ptr<int> get_copy() { return source; }

    struct KeeperCopy {
        std::shared_ptr<int> kept;
        __attribute__((noinline)) KeeperCopy(std::shared_ptr<int> ptr) { kept = ptr; }
    };

    struct KeeperMoveInternal {
        std::shared_ptr<int> kept;
        __attribute__((noinline)) KeeperMoveInternal(std::shared_ptr<int> ptr) { kept = std::move(ptr); }
    };

    struct KeeperMoveExternal {
        std::shared_ptr<int> kept;
        __attribute__((noinline)) KeeperMoveExternal(std::shared_ptr<int> &&ptr) { kept = std::move(ptr); }
    };

    int COUNT = 50'000'000;

    void benchmark_copy() { // three copies
        thread_local std::vector<KeeperCopy> vec_dst;
        vec_dst.clear();
        vec_dst.reserve(COUNT);
        sclock::time_point start = sclock::now();
        for (int i = 0; i < COUNT; i++) {
            vec_dst.emplace_back(get_copy());
        }
        std::cout << "benchmark_copy_only: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(sclock::now() - start).count()
                  << "ms" << std::endl;
    }

    void benchmark_move_internal() { // two copies
        thread_local std::vector<KeeperMoveInternal> vec_dst;
        vec_dst.clear();
        vec_dst.reserve(COUNT);
        sclock::time_point start = sclock::now();
        for (int i = 0; i < COUNT; i++) {
            vec_dst.emplace_back(get_copy());
        }
        std::cout << "benchmark_move_int : "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(sclock::now() - start).count()
                  << "ms" << std::endl;
    }

    void benchmark_move_external() { // one copy
        thread_local std::vector<KeeperMoveExternal> vec_dst;
        vec_dst.clear();
        vec_dst.reserve(COUNT);
        sclock::time_point start = sclock::now();
        for (int i = 0; i < COUNT; i++) {
            vec_dst.emplace_back(std::move(get_copy()));
        }
        std::cout << "benchmark_move_ext : "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(sclock::now() - start).count()
                  << "ms" << std::endl;
    }

    int main (int arg, char **argv) {
        source = std::shared_ptr<int>(new int);

        // first round also warms up the thread_local vectors
        benchmark_copy();
        benchmark_move_internal();
        benchmark_move_external();

        // use_count should be settled now
        std::cout << "use count settled: " << source.use_count() << std::endl;

        // four more rounds on the already allocated vectors
        benchmark_copy(); benchmark_move_internal(); benchmark_move_external();
        benchmark_copy(); benchmark_move_internal(); benchmark_move_external();
        benchmark_copy(); benchmark_move_internal(); benchmark_move_external();
        benchmark_copy(); benchmark_move_internal(); benchmark_move_external();
    }
My results with g++ (g++ -std=c++20 -O3 -o test -DNDEBUG test.cpp):
    benchmark_copy_only: 412ms
    benchmark_move_int : 236ms
    benchmark_move_ext : 147ms
    use count settled: 150000001
    benchmark_copy_only: 317ms
    benchmark_move_int : 177ms
    benchmark_move_ext : 114ms
    benchmark_copy_only: 311ms
    benchmark_move_int : 181ms
    benchmark_move_ext : 118ms
    benchmark_copy_only: 312ms
    benchmark_move_int : 179ms
    benchmark_move_ext : 117ms
    benchmark_copy_only: 314ms
    benchmark_move_int : 175ms
    benchmark_move_ext : 118ms
My results with clang++ (clang++ -std=c++20 -O3 -o test -DNDEBUG test.cpp):
    benchmark_copy_only: 782ms
    benchmark_move_int : 444ms
    benchmark_move_ext : 350ms
    use count settled: 150000001
    benchmark_copy_only: 718ms
    benchmark_move_int : 358ms
    benchmark_move_ext : 269ms
    benchmark_copy_only: 691ms
    benchmark_move_int : 359ms
    benchmark_move_ext : 272ms
    benchmark_copy_only: 687ms
    benchmark_move_int : 364ms
    benchmark_move_ext : 269ms
    benchmark_copy_only: 691ms
    benchmark_move_int : 356ms
    benchmark_move_ext : 270ms
So that is about 312/180/116 ms for g++ and 690/360/270 ms for clang++, on a modern AMD chip. Other architectures will behave differently. For both compilers the jump from copy to move_int is much more significant than the one from move_int to move_ext. So maybe the temporary returned from get_copy() is passed on to the constructor in a more optimised way than a full copy. That would be plausible, since emplace_back forwards the temporary as an rvalue, and a by-value shared_ptr parameter initialised from an rvalue is move-constructed.
While it's not clear how much this impacts performance in YOUR code or on YOUR platform, moving shared_ptrs around where you can is good practice, even if just to make the ownership explicit.
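As a small illustration of what "making the ownership explicit" can look like at a call site, here is a sketch with a hypothetical `Holder` type (not part of the benchmark) whose constructor takes the shared_ptr by value and moves it into the member. The caller then decides visibly whether it shares the object or hands its reference over.

```cpp
#include <memory>
#include <utility>

// Hypothetical sink type: by-value parameter, moved into the member.
struct Holder {
    std::shared_ptr<int> kept;
    explicit Holder(std::shared_ptr<int> p) : kept(std::move(p)) {}
};

// The caller keeps its own reference: one copy, so the count goes up.
long share_with_holder() {
    auto p = std::make_shared<int>(42);
    Holder h(p);            // copy into the parameter, then move into h.kept
    return p.use_count();   // p and h.kept both own the int
}

// The caller hands its reference over: no copy, and p is left empty.
bool hand_over_to_holder() {
    auto p = std::make_shared<int>(42);
    Holder h(std::move(p)); // only moves, the count never changes
    return p == nullptr && h.kept.use_count() == 1;
}
```

The `std::move` at the call site documents that the caller is done with the pointer, and a moved-from shared_ptr is guaranteed to be empty, so accidental reuse is easy to detect.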
In typical implementations a shared_ptr holds two pointers: one points at the actual object, the other at a control block containing the atomic reference counter. Moving a shared_ptr involves simply copying those pointers and resetting the source, with nothing else to be done. The reference counter stays unchanged, because the number of shared_ptrs referencing the object doesn’t change. Copying is more expensive: apart from copying the pointers, one must also increment the reference counter atomically, in a thread-safe manner. An atomic read-modify-write instruction can cost considerably more than its “non-atomic” counterpart, especially when several cores contend for the same counter.
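The effect on the reference counter can be observed directly through use_count(). The following is a minimal sketch; `UseCounts` and `observe_use_counts` are names made up for this illustration.

```cpp
#include <memory>
#include <utility>

// Hypothetical record of the reference count at three points in time.
struct UseCounts {
    long before;
    long after_copy;
    long after_move;
    bool moved_from_is_empty;
};

UseCounts observe_use_counts() {
    UseCounts r{};
    auto a = std::make_shared<int>(7);
    r.before = a.use_count();               // only a owns the int

    std::shared_ptr<int> b = a;             // copy: counter is incremented
    r.after_copy = a.use_count();           // a and b own the int

    std::shared_ptr<int> c = std::move(b);  // move: ownership goes from b to c
    r.after_move = a.use_count();           // still two owners: a and c

    r.moved_from_is_empty = (b == nullptr); // b no longer refers to anything
    return r;
}
```

The counter goes from 1 to 2 on the copy and stays at 2 across the move, which is why the move needs no atomic update of the counter.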