Cost of copy vs move std::shared_ptr

Question

Why would I std::move an std::shared_ptr?

Answers to this question point out that moving a std::shared_ptr is all about speed, but nobody explains in detail why it is faster. How expensive is a copy really in comparison? Is it worth optimizing when one uses it a lot?

It's worth optimizing when profiling shows it's a bottleneck in your application; otherwise it's not.
– tkausl, Commented Nov 21, 2021 at 11:14
As the accepted answer (and its comments) goes into detail about the cost of synchronising the counter of the shared pointer, could you change your question so it's not a duplicate?
– Richard Critten, Commented Nov 21, 2021 at 11:18
@UnholySheep Is it twice, dozens, or hundreds of times faster? If it's exactly 100 times faster, as stated in this answer, it would be nice to have some supporting data.
– ManuelSchneid3r, Commented Nov 21, 2021 at 11:28
A shared_ptr has two levels of indirection. One points at a data structure with an atomic counter. The next one points from there at the actual object. Moving a shared pointer involves a simple copy of the first pointer with nothing else to be done. The reference counter stays unchanged, because the number of shared_ptrs referencing it doesn't change. Copying is much harder: apart from copying a pointer, one must (also) atomically increment the reference counter in a thread-safe manner. An atomic instruction may take time equivalent to thousands of "non-atomic" instructions.
– Andrej Podzimek, Commented Nov 21, 2021 at 11:34
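
The difference described in the comment above can be observed through use_count(). This is only a minimal sketch: the `Counts` struct and the `copy_then_move_counts` helper are names made up for illustration, and it shows what happens to the reference count, not how long the operations take.

```cpp
#include <memory>
#include <utility>

// Hypothetical helper type: records use_count() after each step.
struct Counts {
    long initial;
    long after_copy;
    long after_move;
    bool source_emptied;
};

Counts copy_then_move_counts() {
    auto a = std::make_shared<int>(42);
    Counts c{};
    c.initial = a.use_count();

    // Copy: a second owner appears, so the shared counter is incremented.
    std::shared_ptr<int> b = a;
    c.after_copy = a.use_count();

    // Move: ownership is transferred from b to d, the counter is untouched.
    std::shared_ptr<int> d = std::move(b);
    c.after_move = a.use_count();

    // The standard guarantees that a moved-from shared_ptr is empty.
    c.source_emptied = (b == nullptr);
    return c;
}
```

Calling `copy_then_move_counts()` from `main` should give counts of 1, 2 and 2: the copy adds an owner, the move only hands one over and leaves the source empty.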

ManuelSchneid3r · Accepted Answer · 2021-11-21 12:10:40Z

I wrote a benchmark. On my MacBook Air moving is about three times faster than copying (g++ as well as clang++, both with -std=c++17 -O3 -DNDEBUG). Let me know if you see problems with the benchmark.

    #include <chrono>
    #include <iostream>
    #include <vector>
    #include <memory>

    using namespace std;
    using namespace std::chrono;

    int COUNT = 50'000'000;

    // RAII timer: prints the elapsed time when it goes out of scope.
    struct TimeIt {
        system_clock::time_point start;
        TimeIt() { start = system_clock::now(); }
        ~TimeIt() {
            auto runtime = duration_cast<milliseconds>(system_clock::now() - start).count();
            cout << runtime << " ms" << endl;
        }
    };

    void benchmark_copy(const vector<shared_ptr<int>> &vec_src) {
        cout << "benchmark_copy" << endl;
        vector<shared_ptr<int>> vec_dst;
        vec_dst.reserve(COUNT);
        TimeIt ti;  // destroyed before vec_dst, so freeing vec_dst is not timed
        for (auto &sp : vec_src)
            vec_dst.emplace_back(sp);  // copy each shared_ptr
    }

    void benchmark_move(vector<shared_ptr<int>> &&vec_src) {
        cout << "benchmark_move" << endl;
        vector<shared_ptr<int>> vec_dst;
        vec_dst.reserve(COUNT);
        TimeIt ti;
        for (auto &sp : vec_src)
            vec_dst.emplace_back(move(sp));  // move each shared_ptr
    }

    int main (int arg, char **argv){
        vector<shared_ptr<int>> vec;
        for (int i = 0; i < COUNT; ++i)
            vec.emplace_back(new int);
        benchmark_copy(vec);
        benchmark_move(move(vec));
    }

I made a newer version, as this one was unreliable for me and its design does not allow the tests to be repeated. Optimisations also make it unclear what actually happens here.
GNU's libstdc++ checks whether the program is single-threaded or multithreaded, so this benchmark yields misleading results: snf.github.io/2019/02/13/shared-ptr-optimization

Seneral · Accepted Answer · 2024-11-27 18:14:05Z

Redid the test so it is easier to repeat. Made a few decisions to ensure the tests would actually do what they claim to do even with future optimisations. I've modelled a common scenario: A function from a different compilation unit returns a new shared_ptr for you to use, and you pass it on to another function (here modelled as a constructor of a struct we store).

So the trivial copy-only model should do 3 copies, and a fully move-enabled one only 1 (with 2 moves). The one in between uses only one move because of how the interface is designed: it is very common to forget to design, or to purposefully design, a function to either take a new shared_ptr or consume one.

I believe this models a common, or at least valid use case.

    #include <chrono>
    #include <iostream>
    #include <vector>
    #include <memory>

    typedef std::chrono::steady_clock sclock;

    static std::shared_ptr<int> source;

    // noinline stands in for a function from a different compilation unit
    __attribute__((noinline))
    static std::shared_ptr<int> get_copy() { return source; }

    // Takes by value and copies into the member.
    struct KeeperCopy {
        std::shared_ptr<int> kept;
        __attribute__((noinline))
        KeeperCopy(std::shared_ptr<int> ptr) { kept = ptr; };
    };

    // Takes by value and moves into the member.
    struct KeeperMoveInternal {
        std::shared_ptr<int> kept;
        __attribute__((noinline))
        KeeperMoveInternal(std::shared_ptr<int> ptr) { kept = std::move(ptr); };
    };

    // Takes an rvalue reference and moves into the member.
    struct KeeperMoveExternal {
        std::shared_ptr<int> kept;
        __attribute__((noinline))
        KeeperMoveExternal(std::shared_ptr<int> &&ptr) { kept = std::move(ptr); };
    };

    int COUNT = 50'000'000;

    void benchmark_copy() { // modelled as: three copies
        thread_local std::vector<KeeperCopy> vec_dst;
        vec_dst.clear();
        vec_dst.reserve(COUNT);
        sclock::time_point start = sclock::now();
        for(int i = 0; i < COUNT; i++) {
            vec_dst.emplace_back(get_copy());
        }
        std::cout << "benchmark_copy_only: " << std::chrono::duration_cast<std::chrono::milliseconds>(sclock::now()-start).count() << "ms" << std::endl;
    }

    void benchmark_move_internal() { // modelled as: two copies
        thread_local std::vector<KeeperMoveInternal> vec_dst;
        vec_dst.clear();
        vec_dst.reserve(COUNT);
        sclock::time_point start = sclock::now();
        for(int i = 0; i < COUNT; i++) {
            vec_dst.emplace_back(get_copy());
        }
        std::cout << "benchmark_move_int : " << std::chrono::duration_cast<std::chrono::milliseconds>(sclock::now()-start).count() << "ms" << std::endl;
    }

    void benchmark_move_external() { // modelled as: one copy
        thread_local std::vector<KeeperMoveExternal> vec_dst;
        vec_dst.clear();
        vec_dst.reserve(COUNT);
        sclock::time_point start = sclock::now();
        for(int i = 0; i < COUNT; i++) {
            // get_copy() already yields an rvalue; std::move only makes the intent explicit
            vec_dst.emplace_back(std::move(get_copy()));
        }
        std::cout << "benchmark_move_ext : " << std::chrono::duration_cast<std::chrono::milliseconds>(sclock::now()-start).count() << "ms" << std::endl;
    }

    int main (int arg, char **argv){
        source = std::shared_ptr<int>(new int);

        benchmark_copy();
        benchmark_move_internal();
        benchmark_move_external();

        // use_count should be settled now
        std::cout << "use count settled: " << source.use_count() << std::endl;

        benchmark_copy();
        benchmark_move_internal();
        benchmark_move_external();

        benchmark_copy();
        benchmark_move_internal();
        benchmark_move_external();

        benchmark_copy();
        benchmark_move_internal();
        benchmark_move_external();

        benchmark_copy();
        benchmark_move_internal();
        benchmark_move_external();
    }

My results with g++ (g++ -std=c++20 -O3 -o test -DNDEBUG test.cpp):

    benchmark_copy_only: 412ms
    benchmark_move_int : 236ms
    benchmark_move_ext : 147ms
    use count settled: 150000001
    benchmark_copy_only: 317ms
    benchmark_move_int : 177ms
    benchmark_move_ext : 114ms
    benchmark_copy_only: 311ms
    benchmark_move_int : 181ms
    benchmark_move_ext : 118ms
    benchmark_copy_only: 312ms
    benchmark_move_int : 179ms
    benchmark_move_ext : 117ms
    benchmark_copy_only: 314ms
    benchmark_move_int : 175ms
    benchmark_move_ext : 118ms

My results with clang++ (clang++ -std=c++20 -O3 -o test -DNDEBUG test.cpp):

    benchmark_copy_only: 782ms
    benchmark_move_int : 444ms
    benchmark_move_ext : 350ms
    use count settled: 150000001
    benchmark_copy_only: 718ms
    benchmark_move_int : 358ms
    benchmark_move_ext : 269ms
    benchmark_copy_only: 691ms
    benchmark_move_int : 359ms
    benchmark_move_ext : 272ms
    benchmark_copy_only: 687ms
    benchmark_move_int : 364ms
    benchmark_move_ext : 269ms
    benchmark_copy_only: 691ms
    benchmark_move_int : 356ms
    benchmark_move_ext : 270ms

So about 312/180/116 for g++ and 690/360/270 for clang, on a modern AMD chip. Other architectures will behave differently. It's clear the jump from copy to move_int is much more significant than from move_int to move_ext for both compilers. So maybe the copy returned from get_copy() is somehow passed on to the constructor in a more optimised way.
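
One way to check how the value returned by get_copy() actually reaches each constructor is to swap shared_ptr for a type that counts its own copies and moves. The following is only a sketch under assumptions: `Tracked`, `Tally`, `count_ops` and the two global counters are hypothetical names that are not part of the benchmark, the keeper structs mirror the ones above without the noinline attribute, and C++17 or later is assumed so that the prvalue returned by get_copy() is not copied again on return.

```cpp
#include <utility>
#include <vector>

// Hypothetical counters for the instrumented type below.
static int g_copies = 0;
static int g_moves = 0;

// Hypothetical stand-in for std::shared_ptr<int> that counts copies and moves.
struct Tracked {
    Tracked() = default;
    Tracked(const Tracked&) { ++g_copies; }
    Tracked(Tracked&&) noexcept { ++g_moves; }
    Tracked& operator=(const Tracked&) { ++g_copies; return *this; }
    Tracked& operator=(Tracked&&) noexcept { ++g_moves; return *this; }
};

static Tracked source;
static Tracked get_copy() { return source; }  // always exactly one copy

// Same three interfaces as in the benchmark.
struct KeeperCopy {
    Tracked kept;
    KeeperCopy(Tracked ptr) { kept = ptr; }
};
struct KeeperMoveInternal {
    Tracked kept;
    KeeperMoveInternal(Tracked ptr) { kept = std::move(ptr); }
};
struct KeeperMoveExternal {
    Tracked kept;
    KeeperMoveExternal(Tracked&& ptr) { kept = std::move(ptr); }
};

struct Tally { int copies; int moves; };

// Counts the copies and moves caused by one emplace_back(get_copy()).
template <class Keeper>
Tally count_ops() {
    std::vector<Keeper> dst;
    dst.reserve(1);  // no reallocation, as in the benchmark
    g_copies = 0;
    g_moves = 0;
    dst.emplace_back(get_copy());
    return {g_copies, g_moves};
}
```

With this stand-in, `count_ops<KeeperCopy>()` reports 2 copies and 1 move, `count_ops<KeeperMoveInternal>()` 1 copy and 2 moves, and `count_ops<KeeperMoveExternal>()` 1 copy and 1 move. The temporary from get_copy() is moved, not copied, into a by-value constructor parameter, so the by-value keepers pay one extra move rather than one extra copy. That would fit the smaller gap between move_int and move_ext, though the tally says nothing about timing.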

While it's not clear how much this impacts performance in YOUR code or on YOUR platform, it's clear that moving them around is good practice, even if just to make the ownership explicit.
