Why is inserting sorted keys into std::set so much faster than inserting shuffled keys?

Question

I was accidentally surprised to found that inserting sorted keys into std::set is much much faster than inserting shuffled keys. This is somewhat counterintuitive since a red-black tree (I verified that std::set is implemented as a red-black tree on my system) as a self-balanced binary search tree, would need to do a lot of rebalancing opeartions to insert a sequence of sorted keys, thus inserting sorted keys should take more time than inserting shuffled keys.

But the fact is, inserting sorted keys can be up to 15 times faster than inserting shuffled keys! Here is my testing code and some results:

#include <algorithm> #include <chrono> #include <iostream> #include <random> #include <set> #include <vector> using namespace std; int64_t insertion_time(const vector<int> &keys) { auto start = chrono::system_clock::now(); set<int>(keys.begin(), keys.end()); auto stop = chrono::system_clock::now(); auto elapsed = chrono::duration_cast<chrono::milliseconds>(stop - start); return elapsed.count(); } int main() { size_t test_size; cout << "test size: "; cin >> test_size; vector<int> keys(test_size); for (int i = 0; i < test_size; ++i) { keys[i] = i; } // whether shuffled case or sorted case took first was irrelevant and results were similar auto rng = std::default_random_engine {}; shuffle(keys.begin(), keys.end(), rng); cout << "shuffled: " << insertion_time(keys) << endl; sort(keys.begin(), keys.end()); cout << "sorted: " << insertion_time(keys) << endl; return 0; }

// i7 8700, 32 GB RAM, WIN10 2004, g++ -O3 main.cpp // An interesting observation is that the difference becomes larger as test_size being larger. // Similar results showed up for my handwritten red-black tree and other // machines( or other compilers, operating systems etc) C:\Users\Leon\Desktop\testSetInsertion>a test size: 1000000 shuffled: 585 sorted: 96 C:\Users\Leon\Desktop\testSetInsertion>a test size: 3000000 shuffled: 2480 sorted: 296 C:\Users\Leon\Desktop\testSetInsertion>a test size: 5000000 shuffled: 4805 sorted: 484 C:\Users\Leon\Desktop\testSetInsertion>a test size: 10000000 shuffled: 11537 sorted: 977 C:\Users\Leon\Desktop\testSetInsertion>a test size: 30000000 shuffled: 46239 sorted: 3076

Anyone explains this please? I guess that this has something to do with cache locality since when inserting sorted keys, rebalancing typically involves those nodes that were most recently inserted. But above is just my guess and I know very little about cache locality.

Jarod42 · Accepted Answer · 2021-02-23 11:31:46Z

5

If you look at https://en.cppreference.com/w/cpp/container/set/set

you can see:

Complexity
[..]
2) N log(N) where N = std::distance(first, last) in general, linear in N if the range is already sorted by value_comp().

We can use insert in loop with end() as hint which is amortized constant with correct hint.

answered Feb 23, 2021 at 11:31

Jarod42

227k15 gold badges214 silver badges349 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

Leon Cruz Over a year ago

But I still can't see why it is the case and what it has to do with hint. The ctor of set knows nothing about the sortedness of the keys.

Jarod42 Over a year ago

If it is sorted, there is only one check to do find the position of the new element (is last position ok). When wrong, binary search (so log(N)) is done to find position.

463035818_is_not_an_ai Over a year ago

@LeonCruz checking if a sequence is sorted is very cheap (O(N)) compared to actually sorting (O(N log(N))

Leon Cruz Over a year ago

@Jarod42 But the results were similar for my handwritten red-black tree, which always searches from the root when inserting keys. Also, I just found that unordered_set gives similar results, though the difference is smaller and the type int might play an important role in this case

Leon Cruz Over a year ago

@largest_prime_is_463035818 But what has it to do with checking sortedness?

|

Collectives™ on Stack Overflow

Why is inserting sorted keys into std::set so much faster than inserting shuffled keys?

1 Answer 1

8 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

8 Comments

Related