3

I was accidentally surprised to found that inserting sorted keys into std::set is much much faster than inserting shuffled keys. This is somewhat counterintuitive since a red-black tree (I verified that std::set is implemented as a red-black tree on my system) as a self-balanced binary search tree, would need to do a lot of rebalancing opeartions to insert a sequence of sorted keys, thus inserting sorted keys should take more time than inserting shuffled keys.

But the fact is, inserting sorted keys can be up to 15 times faster than inserting shuffled keys! Here is my testing code and some results:

#include <algorithm> #include <chrono> #include <iostream> #include <random> #include <set> #include <vector> using namespace std; int64_t insertion_time(const vector<int> &keys) { auto start = chrono::system_clock::now(); set<int>(keys.begin(), keys.end()); auto stop = chrono::system_clock::now(); auto elapsed = chrono::duration_cast<chrono::milliseconds>(stop - start); return elapsed.count(); } int main() { size_t test_size; cout << "test size: "; cin >> test_size; vector<int> keys(test_size); for (int i = 0; i < test_size; ++i) { keys[i] = i; } // whether shuffled case or sorted case took first was irrelevant and results were similar auto rng = std::default_random_engine {}; shuffle(keys.begin(), keys.end(), rng); cout << "shuffled: " << insertion_time(keys) << endl; sort(keys.begin(), keys.end()); cout << "sorted: " << insertion_time(keys) << endl; return 0; } 
// i7 8700, 32 GB RAM, WIN10 2004, g++ -O3 main.cpp // An interesting observation is that the difference becomes larger as test_size being larger. // Similar results showed up for my handwritten red-black tree and other // machines( or other compilers, operating systems etc) C:\Users\Leon\Desktop\testSetInsertion>a test size: 1000000 shuffled: 585 sorted: 96 C:\Users\Leon\Desktop\testSetInsertion>a test size: 3000000 shuffled: 2480 sorted: 296 C:\Users\Leon\Desktop\testSetInsertion>a test size: 5000000 shuffled: 4805 sorted: 484 C:\Users\Leon\Desktop\testSetInsertion>a test size: 10000000 shuffled: 11537 sorted: 977 C:\Users\Leon\Desktop\testSetInsertion>a test size: 30000000 shuffled: 46239 sorted: 3076 

Anyone explains this please? I guess that this has something to do with cache locality since when inserting sorted keys, rebalancing typically involves those nodes that were most recently inserted. But above is just my guess and I know very little about cache locality.

1 Answer 1

5

If you look at https://en.cppreference.com/w/cpp/container/set/set

you can see:

Complexity
[..]
2) N log(N) where N = std::distance(first, last) in general, linear in N if the range is already sorted by value_comp().

We can use insert in loop with end() as hint which is amortized constant with correct hint.

Sign up to request clarification or add additional context in comments.

8 Comments

But I still can't see why it is the case and what it has to do with hint. The ctor of set knows nothing about the sortedness of the keys.
If it is sorted, there is only one check to do find the position of the new element (is last position ok). When wrong, binary search (so log(N)) is done to find position.
@LeonCruz checking if a sequence is sorted is very cheap (O(N)) compared to actually sorting (O(N log(N))
@Jarod42 But the results were similar for my handwritten red-black tree, which always searches from the root when inserting keys. Also, I just found that unordered_set gives similar results, though the difference is smaller and the type int might play an important role in this case
@largest_prime_is_463035818 But what has it to do with checking sortedness?
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.