5

I was doing some quick benchmark tests involving a std::vector. I start with a relatively small vector of 100 ints and call various functions that fill it with 1,000,000 ints. Most of them either clear the vector and add the elements again, or create a new vector and move or swap it into the original. One function simply resizes the vector and overwrites the elements.

You can see the functions in the code below. What's interesting is that resizing the vector and overwriting the elements is by far the quickest. I had expected that reserving the memory before pushing the elements would give comparable performance.

I know that std::vector::resize() will resize the vector to contain the new count. According to cppreference:

If the current size is less than count, additional elements are appended and initialized with copies of value.

So resize() should be constructing 100 fewer ints than the other functions, and I'm surprised by the difference in speed. I thought resize() would allocate and value-initialize the new elements, while reserve() would just allocate the memory.

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <numeric>   // std::iota (missing in the original listing)
    #include <vector>    // std::vector (missing in the original listing)

    constexpr int InitialSize = 100;
    constexpr int NewSize = 1000000;

    void overwrite(std::vector<int>& v) {
        v.resize(NewSize);
        for (int i = 0; i < NewSize; ++i) {
            v[i] = i;
        }
    }

    void clear(std::vector<int>& v) {
        v.clear();
        v.reserve(NewSize);
        for (int i = 0; i < NewSize; ++i) {
            v.push_back(i);
        }
    }

    void swap(std::vector<int>& v) {
        std::vector<int> vnew;
        vnew.reserve(NewSize);
        for (int i = 0; i < NewSize; ++i) {
            vnew.push_back(i);
        }
        v.swap(vnew);
    }

    void move(std::vector<int>& v) {
        std::vector<int> vnew;
        vnew.reserve(NewSize);
        for (int i = 0; i < NewSize; ++i) {
            vnew.push_back(i);
        }
        v = std::move(vnew);
    }

    int main() {
        {
            std::vector<int> v(InitialSize);
            std::iota(v.begin(), v.end(), 1);
            auto start = std::chrono::high_resolution_clock::now();
            move(v);
            auto finish = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, std::milli> elapsed = finish - start;
            std::cout << "Move - elapsed time: " << elapsed.count() << " ms" << std::endl;
        }
        {
            std::vector<int> v(InitialSize);
            std::iota(v.begin(), v.end(), 1);
            auto start = std::chrono::high_resolution_clock::now();
            clear(v);
            auto finish = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, std::milli> elapsed = finish - start;
            std::cout << "Clear - elapsed time: " << elapsed.count() << " ms" << std::endl;
        }
        {
            std::vector<int> v(InitialSize);
            std::iota(v.begin(), v.end(), 1);
            auto start = std::chrono::high_resolution_clock::now();
            swap(v);
            auto finish = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, std::milli> elapsed = finish - start;
            std::cout << "Swap - elapsed time: " << elapsed.count() << " ms" << std::endl;
        }
        {
            std::vector<int> v(InitialSize);
            std::iota(v.begin(), v.end(), 1);
            auto start = std::chrono::high_resolution_clock::now();
            overwrite(v);
            auto finish = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double, std::milli> elapsed = finish - start;
            std::cout << "Overwrite - elapsed time: " << elapsed.count() << " ms" << std::endl;
        }
        return 0;
    }

Output:

    Move - elapsed time: 14.6284 ms
    Clear - elapsed time: 17.5072 ms
    Swap - elapsed time: 12.9111 ms
    Overwrite - elapsed time: 5.19079 ms

LIVE

Quick Bench results.

Can someone explain what's going on here?

10
  • It doesn't have to update its size member variable if you've resized first. That could have a minor effect. Commented Mar 7, 2020 at 23:16
  • 1
    resize can use an AVX-optimized memset to initialize the members more efficiently and in batches, while push_back will initialize them one by one. Commented Mar 7, 2020 at 23:22
  • I don't see any compilation flags. Is it optimized by default on godbolt? Commented Mar 7, 2020 at 23:25
  • 1
    You did compile with optimizations on, right? (ala -O3) ? Commented Mar 7, 2020 at 23:26
  • 1
    You are only testing int. A non-trivial type will trigger different optimisation paths and yield wildly different results, I suspect. Also, always repeat your timing experiments many times to gain confidence in the results. As a rule of thumb, you shouldn't run the experiment fewer than 20 times. And you should warm up the test as suggested in other comments here. Commented Mar 8, 2020 at 0:13

2 Answers

7

push_back is a costlier operation than index-based access, even if allocation has been taken care of beforehand by reserve.

  1. push_back must update the end pointer so that the vector's size can be computed correctly.
  2. push_back must check whether reallocation is needed; this is essentially a branch on every call.
  3. push_back copies (or moves) the value being appended. For an int, this shouldn't make a measurable difference.

If you look at the generated assembly (taken from the godbolt link given in the question), the index operation is a branchless sequence of a few mov and shift instructions, while push_back is much more involved. In a long-running loop (1,000,000 iterations in the given example), this difference matters. The compiler optimization level can definitely affect the size of the difference.

For index operator []

    push rbp
    mov rbp, rsp
    mov QWORD PTR [rbp-8], rdi
    mov QWORD PTR [rbp-16], rsi
    mov rax, QWORD PTR [rbp-8]
    mov rax, QWORD PTR [rax]
    mov rdx, QWORD PTR [rbp-16]
    sal rdx, 2
    add rax, rdx
    pop rbp
    ret

For push_back

    push rbp
    mov rbp, rsp
    sub rsp, 16
    mov QWORD PTR [rbp-8], rdi
    mov QWORD PTR [rbp-16], rsi
    mov rax, QWORD PTR [rbp-8]
    mov rdx, QWORD PTR [rax+8]
    mov rax, QWORD PTR [rbp-8]
    mov rax, QWORD PTR [rax+16]
    cmp rdx, rax
    je .L73
    mov rax, QWORD PTR [rbp-8]   // When allocation is not needed
    mov rcx, QWORD PTR [rax+8]
    mov rax, QWORD PTR [rbp-8]
    mov rdx, QWORD PTR [rbp-16]
    mov rsi, rcx
    mov rdi, rax
    call void std::allocator_traits<std::allocator<int> >::construct<int, int const&>(std::allocator<int>&, int*, int const&)
    mov rax, QWORD PTR [rbp-8]
    mov rax, QWORD PTR [rax+8]
    lea rdx, [rax+4]
    mov rax, QWORD PTR [rbp-8]
    mov QWORD PTR [rax+8], rdx
    jmp .L75
    .L73:                        // When allocation is needed
    mov rax, QWORD PTR [rbp-8]
    mov rdi, rax
    call std::vector<int, std::allocator<int> >::end()
    mov rcx, rax
    mov rdx, QWORD PTR [rbp-16]
    mov rax, QWORD PTR [rbp-8]
    mov rsi, rcx
    mov rdi, rax
    .L75:
    nop
    leave
    ret


2

Can someone explain what's going on here?

overwrite is fundamentally different from the others because it never calls push_back, which has to check for reallocation and therefore makes the loop far more complex.

The other three are basically equivalent (up to constant-time differences) and will perform differently depending on optimizations, how well the compiler does its job, and the standard library implementation.

If you are very lucky, the optimizer may be able to see that the reallocation will never happen and generate code like overwrite's.

3 Comments

Yes. push_back() does have to check size < capacity, and otherwise do a reallocation. This is the main difference between overwrite() and the other methods. I was trying to avoid the reallocation by setting the capacity beforehand, yet I would have thought the size check would be a minor difference.
@jignatius The point is that even if the reallocation does not happen at runtime, the optimizer still sees all the code that handles it and may not be able to simplify the loop into the equivalent of overwrite's trivial loop. And if the optimizer doesn't arrive at the simplified loop, then it cannot apply the rest of the optimizations either.
Even if branch prediction is perfect or the branch is optimized away, push_back() of an int essentially involves two assignments, i.e. it should be roughly twice as expensive: it needs to update the end pointer (or size member)! (Of course a compiler could optimize that away as well; in fact, it could do almost nothing at all except print the times, since none of the data being shuffled around has an observable effect. That's the trouble with benchmarks.)
