std::vector vs normal array

Question

I am creating a program which needs to be ultra-fast. It runs some work on the GPU using CUDA and afterwards does some calculations on the CPU. For this, I need to convert the highly optimized GPU data structure to something that I can easily use on the CPU. My data is basically a graph laid out in a grid. Currently I am using std::vector for the CPU part. Because I know there is quite an overhead if I do a lot of push_back()s, and I at least know how many vertices I have in my graph, I now use the following code for this:

    new_graph.resize(blockSize * blockSize);
    for (unsigned long long y = 0; y < blockSize; y++) {
        for (unsigned long long x = 0; x < blockSize; x++) {
            int idx = y * blockSize + x;
            new_graph[idx] = Vertex(x, y);
        }
    }

Afterwards I add the edges. Unfortunately I do not know how many edges per vertex I have, but I do know that it will never be bigger than 8. Therefore I reserve() 8 in each std::vector that I use for the edges.

However, both of these seem to be extremely slow. If I use a normal array for the graph itself (so basically replacing the outer std::vector), the speed improvement in that part is enormous (about 10x).

For the graph this is doable, but for the edges not really, because I do some post-processing on these edges, and for this I really need something dynamic like std::vector (I add some edges).

Currently, converting the data to std::vectors is about 10 times slower than running my algorithm on the GPU (which is a smart MST algorithm). This is not really what I want, because the overhead is now way too big.

Does someone know what is going on or how I can fix this?

p.s. I compile with -O2, because I already found out that that can make a big difference. Also tried with -O3, no real difference.

Vertex is defined as follows:

    struct Pos {
        int x, y;
        Pos() {
            x = 0;
            y = 0;
        }
        Pos(int x, int y) {
            this->x = x;
            this->y = y;
        }
    };

    struct Vertex {
        Pos pos;
        bool hidden;
        unsigned long long newIdx;
        Vertex() {
            this->pos = Pos();
            this->hidden = false;
            this->numEdges = 0;
            this->numRemovedEdges = 0;
        }
        Vertex(Pos &pos) {
            this->pos = pos;
            this->hidden = false;
            this->numEdges = 0;
            this->numRemovedEdges = 0;
        }
        Vertex(int x, int y) {
            this->pos = Pos(x, y);
            this->hidden = false;
            this->numEdges = 0;
            this->numRemovedEdges = 0;
        }
        int numEdges;
        int numRemovedEdges;
        std::vector<Edge> edges;
        std::vector<bool> removed;
        std::vector<bool> doNotWrite;
    };

Try compiling with -O3, which will inline some functions (99.999% chance it will inline push_back, and if it does not then the implementation or compiler is a piece of crap).
– user1203803, Commented Apr 4, 2012 at 14:33
Calling reserve instead of resize and then using push_back instead of [] will avoid the redundant initialization performed by resize. I don't know if that's the cause of the 10x slowdown (I doubt it accounts for everything), but it should certainly help.
– R. Martinho Fernandes, Commented Apr 4, 2012 at 14:41
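
The reserve-then-append suggestion in the comment above could look roughly like the sketch below. GridVertex and buildGrid are hypothetical names introduced only for illustration; GridVertex is a reduced stand-in for the question's Vertex. The sketch uses emplace_back rather than push_back so that each element is constructed in place, which is a small variation on what the comment proposes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's Vertex, reduced to its position.
struct GridVertex {
    int x, y;
    GridVertex(int x_, int y_) : x(x_), y(y_) {}
};

// Build a blockSize x blockSize grid in row-major order.
std::vector<GridVertex> buildGrid(std::size_t blockSize) {
    std::vector<GridVertex> graph;
    // reserve() allocates once and constructs nothing, unlike resize().
    graph.reserve(blockSize * blockSize);
    for (std::size_t y = 0; y < blockSize; ++y) {
        for (std::size_t x = 0; x < blockSize; ++x) {
            // Each element is constructed in place, exactly once.
            graph.emplace_back(static_cast<int>(x), static_cast<int>(y));
        }
    }
    assert(graph.size() == blockSize * blockSize);
    return graph;
}
```

With resize(), every element is first default-constructed and then overwritten by assignment; here each element is built a single time. Whether this matters much in practice depends on how expensive the default constructor and assignment are for the real Vertex.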

Branko Dimitrijevic · Accepted Answer · 2012-04-04 17:21:55Z

Perhaps you are paying for a dynamic memory allocation that vector does to reserve the space for its elements?

Even if you reserve optimally, you'll have at least 3 memory allocations for each and every Vertex (one for edges, one for removed and one for doNotWrite). Dynamic memory allocation is potentially expensive relative to high-performance stuff you are trying to do here.

Either use plain old arrays that are guaranteed to be large enough (potentially wasting space), or a specialized memory allocator together with vector, tailored to your specific needs.
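
As one possible reading of "a specialized memory allocator together with vector", the sketch below uses the C++17 polymorphic allocators (std::pmr), which did not exist when this answer was written. Edge, VertexEdges and makeEdgeLists are hypothetical names; the question does not show its Edge definition, so the fields here are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical edge type; the question does not show its definition.
struct Edge {
    int target;
    float weight;
};

// Hypothetical per-vertex edge list whose storage comes from a shared arena.
struct VertexEdges {
    std::pmr::vector<Edge> edges;
    explicit VertexEdges(std::pmr::memory_resource* arena) : edges(arena) {
        edges.reserve(8);  // the question states at most 8 edges per vertex
    }
};

// Build edge lists for vertexCount vertices; every reserve() draws from `arena`
// instead of issuing its own request to the global heap.
std::vector<VertexEdges> makeEdgeLists(std::size_t vertexCount,
                                       std::pmr::memory_resource* arena) {
    assert(arena != nullptr);
    std::vector<VertexEdges> lists;
    lists.reserve(vertexCount);
    for (std::size_t i = 0; i < vertexCount; ++i) {
        lists.emplace_back(arena);
    }
    return lists;
}
```

A caller might pass a std::pmr::monotonic_buffer_resource, which hands out memory by bumping a pointer and releases everything only when the resource itself is destroyed. The resource must outlive the vectors that use it, and any edge list that grows past its reserved capacity leaves its old block unused inside the arena.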

Also, do you access the elements in memory order? Your example seems to suggest so, but do you do it in all cases?

Also, do you even need Vertex.pos? Can't it be inferred from the Vertex's position in the grid?

I am now working on plain old arrays; I think that will make a difference. I don't always access them in order, and Vertex.pos is necessary because I later remove nodes from my structure, so then I cannot use the grid's position anymore.
In the end I decided to create my own array, which improved the speed.

Greg Smith · Accepted Answer · 2012-04-05 03:17:51Z

The CPU data structure is extremely inefficient due to the number of dynamic memory allocations, unnecessary assignment operations, and overall size of each Vertex. Before considering optimizing this structure it would be good to understand the data flow between the CPU data structures and the GPU data structures as conversion between the two formats is likely to take a lot of time. This begs the question, why is the GPU structure not used on the CPU side?

If you were only looking at this from the CPU side and you want to maintain an AoS data structure, then:
1. Simplify the Vertex data structure.
2. Remove all dynamic memory allocation. Each std::vector will do a dynamic allocation.
3. Replace removed and doNotWrite with std::bitset<8>.
4. Remove numRemovedEdges. This is removed.count().
5. If Edge is small then you may find it faster to declare Edge edges[8].
6. If you decide to stay with vector then consider using a pool allocator.
7. Reorder the data elements in Vertex by size to reduce sizeof(Vertex).
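
A minimal sketch of what items 2 to 5 and 7 might produce is shown below. CompactVertex, kMaxEdges and addEdge are hypothetical names, and the Edge fields are assumptions because the question does not show that type. The sketch also assumes the limit of 8 edges per vertex still holds after the post-processing step that adds edges.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical edge type; the question does not show its definition.
struct Edge {
    int target;
    float weight;
};

struct Pos {
    int x = 0;
    int y = 0;
};

// Sketch of a Vertex that performs no heap allocation.
struct CompactVertex {
    static constexpr std::size_t kMaxEdges = 8;  // upper bound stated in the question

    unsigned long long newIdx = 0;      // widest scalar first to limit padding
    Edge edges[kMaxEdges] = {};         // fixed storage instead of std::vector<Edge>
    Pos pos;
    int numEdges = 0;
    std::bitset<kMaxEdges> removed;     // replaces std::vector<bool> removed
    std::bitset<kMaxEdges> doNotWrite;  // replaces std::vector<bool> doNotWrite
    bool hidden = false;

    // Appends an edge; returns false when the vertex is already full.
    bool addEdge(const Edge& e) {
        if (numEdges >= static_cast<int>(kMaxEdges)) return false;
        edges[numEdges++] = e;
        return true;
    }

    // numRemovedEdges is derived from the bitset rather than stored.
    std::size_t numRemovedEdges() const { return removed.count(); }
};
```

Because the struct owns no heap memory, a std::vector<CompactVertex> or a plain array of them needs a single allocation for the whole graph. The trade-off is that every vertex pays for eight edge slots even when it uses fewer.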

All of these recommendations are very likely not the best solution for sharing the data with a GPU. If you do use a pool allocator and you use UVA (CUDA Linux) you can simply copy the data to the GPU with a single memory copy.

Alex Z · Accepted Answer · 2012-04-05 04:58:55Z

There is another solution I used recently in a similar situation. The LLVM package has a SmallVector class. It provides an interface quite similar to std::vector, but it keeps a fixed number of elements inline (so unless the vector grows beyond that initial limit, no additional memory allocations occur). If a SmallVector grows beyond that initial size, a memory block is allocated and all items are moved there, all in one transparent step.

A few things that I had to fix in this SmallVector:

The smallest number of items that can be stored in place is 2, so when 1 item is used in e.g. 99.99% of cases there is quite an overhead.
The usual use of swap() to free memory ( SmallVector().swap(vec) ) does not free memory, so I had to implement it myself.

Just look up the latest version of LLVM for the source code of the SmallVector class.
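
To illustrate the inline-storage idea without depending on LLVM, here is a deliberately simplified sketch. InlineVec is a hypothetical class, not llvm::SmallVector: it keeps the first N elements inside the object and sends later ones to a std::vector, whereas SmallVector moves all items to the heap block once it outgrows its inline capacity. It also requires T to be default-constructible.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of inline storage; this is NOT llvm::SmallVector.
// The first N elements live inside the object; later ones spill to the heap.
template <typename T, std::size_t N>
class InlineVec {
public:
    void push_back(const T& value) {
        if (size_ < N) {
            inline_[size_] = value;      // no allocation while within capacity
        } else {
            overflow_.push_back(value);  // heap is touched only past N elements
        }
        ++size_;
    }

    T& operator[](std::size_t i) {
        assert(i < size_);
        return i < N ? inline_[i] : overflow_[i - N];
    }

    std::size_t size() const { return size_; }

    // True once at least one element has spilled to the heap.
    bool usesHeap() const { return size_ > N; }

private:
    std::array<T, N> inline_{};
    std::vector<T> overflow_;
    std::size_t size_ = 0;
};
```

For the question's edge lists, something like InlineVec<Edge, 8> would avoid a heap allocation for every vertex that stays within the stated limit of 8 edges, while still allowing the occasional vertex to grow further.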

Roel · Accepted Answer · 2012-04-04 16:09:31Z

Can't you create one Vertex object, memcpy the x and y values into it (so that you don't have to call the constructor on each loop iteration), then memcpy the whole Vertex into your std::vector? The vector's memory is guaranteed to be laid out like a regular array, so you can bypass all the abstraction and manipulate memory directly. No need for complicated stuff. Also, maybe you can lay out the data you get back from the GPU in such a way that you can memcpy whole blocks at once, saving you even more.
