I am creating a program which needs to be ultra-fast. It is running some stuff on the GPU using CUDA and afterwards it does some calculations on the CPU. For this, I need to convert the highly optimized GPU-datastructure to something that I can easily use on the CPU. My data is basically a graph laid out in a grid. Currently I am using std::vector for the CPU part. Because I know there is quite an overhead if I do a lot of push_back()s and I at least know because I know how many vertices I have in my graph, I now use the following code for this:
new_graph.resize(blockSize * blockSize); for (unsigned long long y = 0; y < blockSize; y++) { for (unsigned long long x = 0; x < blockSize; x++) { int idx = y * blockSize + x; new_graph[idx] = Vertex(x, y); } } Afterwards I add the edges. Unfortunately I do not know how many edges per vertex I have, but I do know that it will never be bigger than 8. Therefore I reserve() 8 in each std::vector that I use for the edges.
However, this both seem to be extremely slow. If I use a normal array for the graph itself (so basically replacing the outer std::vector), the speed improvement in that part is enormous (like 10x or so).
For the graph this is doable, but for the edges not really, because I do some post-procsesing on these edges and for this I really need something like std::vector which is kinda dynamic (I add some edges).
Currently converting the data to std::vector's is something like 10 time slower than running my algorithm on the GPU (which is a smart MST algorithm). This is not really what I want, because now the overhead is way too big.
Does someone know what is going on or how I can fix this?
p.s. I compile with -O2, because I already found out that that can make a big difference. Also tried with -O3, no real difference.
Vertex is defined as follows:
struct Pos { int x, y; Pos() { x = 0; y = 0; } Pos(int x, int y) { this->x = x; this->y = y; } }; struct Vertex { Pos pos; bool hidden; unsigned long long newIdx; Vertex() { this->pos = Pos(); this->hidden = false; this->numEdges = 0; this->numRemovedEdges = 0; } Vertex(Pos &pos) { this->pos = pos; this->hidden = false; this->numEdges = 0; this->numRemovedEdges = 0; } Vertex(int x, int y) { this->pos = Pos(x, y); this->hidden = false; this->numEdges = 0; this->numRemovedEdges = 0; } int numEdges; int numRemovedEdges; std::vector<Edge> edges; std::vector<bool> removed; std::vector<bool> doNotWrite; };
-O3which will inline some functions (99.999% chance it will inlinepush_back, and if it does not then the implementation or compiler is a piece of crap).reserveinstead ofresizeand then usingpush_backinstead of[]will avoid redundant initialization performed byresize. I don't know if that's the cause of the 10x slowdown (I doubt it accounts for everything), but it should certainly help.