Copying a struct containing pointers to CUDA device

Question

I'm working on a project where I need my CUDA device to make computations on a struct containing pointers.

typedef struct StructA { int* arr; } StructA;

When I allocate memory for the struct and then copy it to the device, it will only copy the struct and not the content of the pointer. Right now I'm working around this by allocating the pointer first, then set the host struct to use that new pointer (which resides on the GPU). The following code sample describes this approach using the struct from above:

#define N 10 int main() { int h_arr[N] = {1,2,3,4,5,6,7,8,9,10}; StructA *h_a = (StructA*)malloc(sizeof(StructA)); StructA *d_a; int *d_arr; // 1. Allocate device struct. cudaMalloc((void**) &d_a, sizeof(StructA)); // 2. Allocate device pointer. cudaMalloc((void**) &(d_arr), sizeof(int)*N); // 3. Copy pointer content from host to device. cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice); // 4. Point to device pointer in host struct. h_a->arr = d_arr; // 5. Copy struct from host to device. cudaMemcpy(d_a, h_a, sizeof(StructA), cudaMemcpyHostToDevice); // 6. Call kernel. kernel<<<N,1>>>(d_a); // 7. Copy struct from device to host. cudaMemcpy(h_a, d_a, sizeof(StructA), cudaMemcpyDeviceToHost); // 8. Copy pointer from device to host. cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost); // 9. Point to host pointer in host struct. h_a->arr = h_arr; }

My question is: Is this the way to do it?

It seems like an awful lot of work, and I remind you that this is a very simple struct. If my struct contained a lot of pointers or structs with pointers themselves, the code for allocation and copy will be quite extensive and confusing.

Steps 7 and 9 are redundant, but otherwise that is pretty much how it is. As the answer below says, you are best served by avoiding complex, pointer based data structures on the GPU. The performance is on the GPU is worse, and the APIs really are not designed for it. — talonmies
– talonmies, Commented Feb 16, 2012 at 11:01
well h_a is (or should be) an "image" of the device structure held in host memory. Assigning it to hold a pointer in host memory is probably some combination of bad practice/wrong/device memory leak depending on what your true intentions are. After you have copied the contents of d_a back to h_a you have "come full circle" and are back where you started from. — talonmies
– talonmies, Commented Feb 16, 2012 at 11:57
But in order to copy the struct correctly to the device I must set the pointer of h_a to d_arr (step 4). So when I copy the data back, I also have to set the pointer in h_a to the array I just copied it to. I agree that step 7 is redundant in my example above because there is no other information held in the struct, but if there was that step wouldn't be redundant.. Or am I completely mistaken? — Thorkil Holm-Jacobsen
– Thorkil Holm-Jacobsen, Commented Feb 16, 2012 at 12:11
Thank you, tahatmat, for providing us with this pattern of copying structures across host and device memories back and forth. However I believe it's just worth mentioning a second way, which seems somewhat more consistent and helps to avoid implementing step 9. The specific of function cudaMemcpy() actually permits dereferencing device pointers in host code in such fashion: you skip step 4 and after copying h_a to d_a on step 5, you manually copy each device pointer address into d_a, like this: cudaMemcpy(&(d_a->arr), &(d_arr), sizeof(int*), cudaMemcpyHostToDevice). Again, "d_a->arr" is legit — vitrums
– vitrums, Commented Nov 22, 2013 at 18:49

harrism · Accepted Answer · 2014-01-21 05:29:14Z

Edit: CUDA 6 introduces Unified Memory, which makes this "deep copy" problem a lot easier. See this post for more details.

Don't forget that you can pass structures by value to kernels. This code works:

// pass struct by value (may not be efficient for complex structures) __global__ void kernel2(StructA in) { in.arr[threadIdx.x] *= 2; }

Doing so means you only have to copy the array to the device, not the structure:

int h_arr[N] = {1,2,3,4,5,6,7,8,9,10}; StructA h_a; int *d_arr; // 1. Allocate device array. cudaMalloc((void**) &(d_arr), sizeof(int)*N); // 2. Copy array contents from host to device. cudaMemcpy(d_arr, h_arr, sizeof(int)*N, cudaMemcpyHostToDevice); // 3. Point to device pointer in host struct. h_a.arr = d_arr; // 4. Call kernel with host struct as argument kernel2<<<N,1>>>(h_a); // 5. Copy pointer from device to host. cudaMemcpy(h_arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost); // 6. Point to host pointer in host struct // (or do something else with it if this is not needed) h_a.arr = h_arr;

Vitality · Accepted Answer · 2017-04-10 17:19:37Z

As pointed out by Mark Harris, structures can be passed by values to CUDA kernels. However, some care should be devoted to set up a proper destructor since the destructor is called at exit from the kernel.

Consider the following example

#include <stdio.h> #include "Utilities.cuh" #define NUMBLOCKS 512 #define NUMTHREADS 512 * 2 /***************/ /* TEST STRUCT */ /***************/ struct Lock { int *d_state; // --- Constructor Lock(void) { int h_state = 0; // --- Host side lock state initializer gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int))); // --- Allocate device side lock state gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state } // --- Destructor (wrong version) //~Lock(void) { // printf("Calling destructor\n"); // gpuErrchk(cudaFree(d_state)); //} // --- Destructor (correct version) // __host__ __device__ ~Lock(void) { //#if !defined(__CUDACC__) // gpuErrchk(cudaFree(d_state)); //#else // //#endif // } // --- Lock function __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); } // --- Unlock function __device__ void unlock(void) { atomicExch(d_state, 0); } }; /**********************************/ /* BLOCK COUNTER KERNEL WITH LOCK */ /**********************************/ __global__ void blockCounterLocked(Lock lock, int *nblocks) { if (threadIdx.x == 0) { lock.lock(); *nblocks = *nblocks + 1; lock.unlock(); } } /********/ /* MAIN */ /********/ int main(){ int h_counting, *d_counting; Lock lock; gpuErrchk(cudaMalloc(&d_counting, sizeof(int))); // --- Locked case h_counting = 0; gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice)); blockCounterLocked << <NUMBLOCKS, NUMTHREADS >> >(lock, d_counting); gpuErrchk(cudaPeekAtLastError()); gpuErrchk(cudaDeviceSynchronize()); gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost)); printf("Counting in the locked case: %i\n", h_counting); gpuErrchk(cudaFree(d_counting)); }

with the uncommented destructor (do not pay too much attention on what the code actually does). If you run that code, you will receive the following output

Calling destructor Counting in the locked case: 512 Calling destructor GPUassert: invalid device pointer D:/Project/passStructToKernel/passClassToKernel/Utilities.cu 37

There are then two calls to the destructor, once at the kernel exit and once at the main exit. The error message is related to the fact that, if the memory locations pointed to by d_state are freed at the kernel exit, they cannot be freed anymore at the main exit. Accordingly, the destructor must be different for host and device executions. This is accomplished by the commented destructor in the above code.

martiert · Accepted Answer · 2012-02-16 11:31:26Z

-3

struct of arrays is a nightmare in cuda. You will have to copy each of the pointer to a new struct which the device can use. Maybe you instead could use an array of structs? If not the only way I have found is to attack it the way you do, which is in no way pretty.

EDIT: since I can't give comments on the top post: Step 9 is redundant, since you can change step 8 and 9 into

// 8. Copy pointer from device to host. cudaMemcpy(h->arr, d_arr, sizeof(int)*N, cudaMemcpyDeviceToHost);

edited Feb 16, 2012 at 11:31

answered Feb 16, 2012 at 10:27

martiert

7414 silver badges13 bronze badges

4 Comments

harrism Over a year ago

First, this answer is dangerous because it goes against the standard wisdom about AOS/SOA in parallel computing. Structure of Arrays (SOA) is preferable over Array of Structures (AOS) in all parallel computing, including multicore CPUs with SSE/AVX instruction sets. The reason is that SOA maintains locality of reference across threads (e.g. adjacent elements of d_a.arr are accessed by adjacent threads which are running concurrently). A structure with a pointer in it is NOT the same thing as Structure of Arrays. Second, you can simplify this code by passing the structure by value.

BugShotGG Over a year ago

@harrism Why is Array of Structs not preferable in cuda? I dont understand this, can you give me an example or a link? Thanks

Robert Crovella Over a year ago

@GeoPapas here is a question/answer that discusses SOA vs. AOS with examples.

BugShotGG Over a year ago

@RobertCrovella Thanks for the reply Robert but I have already made a question about it Here and the answers were pretty clear. :)

Collectives™ on Stack Overflow

Copying a struct containing pointers to CUDA device

3 Answers 3

Comments

Comments

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

4 Comments

Linked

Related