
I want to run the following simple code on two GPUs simultaneously. I have an array A[i] = [0 1 2 3 4 5 6 7 8 9] and want to calculate C[i] = A[i+1] + A[i] + A[i-1]. This is the result I get: C[i] = [1 3 6 9 7 11 18 21 24 17], where the values at i = 0, 4, 5, and 9 (that is, 1, 7, 11, and 17) are wrong. With two devices, computing C[4] on the first device needs to access A[5], which lives on the second device. How can I do this in the simplest way?

Programming is not my area of expertise, and I am supposed to use multiple GPUs to solve a PDE. So I would really appreciate any help modifying this code for my problem.

Thank you.

```cuda
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <time.h>

__global__ void iKernel(float *A, float *C, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i - 1] + A[i] + A[i + 1];
}

int main(int argc, char **argv)
{
    int ngpus;
    printf("> starting %s", argv[0]);
    cudaGetDeviceCount(&ngpus);
    printf(" CUDA-capable devices: %i\n", ngpus);
    ngpus = 2;

    int size = 10;
    int iSize = size / ngpus;
    size_t iBytes = iSize * sizeof(float);
    printf("> total array size %d M, using %d devices with each device "
           "handling %d M\n", size / 1024 / 1024, ngpus, iSize / 1024 / 1024);

    // allocate device memory
    float **d_A = (float **)malloc(sizeof(float *) * ngpus);
    float **d_C = (float **)malloc(sizeof(float *) * ngpus);
    float **h_A = (float **)malloc(sizeof(float *) * ngpus);
    float **gpuRef = (float **)malloc(sizeof(float *) * ngpus);
    cudaStream_t *stream = (cudaStream_t *)malloc(sizeof(cudaStream_t) * ngpus);

    for (int i = 0; i < ngpus; i++) {
        // set current device
        cudaSetDevice(i);

        // allocate device memory
        cudaMalloc((void **)&d_A[i], iBytes);
        cudaMalloc((void **)&d_C[i], iBytes);

        // allocate page locked host memory for asynchronous data transfer
        cudaMallocHost((void **)&h_A[i], iBytes);
        cudaMallocHost((void **)&gpuRef[i], iBytes);

        // create streams for timing and synchronizing
        cudaStreamCreate(&stream[i]);
    }

    dim3 block(512);
    dim3 grid((iSize + block.x - 1) / block.x);

    // h_A[ngpus][index]
    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        for (int j = 0; j < iSize; j++) {
            h_A[i][j] = j + i * iSize;
            printf("%d %d %d %0.8f \n", i, j, iSize, h_A[i][j]);
        }
    }

    // record start time
    double iStart = clock();

    // distributing the workload across multiple devices
    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);

        cudaMemcpyAsync(d_A[i], h_A[i], iBytes, cudaMemcpyHostToDevice, stream[i]);
        iKernel<<<grid, block, 0, stream[i]>>>(d_A[i], d_C[i], iSize);
        cudaMemcpyAsync(gpuRef[i], d_C[i], iBytes, cudaMemcpyDeviceToHost, stream[i]);
    }

    // synchronize streams
    for (int i = 0; i < ngpus; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }

    for (int i = 0; i < ngpus; i++) {
        for (int j = 0; j < iSize; j++) {
            printf("%d %d %0.8f \n", i, j, gpuRef[i][j]);
        }
    }

    return EXIT_SUCCESS;
}
```
  • In your kernel, what do you expect the behavior to be when i = 0? What do you expect the code to do when accessing A[i-1] if i = 0? To be clear, I don't think the question I am asking has much to do with CUDA, and I'm not suggesting it is the only issue with your code. Another thing I would suggest is that you use proper CUDA error checking and also run your code with cuda-memcheck. The error output may be instructive for you, and useful for those who are trying to help you. Finally, you may wish to review the CUDA simpleMultiGPU sample code. Commented Aug 4, 2020 at 20:04
  • Robert, thank you for your quick reply. I know the kernel needs to be modified for i = 0 and i = 9, but my main problem is C[4] and C[5], which sit on the boundary between the two devices. Commented Aug 4, 2020 at 20:10
  • As described in the answer, you have a few options: 1. Use pinned allocations instead of cudaMalloc. 2. Use managed memory. 3. Explicitly copy the boundary regions between GPUs in between kernel launches. 4. If the system topology supports it, put both GPUs into a peer relationship; then the kernel on one GPU can read directly from the memory of the other GPU over the intervening bus (PCIe or NVLink). A minimal sketch of option 4 appears after these comments. Commented Aug 4, 2020 at 20:12
  • This may be of interest. Commented Aug 4, 2020 at 20:24
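For option 4 from the comments, here is a minimal sketch of what enabling peer access could look like, assuming devices 0 and 1 and a topology that supports it (everything here is illustrative, not code from the question):

```cuda
// Hypothetical sketch: enable bidirectional peer access between devices
// 0 and 1, so a kernel on one device can dereference a pointer that was
// allocated with cudaMalloc on the other device.
int canAccess01 = 0, canAccess10 = 0;
cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
if (canAccess01 && canAccess10) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    // From here on, a kernel running on device 0 could be handed d_A[1]
    // to fetch the single neighbor element it needs across the bus.
}
```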

1 Answer


You have to upload the overlap regions to both devices. You can't (easily) read values from another device, so you have to duplicate and pad at least some of the input values as required. An input of iSize elements is obviously not enough when the kernel accesses iSize + 2 distinct input values.
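A minimal sketch of that padding, assuming a flat host array h_flat of length size (the question stores per-device chunks in h_A[i] instead) and reusing iSize, d_A, d_C, and stream[] from the question; zeroing the halos reproduces the implicit zero boundary at the global edges:

```cuda
// Sketch only: give each device iSize + 2 input cells, one halo per side.
int paddedSize = iSize + 2;
size_t paddedBytes = paddedSize * sizeof(float);

for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);
    cudaMalloc((void **)&d_A[i], paddedBytes);   // input is now padded
    cudaMalloc((void **)&d_C[i], iBytes);        // output stays iSize wide
    cudaMemsetAsync(d_A[i], 0, paddedBytes, stream[i]);  // zero the halos

    // Copy this device's slice plus one neighbor element on each side,
    // clamped at the global edges where no neighbor exists.
    int first = i * iSize - 1;                   // global index of left halo
    int lo = (first < 0) ? 0 : first;
    int hi = (i + 1) * iSize + 1;
    if (hi > size) hi = size;
    cudaMemcpyAsync(d_A[i] + (lo - first), h_flat + lo,
                    (hi - lo) * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);
}
```

The kernel then indexes the padded buffer so it never reads out of bounds: A[0] is the left halo, A[1..N] the interior, and A[N+1] the right halo.

```cuda
__global__ void iKernel(const float *A, float *C, const int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + A[i + 1] + A[i + 2];  // shifted by the halo
}
```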

If this were a multi-pass algorithm, you would need to explicitly copy the relevant boundary regions between devices in between passes.
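For a time-stepped PDE loop, the per-pass exchange could look something like this sketch, which reuses the padded d_A layout from above together with cudaMemcpyPeerAsync (this API works even without peer access enabled; the driver then stages the transfer through host memory):

```cuda
// Sketch: exchange one boundary element in each direction per pass.
// Padded layout per device: [0] left halo | [1..iSize] interior
//                           | [iSize+1] right halo.
for (int i = 0; i + 1 < ngpus; i++) {
    // last interior cell of device i -> left halo of device i+1
    cudaMemcpyPeerAsync(d_A[i + 1], i + 1,        // dst pointer, dst device
                        d_A[i] + iSize, i,        // src pointer, src device
                        sizeof(float), stream[i]);
    // first interior cell of device i+1 -> right halo of device i
    cudaMemcpyPeerAsync(d_A[i] + iSize + 1, i,
                        d_A[i + 1] + 1, i + 1,
                        sizeof(float), stream[i + 1]);
}
// the exchange must finish before the next pass's kernels launch
for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(stream[i]);
}
```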

Try modeling the data dependencies formally on paper when targeting multi-GPU systems.

Both GPUs can access memory allocated with cudaMallocHost, but it is usually not advisable to keep a kernel's working set in that memory type: performance over the PCIe bus is poor compared to device-local memory. There is also driver-managed (unified) memory, but that isn't well suited to two GPUs sharing the same active working set either.
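For completeness, a sketch of that zero-copy variant with the question's original kernel, assuming a 64-bit platform with unified addressing so that cudaMallocHost memory can be dereferenced directly from kernels; the inter-device boundary works without any copies, but every read crosses the bus, and the global edges still need the i = 0 / i = 9 fix from the comments:

```cuda
// Sketch only: one pinned host buffer shared by all devices (zero-copy).
float *h_A, *h_C;
cudaMallocHost((void **)&h_A, size * sizeof(float));
cudaMallocHost((void **)&h_C, size * sizeof(float));
for (int j = 0; j < size; j++) h_A[j] = (float)j;

for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);
    // Each device works on its own half of one shared allocation, so the
    // neighbor element at the device boundary is simply visible. Reads at
    // the global edges are still out of bounds (the question's bug).
    iKernel<<<grid, block, 0, stream[i]>>>(h_A + i * iSize,
                                           h_C + i * iSize, iSize);
}
for (int i = 0; i < ngpus; i++) {
    cudaSetDevice(i);
    cudaStreamSynchronize(stream[i]);
}
```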
