The code below calculates the dot product of two vectors a and b. The correct result is 8192. When I run it for the first time the result is correct. Then when I run it for the second time the result is the previous result + 8192 and so on:

1st iteration: result = 8192
2nd iteration: result = 8192 + 8192
3rd iteration: result = 8192 + 8192 + 8192
and so on.

I checked by printing it on screen, and the device variable dev_c is not freed. What's more, writing to it behaves like a sum: the result is the previous value plus the new one being written. I guess this could have something to do with the atomicAdd() operation, but cudaFree(dev_c) should erase it anyway.

    #define N 8192
    #define THREADS_PER_BLOCK 512
    #define NUMBER_OF_BLOCKS (N/THREADS_PER_BLOCK)

    #include <stdio.h>

    __global__ void dot( int *a, int *b, int *c )
    {
        __shared__ int temp[THREADS_PER_BLOCK];
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        temp[threadIdx.x] = a[index] * b[index];

        __syncthreads();

        if( 0 == threadIdx.x ) {
            int sum = 0;
            for( int i = 0; i < THREADS_PER_BLOCK; i++ ) {
                sum += temp[i];
            }
            atomicAdd(c, sum);
        }
    }

    int main( void )
    {
        int *a, *b, *c;
        int *dev_a, *dev_b, *dev_c;
        int size = N * sizeof( int );

        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, sizeof(int) );

        a = (int*)malloc(size);
        b = (int*)malloc(size);
        c = (int*)malloc(sizeof(int));

        for( int i = 0; i < N; i++ ) {
            a[i] = 1;
            b[i] = 1;
        }

        cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
        cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

        dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

        cudaMemcpy( c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

        printf("Dot product = %d\n", *c);

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);

        free(a);
        free(b);
        free(c);

        return 0;
    }

2 Answers

12

cudaFree doesn't erase anything, it simply returns memory to a pool to be re-allocated. cudaMalloc doesn't guarantee the value of memory that has been allocated. You need to initialize memory (both global and shared) that your program uses, in order to have consistent results. The same is true for malloc and free, by the way.
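As a rough illustration of the advice above, here is a minimal sketch that zeroes the accumulator with cudaMemset before each launch. The kernel is the one from the question; the helper name run_dot_with_memset is made up for this example, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 8192
#define THREADS_PER_BLOCK 512

// Kernel from the question: each block sums its products in shared memory,
// then thread 0 adds the block's partial sum into *c.
__global__ void dot(const int *a, const int *b, int *c)
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++) {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
}

// Hypothetical helper (not part of the original post): allocate, compute, free.
int run_dot_with_memset(void)
{
    int size = N * sizeof(int);
    int *a = (int *)malloc(size);
    int *b = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = 1; b[i] = 1; }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // cudaMalloc does not clear memory, so zero the accumulator explicitly.
    cudaMemset(dev_c, 0, sizeof(int));

    dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    int result = 0;
    cudaMemcpy(&result, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b);
    return result;
}

int main(void)
{
    // Repeated runs may get the same device memory back from cudaMalloc;
    // the memset keeps the result independent of whatever was left there.
    for (int run = 0; run < 3; run++) {
        printf("Dot product = %d\n", run_dot_with_memset());
    }
    return 0;
}
```

With both vectors filled with ones, each run should report 8192 regardless of what the previous run left in device memory.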


4 Comments

I initialized the shared variable within the kernel and it works like a charm. Thanks!
@ZviBar I am having the same issue. Can you please tell me how you initialized it within the kernel? Doesn't every thread reset that value if we initialize inside the kernel? Thank you in advance.
@RajindRuparathna that was over three years ago and I haven't touched CUDA since then. Sorry.
Doing a cudaMemcpy from an initialized array worked for me: cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice); where c is an array initialized to zero using a for loop.
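On the question of resetting the value inside a kernel: zeroing *c from within dot itself would be risky, because blocks run in no guaranteed order and a late block could wipe out sums that other blocks have already added. One possible way around that, sketched below, is a separate one-thread kernel launched before dot on the same (default) stream, where kernels execute in the order they were issued. The names reset and run_dot_with_reset_kernel are invented for this sketch, and error checking is omitted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 8192
#define THREADS_PER_BLOCK 512

// Hypothetical one-thread kernel that clears the accumulator.
__global__ void reset(int *c)
{
    *c = 0;
}

// Kernel from the question, unchanged in behavior.
__global__ void dot(const int *a, const int *b, int *c)
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++) {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
}

// Hypothetical helper: runs reset, then dot, and returns the result.
int run_dot_with_reset_kernel(void)
{
    int size = N * sizeof(int);
    int *a = (int *)malloc(size);
    int *b = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = 1; b[i] = 1; }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // Both launches go to the default stream, so reset finishes before
    // any block of dot starts adding to *dev_c.
    reset<<<1, 1>>>(dev_c);
    dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    int result = 0;
    cudaMemcpy(&result, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b);
    return result;
}

int main(void)
{
    printf("Dot product = %d\n", run_dot_with_reset_kernel());
    return 0;
}
```

Note that dev_c in the question holds a single int, so any host-to-device copy used to initialize it should transfer sizeof(int) bytes rather than the full array size.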
6

From the documentation of cudaMalloc():

The memory is not cleared.

That means that dev_c is not initialized, and your atomicAdd(c,sum); will add to any random value that happens to be stored in memory at the returned position.
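To make the effect reproducible rather than dependent on leftover memory, the sketch below deliberately preloads dev_c with a known value before launching the kernel from the question. The helper dot_with_seed and its seed parameter are assumptions introduced only for this demonstration; error checking is omitted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 8192
#define THREADS_PER_BLOCK 512

// Kernel from the question: adds each block's partial sum into *c.
__global__ void dot(const int *a, const int *b, int *c)
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++) {
            sum += temp[i];
        }
        atomicAdd(c, sum);
    }
}

// Hypothetical helper: writes `seed` into dev_c to stand in for stale
// memory contents, runs the kernel once, and returns what comes back.
int dot_with_seed(int seed)
{
    int size = N * sizeof(int);
    int *a = (int *)malloc(size);
    int *b = (int *)malloc(size);
    for (int i = 0; i < N; i++) { a[i] = 1; b[i] = 1; }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // Stand-in for "whatever happened to be in memory".
    cudaMemcpy(dev_c, &seed, sizeof(int), cudaMemcpyHostToDevice);

    dot<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    int result = 0;
    cudaMemcpy(&result, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b);
    return result;
}

int main(void)
{
    printf("seed 0:    %d\n", dot_with_seed(0));
    printf("seed 8192: %d\n", dot_with_seed(8192));
    return 0;
}
```

A seed of 0 should give the correct 8192, while a seed of 8192 (the value a previous run would have left behind) should give 16384, which matches the behavior described in the question.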
