I have two kernels for matrix multiplication: one uses global memory and the other uses constant memory. I used the CUDA profiler to compare the speed of both kernels.
I tested both on a compute capability 1.3 device and on a 2.0 device. I expected the constant memory kernel to be faster on the 1.3 device, and the global memory kernel to be faster on the 2.0 device because global memory loads are cached on that architecture. However, on both devices the global memory kernel is faster. Is this due to memory coalescing of global memory accesses? If so, is there a way to make the constant memory kernel faster?
I'm using 80x80 matrices and a block size of 16.
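For reference, the host-side launch looks roughly like this (a sketch; variable names are illustrative, but the 16x16 block follows from the 2D thread indexing in the kernels below):

    dim3 dimBlock(16, 16);                 // 256 threads per block
    dim3 dimGrid((80 + dimBlock.x - 1) / dimBlock.x,   // = 5
                 (80 + dimBlock.y - 1) / dimBlock.y);  // = 5
    MatMulGlobKernel<<<dimGrid, dimBlock>>>(A, B, C);

With 80x80 matrices this gives a 5x5 grid, so every thread maps to exactly one output element.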
Here is the global memory kernel
    __global__ void MatMulGlobKernel(const Matriz A, const Matriz B, Matriz C) {
        float Cvalor = 0;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        // guard against out-of-range threads (row == A.height is already
        // out of bounds, hence >=)
        if (row >= A.height || col >= B.width) return;
        for (int e = 0; e < A.width; ++e)
            Cvalor += A.valores[row * A.width + e] * B.valores[e * B.width + col];
        C.valores[row * C.width + col] = Cvalor;
    }

A.valores, B.valores and C.valores reside in global memory.
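The Matriz struct itself is just the dimensions plus a data pointer (a sketch inferred from how the kernels use it; the field names match my code, the rest is assumed):

    typedef struct {
        int width;       // number of columns
        int height;      // number of rows
        float *valores;  // row-major element data
    } Matriz;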
Now here is the constant memory kernel.
    __global__ void MatMulConstKernel(const Matriz A, const Matriz B, Matriz C) {
        float Cvalor = 0;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        // same bounds guard as the global memory kernel
        if (row >= A.height || col >= B.width) return;
        for (int e = 0; e < A.width; ++e)
            Cvalor += A_const_valores[row * A.width + e] * B_const_valores[e * B.width + col];
        C.valores[row * C.width + col] = Cvalor;
    }

A_const_valores and B_const_valores reside in constant memory while C.valores resides in global memory.
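For completeness, the constant arrays are declared at file scope along these lines (a sketch; the size macro and host variable names are illustrative, since __constant__ memory must be sized at compile time):

    #define N 80  // matrix dimension

    // Two 80x80 float arrays take 2 * 25600 B = 50 KB,
    // which fits in the 64 KB of constant memory.
    __constant__ float A_const_valores[N * N];
    __constant__ float B_const_valores[N * N];

    // host side, before the kernel launch:
    // cudaMemcpyToSymbol(A_const_valores, h_A, N * N * sizeof(float));
    // cudaMemcpyToSymbol(B_const_valores, h_B, N * N * sizeof(float));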
This is the profiler result for the 1.3 device (Tesla M1060):
Const kernel 101.70us
Global kernel 51.424us
and for the 2.0 device (GTX 650):
Const kernel 178.05us
Global kernel 58.144us