3

I am trying to run a matrix inversion from the device. This logic works fine if called from the host.

Compilation line is as follows (Linux):

nvcc -ccbin g++ -arch=sm_35 -rdc=true simple-inv.cu -o simple-inv -lcublas_device -lcudadevrt 

I get the following warning that I cannot seem to resolve. (My GPU is Kepler. I don't know why it is trying to link to Maxwell routines. I have Cuda 6.5-14):

nvlink warning : SM Arch ('sm_35') not found in '/usr/local/cuda/bin/../targets/x86_64-linux/lib/libcublas_device.a:maxwell_sm50_sgemm.o' 

The program runs with:

handle 0 n = 3 simple-inv.cu:63 Error [an illegal memory access was encountered] 

The test program is as follows:

#include <stdio.h> #include <stdlib.h> #include <math.h> #include <cuda_runtime.h> #include <cublas_v2.h> #define PERR(call) \ if (call) {\ fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\ cudaGetErrorString(cudaGetLastError()));\ exit(1);\ } #define ERRCHECK \ if (cudaPeekAtLastError()) { \ fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\ cudaGetErrorString(cudaGetLastError()));\ exit(1);\ } __global__ void inv_kernel(float *a_i, float *c_o, int n) { int p[3], info[1], batch; cublasHandle_t hdl; cublasStatus_t status = cublasCreate_v2(&hdl); printf("handle %d n = %d\n", status, n); info[0] = 0; batch = 1; float *a[] = {a_i}; const float *aconst[] = {a_i}; float *c[] = {c_o}; // See // http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf //http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device status = cublasSgetrfBatched(hdl, n, a, n, p, info, batch); __syncthreads(); printf("rf %d info %d\n", status, info[0]); status = cublasSgetriBatched(hdl, n, aconst, n, p, c, n, info, batch); __syncthreads(); printf("ri %d info %d\n", status, info[0]); cublasDestroy_v2(hdl); printf("done\n"); } static void run_inv(float *in, float *out, int n) { float *a_d, *c_d; PERR(cudaMalloc(&a_d, n*n*sizeof(float))); PERR(cudaMalloc(&c_d, n*n*sizeof(float))); PERR(cudaMemcpy(a_d, in, n*n*sizeof(float), cudaMemcpyHostToDevice)); inv_kernel<<<1, 1>>>(a_d, c_d, n); cudaDeviceSynchronize(); ERRCHECK; PERR(cudaMemcpy(out, c_d, n*n*sizeof(float), cudaMemcpyDeviceToHost)); PERR(cudaFree(a_d)); PERR(cudaFree(c_d)); } int main(int argc, char **argv) { float c[9]; float a[] = { 1, 2, 3, 0, 4, 5, 1, 0, 6 }; run_inv(a, c, 3); return 0; } 

I have followed the guide at http://docs.nvidia.com/cuda/cublas/index.html#device-api section 2.1.9, but I suspect I have overlooked something.

Note: Edited on 11/24 to use correct pointer inputs. This still reports illegal memory access inside the kernel.

9
  • Line 63 in the code you have posted is whitespace. Where exactly in the error occurring in the code? Commented Nov 23, 2014 at 21:42
  • Line 64 during device synch. I must have posted and older output. I suspect during the call to cublasSgetrfBatched. Commented Nov 23, 2014 at 21:47
  • (float**)a_i looks very suspicious. Surely you mean to pass the address of a_i and not its value? Commented Nov 23, 2014 at 22:03
  • 1
    @VAndrei: yes it is possible and your comment is completely irrelevant to the quesiton. Commented Nov 23, 2014 at 22:16
  • 2
    @Bob: The code you link to and your code are different, and the difference is that you have an illegal cast. *a[] = {a_i}; cublasSgetrfBatched(..., a, ....) and cublasSgetrfBatched(..., (float**)a_i, ...) are not equivalent, and if you think they are, then you need to revise the theory of pointers in C++. Commented Nov 23, 2014 at 22:22

2 Answers 2

5

NOTE: The ability to call cublas functions from device code has been removed from CUDA as of CUDA 10.0. The description in this answer only pertains to CUDA 9.x usage and prior. See here.

The warnings about sm_50 are benign. That's my way of saying "they can be safely ignored in this case".

Regarding the code you currently have posted, the problem relates to what is described in the dynamic parallelism documentation around the use of thread-local memory here.

In a nutshell, local memory of the parent thread is "out of scope" in a child kernel launch. Although it's not entirely obvious, the cublas calls from device code are (attempting) to launch child kernels. This means that declarations like this:

int p[3], info[1], 

will be problematic if those pointers (e.g. p, info) are passed to a child kernel. The numerical values of the pointers themselves will not be corrupted, but they will not point to anything "meaningful" in the memory space of the child kernel.

There are multiple ways to solve this, but one possible solution is to replace any stack/local allocations of this type with allocations from the "device heap" which can be made via in-kernel malloc.

Here is a fully worked code/example that seems to work correctly for me. The output seems to be correct for the inversion of the given sample matrix:

$ cat t605.cu #include <stdio.h> #include <stdlib.h> #include <math.h> #include <cuda_runtime.h> #include <cublas_v2.h> #define PERR(call) \ if (call) {\ fprintf(stderr, "%s:%d Error [%s] on "#call"\n", __FILE__, __LINE__,\ cudaGetErrorString(cudaGetLastError()));\ exit(1);\ } #define ERRCHECK \ if (cudaPeekAtLastError()) { \ fprintf(stderr, "%s:%d Error [%s]\n", __FILE__, __LINE__,\ cudaGetErrorString(cudaGetLastError()));\ exit(1);\ } __global__ void inv_kernel(float *a_i, float *c_o, int n) { int *p = (int *)malloc(3*sizeof(int)); int *info = (int *)malloc(sizeof(int)); int batch; cublasHandle_t hdl; cublasStatus_t status = cublasCreate_v2(&hdl); printf("handle %d n = %d\n", status, n); info[0] = 0; batch = 1; float **a = (float **)malloc(sizeof(float *)); *a = a_i; const float **aconst = (const float **)a; float **c = (float **)malloc(sizeof(float *)); *c = c_o; // See // http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf //http://stackoverflow.com/questions/27094612/cublas-matrix-inversion-from-device status = cublasSgetrfBatched(hdl, n, a, n, p, info, batch); __syncthreads(); printf("rf %d info %d\n", status, info[0]); status = cublasSgetriBatched(hdl, n, aconst, n, p, c, n, info, batch); __syncthreads(); printf("ri %d info %d\n", status, info[0]); cublasDestroy_v2(hdl); printf("done\n"); } static void run_inv(float *in, float *out, int n) { float *a_d, *c_d; PERR(cudaMalloc(&a_d, n*n*sizeof(float))); PERR(cudaMalloc(&c_d, n*n*sizeof(float))); PERR(cudaMemcpy(a_d, in, n*n*sizeof(float), cudaMemcpyHostToDevice)); inv_kernel<<<1, 1>>>(a_d, c_d, n); cudaDeviceSynchronize(); ERRCHECK; PERR(cudaMemcpy(out, c_d, n*n*sizeof(float), cudaMemcpyDeviceToHost)); PERR(cudaFree(a_d)); PERR(cudaFree(c_d)); } int main(int argc, char **argv) { float c[9]; float a[] = { 1, 2, 3, 0, 4, 5, 1, 0, 6 }; run_inv(a, c, 3); for (int i = 0; i < 3; i++){ for (int j = 0; j < 3; j++) printf("%f, ",c[(3*i)+j]); printf("\n");} return 0; } $ nvcc -arch=sm_35 -rdc=true -o t605 t605.cu -lcublas_device -lcudadevrt nvlink warning : SM Arch ('sm_35') not found in '/shared/apps/cuda/CUDA-v6.5.14/bin/..//lib64/libcublas_device.a:maxwell_sgemm.asm.o' nvlink warning : SM Arch ('sm_35') not found in '/shared/apps/cuda/CUDA-v6.5.14/bin/..//lib64/libcublas_device.a:maxwell_sm50_sgemm.o' $ ./t605 handle 0 n = 3 rf 0 info 0 ri 0 info 0 done 1.090909, -0.545455, -0.090909, 0.227273, 0.136364, -0.227273, -0.181818, 0.090909, 0.181818, $ 

For CUDA 10.0 and newer users, I would suggest using ordinary batched cublas functions from host code. In particular for matrix inverse, one option is to use matinvBatched.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks. This works for me. I had originally allocated the p and info variables, but had not realized that I also needed to allocate the a, aconst, and c variables. After reading the local memory reference section this makes sense. I would imagine that n is allocated to the global memory heap since it is part of the kernel call parameters. The handle variable probably does not apply.
Other parameters like n, batch, etc., are passed by value. Parameters passed by value have no reference back to the calling environment. This is characteristic of C/C++, not a unique CUDA concept. In fact even the pointers are passed "by value". But when these pointer values get dereferenced in the child kernel, bad things happen. For non-pointer parameters, no such dereferencing is done in the child kernel, and everything works fine. In fact, this pass-by-value occurs actually at the cublas function call (and again, later, at the child kernel launch that occurs under the hood.)
1

Could it be that some CUDA function you are running is only supported by different architecture (Even though the documentation says everything used is. I don't get the compiler warnings if I compile with -arch=sm_50. I don't have an sm_50 capable device to test though...

Furthermore those warnings look like some function asm is not available for your architecture so it was linked to a different architecture asm which is not supported by your device so you get some weird error. I think you should take this nvidia developers who understand more what their compiler is doing.

I have access to Compute 3.5 capable device, but unfortunately only with CUDA v 6.0 and using your example (slightly fixed, to compile (const float *) -> (float *) on line 42) and I don't get any compilation warnings (sadly same results though).

Also as mentioned in the comments:

(float**)a_i 

is not making a_i to be of type (float **). You should take the address: &a_i

Changing those didn't help fix the problem but those are some pointers that you can look to explore.

2 Comments

Right, my bad., sorry
When I compile with SM_50, I get "ptxas info : 'device-function-maxrregcount' is a BETA feature". The other warning goes away.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.