
I am new to CUDA and ran into a problem when writing a singleton/global variable that uses CUDA. The singleton allocates some CUDA memory and tries to free it in its destructor. However, the destructor crashes with cudaError 29, "driver shutting down".

From some searching, I gather the reason may be that the singleton destructor is called after the program exits, when CUDA has already shut down.

https://github.com/NVlabs/SASSI/issues/4 This link reports a similar issue where a CUDA function is called in the destructor of a static member.

https://devtalk.nvidia.com/default/topic/457922/cudafree-crash-in-destructor-when-exit-is-called/ This link reports the same question and an unclear solution.

Honestly, I do not have much CUDA knowledge, so I would like to ask for a detailed explanation and a proper solution to this problem.

EDIT:

Thanks to @Robert Crovella's reminder, I did some tests to reproduce the problem. I found that it happens with both singletons and global variables when the variable is a std::unordered_map or std::map whose value objects call CUDA in their destructors.

Working code, where no std::map is used:

#include <iostream>
#include <map>

// Prints the cudaError_t return value of x (0 == cudaSuccess)
#define CUDA_CHECK(x) std::cerr << (x) << std::endl;

class cuda_user {
    char* data;
public:
    cuda_user() {
        std::cerr << "constr" << std::endl;
        CUDA_CHECK(cudaMalloc((void**)&data, 1024));
    }
    void foo() { std::cerr << "foo" << std::endl; }
    ~cuda_user() {
        std::cerr << "destr" << std::endl;
        CUDA_CHECK(cudaFree(data));
    }
};

cuda_user cu;

int main() {
    cu.foo();
}

outputs:

constr
0
foo
destr
0

Crashing code, with the same cuda_user class, but with a std::map:

#include <iostream>
#include <map>

// Prints the cudaError_t return value of x (0 == cudaSuccess)
#define CUDA_CHECK(x) std::cerr << (x) << std::endl;

class cuda_user {
    char* data;
public:
    cuda_user() {
        std::cerr << "constr" << std::endl;
        CUDA_CHECK(cudaMalloc((void**)&data, 1024));
    }
    void foo() { std::cerr << "foo" << std::endl; }
    ~cuda_user() {
        std::cerr << "destr" << std::endl;
        CUDA_CHECK(cudaFree(data));
    }
};

std::map<int, cuda_user> map;

int main() {
    map[1].foo();
}

Outputs:

constr
0
foo
destr
29    << Error!

Update:

I am using gcc 4.8 and nvcc 7.5 on CentOS 6.3.

  • Have your singletons clean up before you leave main: int main() { { SingletonController singcont; /* .... */ } return 0; } If you have threads, you should also make sure they end before you leave the inner scope block in main. Then the destructor of the SingletonController will have no races with other threads and you have a clean shutdown. Of course, most who do singletons do not do it that way... Commented Mar 5, 2016 at 15:07
  • Don't know anything about CUDA, but is there a particular reason why you need any memory to be freed milliseconds before the operating system eliminates your process and releases all memory anyway? I sense an unfounded fear of "memory leaks" here, but perhaps with CUDA things are different. Commented Mar 5, 2016 at 16:01
  • Singletons are notoriously tricky, and there are at least two canonical implementations that have significant differences. Can you give a simplified example of what you are using? I suspect your exact implementation could affect the viability of the approach suggested by @BitTickler. And if you don't intend to use a "cleanable" singleton design pattern, then the approach suggested by Christian Hackl is probably simplest: eliminate CUDA calls from the destructor. Commented Mar 5, 2016 at 16:14
  • The placement of CUDA calls in a global object outside of main scope will lead to problematic behavior. See here. Although that description mostly focuses on kernel calls in such a class/object, the hazard applies to any CUDA call, as you have discovered. I would suggest your question is arguably a duplicate of that one, at least insofar as you are requesting a description of what is happening, and now that your example is devoid of any singleton character. Commented Mar 6, 2016 at 13:31
  • @BoLi: Yes, sorry for the terminology mistake. When you use the CUDA runtime API, the CUDA front end silently emits a lot of boilerplate support code which has the same scope as your translation-unit-scope map/singleton/object. Whether your code will work comes down to whether the particular sequence of object instantiation and destruction means there is still an active CUDA context present or not when your classes call their constructor/destructor containing runtime API calls. The order of object instantiation and destruction isn't defined in the language -- this is effectively relying on UB. Commented Mar 6, 2016 at 13:32

1 Answer


[Expanding comments into a summary answer]

Your code is unknowingly relying on undefined behaviour (the order of destruction of translation unit objects), and there is no real workaround other than to explicitly control the lifespan of objects that contain CUDA runtime API calls in their destructors, or simply to avoid those API calls in destructors altogether.

In detail:

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API calls which rely on a CUDA context can be executed. If your object containing CUDA runtime API calls in its destructor invokes the API after the context is torn down, your code may fail with a runtime error. C++ doesn't define the order of destruction when objects fall out of scope. Your singleton or object needs to be destroyed before the CUDA context is torn down, but there is no guarantee that will occur. This is effectively undefined behaviour.

You can see a more complete example of what happens (in the context of a kernel launch) in this answer.
