[clang-repl][CUDA] Move CUDA module registration to beginning of global_ctors #66658
Conversation
@llvm/pr-subscribers-clang @llvm/pr-subscribers-clang-codegen

Changes: CUDA device code needs to be registered with the runtime before kernels can be launched. This is done through a global constructor. User code in the Clang interpreter is also executed through global_ctors. This patch ensures kernels can be launched in the same interpreter iteration they are defined in, by placing the registration first in the list. This allows #include-ing a large portion of code that defines device functions and also launches kernels in clang-repl.

Full diff: https://github.com/llvm/llvm-project/pull/66658.diff (2 files affected)
```diff
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8b0c9340775cbe9..783865409c778f5 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -794,7 +794,7 @@ void CodeGenModule::Release() {
     AddGlobalCtor(ObjCInitFunction);
   if (Context.getLangOpts().CUDA && CUDARuntime) {
     if (llvm::Function *CudaCtorFunction = CUDARuntime->finalizeModule())
-      AddGlobalCtor(CudaCtorFunction);
+      AddGlobalCtor(CudaCtorFunction, 0);
   }
   if (OpenMPRuntime) {
     if (llvm::Function *OpenMPRequiresDirectiveRegFun =
diff --git a/clang/test/Interpreter/CUDA/launch-same-ptu.cu b/clang/test/Interpreter/CUDA/launch-same-ptu.cu
new file mode 100644
index 000000000000000..93e203a47212fbf
--- /dev/null
+++ b/clang/test/Interpreter/CUDA/launch-same-ptu.cu
@@ -0,0 +1,21 @@
+// Tests __device__ function calls
+// RUN: cat %s | clang-repl --cuda | FileCheck %s
+
+extern "C" int printf(const char*, ...);
+
+int var;
+int* devptr = nullptr;
+printf("cudaMalloc: %d\n", cudaMalloc((void **) &devptr, sizeof(int)));
+// CHECK: cudaMalloc: 0
+
+__device__ inline void test_device(int* value) { *value = 42; } __global__ void test_kernel(int* value) { test_device(value); } test_kernel<<<1,1>>>(devptr);
+printf("CUDA Error: %d\n", cudaGetLastError());
+// CHECK-NEXT: CUDA Error: 0
+
+printf("cudaMemcpy: %d\n", cudaMemcpy(&var, devptr, sizeof(int), cudaMemcpyDeviceToHost));
+// CHECK-NEXT: cudaMemcpy: 0
+
+printf("Value: %d\n", var);
+// CHECK-NEXT: Value: 42
+
+%quit
```
fb806d7 to bed2919 Compare

```diff
 if (Context.getLangOpts().CUDA && CUDARuntime) {
   if (llvm::Function *CudaCtorFunction = CUDARuntime->finalizeModule())
-    AddGlobalCtor(CudaCtorFunction);
+    AddGlobalCtor(CudaCtorFunction, /*Priority=*/0);
```
> User code in the Clang interpreter is also executed through global_ctors. This patch ensures kernels can be launched in the same iteration they are defined in by making the registration first in the list.
This sounds like an application-specific problem that may be addressable by lowering priority of user code initializers.
In general, I'm very reluctant to change the initialization order to be different from what NVCC generates. We do need to interoperate with NVIDIA's libraries and the change in initialization order is potentially risky. Considering that we have no practical way to test it, and that it appears to address something that affects only one application (and may be dealt with on the app level), I do not think we should change the priority for the clang-generated kernel registration code.
The underlying issue is not actually clang-repl specific; it also affects clang. For example, this seems to succeed with nvcc but fails with clang:

```cpp
#include <cstdio>

__global__ void kernel() {}

class C {
public:
  C() {
    kernel<<<1, 1>>>();
    printf("Error: %d\n", cudaGetLastError());
  }
};

C c;

int main() {}
```

This is fixed by this patch. Maybe we can look for a proper solution to this?
This is a very contrived example. While I agree that it currently does not work with CUDA, I am still not convinced that it is a problem that needs to be solved in clang.
Let's assume you've set the priority at X. Launching kernels from dynamic initializers with higher priority will still be broken, so the patch does not solve the problem conceptually.
If you set the priority of CUDA kernel initializers at the highest level (is that the intent of priority=0?), can you guarantee that kernel registration never depends on anything else that was expected to be initialized before it? We also no longer have any wiggle room to run anything before kernel registration when we need to.
@MaskRay Fangrui, WDYT about bumping dynamic initializer priority in principle? Is there anything else we need to worry about?
@Artem-B, I don’t think @argentite is pushing for this particular solution to the problem. It seems we agree that this is a problem and that the behavior of clang diverges from the reference implementation. I believe we should figure out how to fix it.

Rather than changing the priority, we could reserve a slot for the kernel registration that respects the init order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd start with checking what NVCC generates for the initializers. Considering that ultimately we need to conform to CUDA runtime expectations, and given the lack of documentation, NVCC-generated code is the only reference we have.

Compile your example with -keep and see what the NVCC-generated registration code looks like.
@argentite ping.
I have the same fear as @Artem-B: higher-than-default priorities are also sometimes reserved. We really need to see what nvcc does here, but what I could imagine (at least how I would solve it) is putting the constructor at the same priority, but before all other constructors in the list.
@argentite, could we revisit this?
cc: @hahnjo

Curious if this helps with a recent issue we spotted! If yes, I would like to debug more and try finishing up the work left here.