
Conversation

@argentite
Contributor

CUDA device code needs to be registered with the runtime before kernels can be launched. This is done through a global constructor. User code in the Clang interpreter is also executed through global_ctors. This patch ensures kernels can be launched in the same iteration in which they are defined, by making the registration come first in the list.

This allows #include-ing a large portion of code that defines device functions and also launches kernels in clang-repl.

@argentite argentite added the cuda label Sep 18, 2023
@llvmbot llvmbot added clang (Clang issues not falling into any other category) and clang:codegen (IR generation bugs: mangling, exceptions, etc.) labels Sep 18, 2023
@llvmbot
Member

llvmbot commented Sep 18, 2023

@llvm/pr-subscribers-clang

@llvm/pr-subscribers-clang-codegen

Changes

CUDA device code needs to be registered with the runtime before kernels can be launched. This is done through a global constructor. User code in the Clang interpreter is also executed through global_ctors. This patch ensures kernels can be launched in the same iteration in which they are defined, by making the registration come first in the list.

This allows #include-ing a large portion of code that defines device functions and also launches kernels in clang-repl.


Full diff: https://github.com/llvm/llvm-project/pull/66658.diff

2 Files Affected:

  • (modified) clang/lib/CodeGen/CodeGenModule.cpp (+1-1)
  • (added) clang/test/Interpreter/CUDA/launch-same-ptu.cu (+21)
```diff
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 8b0c9340775cbe9..783865409c778f5 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -794,7 +794,7 @@ void CodeGenModule::Release() {
       AddGlobalCtor(ObjCInitFunction);
   if (Context.getLangOpts().CUDA && CUDARuntime) {
     if (llvm::Function *CudaCtorFunction = CUDARuntime->finalizeModule())
-      AddGlobalCtor(CudaCtorFunction);
+      AddGlobalCtor(CudaCtorFunction, 0);
   }
   if (OpenMPRuntime) {
     if (llvm::Function *OpenMPRequiresDirectiveRegFun =
diff --git a/clang/test/Interpreter/CUDA/launch-same-ptu.cu b/clang/test/Interpreter/CUDA/launch-same-ptu.cu
new file mode 100644
index 000000000000000..93e203a47212fbf
--- /dev/null
+++ b/clang/test/Interpreter/CUDA/launch-same-ptu.cu
@@ -0,0 +1,21 @@
+// Tests __device__ function calls
+// RUN: cat %s | clang-repl --cuda | FileCheck %s
+
+extern "C" int printf(const char*, ...);
+
+int var;
+int* devptr = nullptr;
+printf("cudaMalloc: %d\n", cudaMalloc((void **) &devptr, sizeof(int)));
+// CHECK: cudaMalloc: 0
+
+__device__ inline void test_device(int* value) { *value = 42; } __global__ void test_kernel(int* value) { test_device(value); } test_kernel<<<1,1>>>(devptr);
+printf("CUDA Error: %d\n", cudaGetLastError());
+// CHECK-NEXT: CUDA Error: 0
+
+printf("cudaMemcpy: %d\n", cudaMemcpy(&var, devptr, sizeof(int), cudaMemcpyDeviceToHost));
+// CHECK-NEXT: cudaMemcpy: 0
+
+printf("Value: %d\n", var);
+// CHECK-NEXT: Value: 42
+
+%quit
```
```diff
  if (Context.getLangOpts().CUDA && CUDARuntime) {
    if (llvm::Function *CudaCtorFunction = CUDARuntime->finalizeModule())
-     AddGlobalCtor(CudaCtorFunction);
+     AddGlobalCtor(CudaCtorFunction, /*Priority=*/0);
```
Member

> User code in the Clang interpreter is also executed through global_ctors. This patch ensures kernels can be launched in the same iteration in which they are defined, by making the registration come first in the list.

This sounds like an application-specific problem that may be addressable by lowering the priority of user-code initializers.

In general, I'm very reluctant to change the initialization order to be different from what NVCC generates. We do need to interoperate with NVIDIA's libraries and the change in initialization order is potentially risky. Considering that we have no practical way to test it, and that it appears to address something that affects only one application (and may be dealt with on the app level), I do not think we should change the priority for the clang-generated kernel registration code.

Contributor Author

The underlying issue is not actually clang-repl specific; it also affects clang. For example, this seems to succeed with nvcc but fails with clang:

```cpp
#include <cstdio>

__global__ void kernel() {}

class C {
public:
  C() {
    kernel<<<1, 1>>>();
    printf("Error: %d\n", cudaGetLastError());
  }
};
C c;

int main() {}
```

This is fixed by this patch. Maybe we can look for a proper solution to this?

Member

This is a very contrived example. While I agree that it currently does not work with CUDA, I am still not convinced that it is a problem that needs to be solved in clang.

Let's assume you've set the priority at X. Launching kernels from dynamic initializers with higher priority will still be broken, so the patch does not solve the problem conceptually.

If you set the priority of CUDA kernel initializers at the highest level (is that the intent of priority=0?), can you guarantee that kernel registration never depends on anything else that was expected to get initialized before it? We also no longer have any wiggle room to run anything before kernel registration when we need to.

@MaskRay Fangrui, WDYT about bumping dynamic initializer priority in principle? Is there anything else we need to worry about?

Contributor

@Artem-B, I don’t think @argentite is pushing particularly for this solution to the problem. It seems we agree that it is a problem and that the behavior of clang diverges from the reference implementation. I believe we should figure out how to fix it.

Rather than changing the priority, we could reserve a slot for the kernel registration while respecting the init order.

Member

I'd start with checking what NVCC generates for the initializers. Considering that we ultimately need to conform to CUDA runtime expectations, and given the lack of documentation, NVCC-generated code is the only reference we have.
Compile your example with `-keep` and see what the NVCC-generated registration code looks like.

Contributor

@argentite ping.

Member

I have the same fear as @Artem-B: higher-than-default priorities are also sometimes reserved. We really need to see what nvcc does here, but what I could imagine (it is at least how I would solve it) is emitting the constructor with the same priority as the others, just placed before all other constructors.

Contributor

@argentite, could we revisit this?

@vgvassilev
Contributor

cc: @hahnjo

@anutosh491
Member

Curious if this helps with a recent issue we spotted!

#158021

If yes, I would like to debug more and try finishing up the work left here.

