How to force NCCL build to embed PTX for all kernels (prevent linker from stripping ncclDevKernel PTX)?

Question

I am compiling NCCL 2.27.5-1 (I also tried 2.28.9-1) from source for a V100 GPU (sm_70). My goal is to have libnccl.so contain compute_70 PTX for every kernel.

Despite passing explicit -gencode=arch=compute_70,code=compute_70 flags to the build, the final libnccl.so does not contain PTX for the standard ncclDevKernel functions. It only contains PTX for ncclSymDevKernel functions.

However, if I inspect the intermediate object files (e.g., all_gather.o), the PTX for ncclDevKernel is clearly present.

I am using the following command to build NCCL:

    make -j src.build \
      NVCC_GENCODE="-gencode=arch=compute_70,code=compute_70" \
      CUDA_HOME="/opt/cuda-12.6" \
      CICC_PATH=$CUDA_HOME/nvvm/bin/cicc \
      KEEP=1 \
      CUDARTLIB=cudart \
      LDFLAGS="-L/opt/cuda-12.6/lib64 -lcudadevrt"

I also need the shared CUDA runtime (CUDARTLIB=cudart), and I link cudadevrt because otherwise the symbol __fatbinwrap_aea09599_22_cuda_device_runtime_cu_71a762bb_14119 is missing when I compile with only a compute_XX gencode target.
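For comparison with the PTX-only setting above, here is a minimal sketch of how an NVCC_GENCODE value requesting both SASS (code=sm_XX) and PTX (code=compute_XX) for one compute capability could be assembled. The helper name gencode_sass_and_ptx is hypothetical and not part of NCCL. Whether a combined build changes which kernels end up in libnccl.so is not something this sketch establishes.

```shell
# Hypothetical helper (not part of NCCL): build an NVCC_GENCODE value that
# requests both SASS (code=sm_XX) and PTX (code=compute_XX) for one
# compute capability, given as its two-digit number (e.g. 70).
gencode_sass_and_ptx() {
  cc=$1
  printf '%s %s\n' \
    "-gencode=arch=compute_${cc},code=sm_${cc}" \
    "-gencode=arch=compute_${cc},code=compute_${cc}"
}

# Print the value for a V100 (sm_70).
gencode_sass_and_ptx 70
```

The printed string could then be passed as NVCC_GENCODE="$(gencode_sass_and_ptx 70)" alongside the other variables of the make command.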

Checking the build artifacts, the PTX generation seems successful at the compilation stage.

    # Checking build/obj/device/genobj/all_gather.o
    cuobjdump --dump-ptx build/obj/device/genobj/all_gather.o | grep ncclDevKernel
    # .visible .entry _Z31ncclDevKernel_AllGather_RING_LL24ncclDevKernelArgsStorageILm4096EE

When I dump the final shared library, the standard kernels are gone from the PTX section.

    cuobjdump --dump-ptx build/lib/libnccl.so | grep ncclDevKernel
    # empty, only ncclSymDevKernel are present
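To make the comparison between the object files and the final library systematic rather than a one-off grep, the two checks above can be combined into a small script. This is only a sketch: the function names ptx_entries and only_in_first are hypothetical, the paths are the ones from this question, and it relies solely on the cuobjdump --dump-ptx output format shown above (lines containing `.entry <mangled name>`).

```shell
# Print the unique kernel entry names found in PTX text on stdin.
# Assumes entries look like ".visible .entry <mangled_name>" as shown above.
ptx_entries() {
  sed -n 's/.*\.entry[[:space:]]*\([A-Za-z0-9_$]*\).*/\1/p' | sort -u
}

# Print names listed in file $1 but not in file $2.
# Both files must be sorted, which ptx_entries output already is.
only_in_first() {
  comm -23 "$1" "$2"
}

# Example run against the build tree from the question. It is skipped
# when cuobjdump or the build artifacts are not available.
if command -v cuobjdump >/dev/null 2>&1 && [ -f build/lib/libnccl.so ]; then
  tmp=$(mktemp -d)
  for obj in build/obj/device/genobj/*.o; do
    cuobjdump --dump-ptx "$obj"
  done | ptx_entries > "$tmp/objects.txt"
  cuobjdump --dump-ptx build/lib/libnccl.so | ptx_entries > "$tmp/library.txt"
  # Kernels whose PTX is in the objects but not reported for the library.
  only_in_first "$tmp/objects.txt" "$tmp/library.txt"
  rm -rf "$tmp"
fi
```

Each line it prints is a kernel whose PTX appears in at least one object file but is not reported for libnccl.so by the same cuobjdump invocation.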

How can I modify the NCCL build command (or NVLDFLAGS) to force the build to keep the PTX for all kernels?

Thank you very much, @RobertCrovella. Using that command, I can see the PTX now, though I still don't understand why --dump-ptx didn't show anything for ncclDevKernel. Unfortunately, running a simple all_gather with PyTorch raises this error: Error in rank 0: CUDA error: no kernel image is available for execution on the device
– CiZ, Commented Nov 27 at 12:17
