I am compiling NCCL 2.27.5-1 (I also tried 2.28.9-1) from source for a V100 GPU (sm_70). My goal is for libnccl.so to contain compute_70 PTX for every kernel.
Despite passing explicit -gencode=arch=compute_70,code=compute_70 flags to the build, the final libnccl.so does not contain PTX for the standard ncclDevKernel functions. It only contains PTX for ncclSymDevKernel functions.
However, if I inspect the intermediate object files (e.g., all_gather.o), the PTX for ncclDevKernel is clearly present.
I am using the following command to build NCCL:

    make -j src.build \
        NVCC_GENCODE="-gencode=arch=compute_70,code=compute_70" \
        CUDA_HOME="/opt/cuda-12.6" \
        CICC_PATH=$CUDA_HOME/nvvm/bin/cicc \
        KEEP=1 \
        CUDARTLIB=cudart \
        LDFLAGS="-L/opt/cuda-12.6/lib64 -lcudadevrt"

I also need the shared CUDA runtime (CUDARTLIB=cudart), and I link cudadevrt because otherwise the symbol __fatbinwrap_aea09599_22_cuda_device_runtime_cu_71a762bb_14119 is missing when I compile with only a compute_XX gencode.
Checking the build artifacts, the PTX generation seems successful at the compilation stage.

    # Checking build/obj/device/genobj/all_gather.o
    cuobjdump --dump-ptx build/obj/device/genobj/all_gather.o | grep ncclDevKernel
    # .visible .entry _Z31ncclDevKernel_AllGather_RING_LL24ncclDevKernelArgsStorageILm4096EE

When I dump the final shared library, the standard kernels are gone from the PTX section.

    cuobjdump --dump-ptx build/lib/libnccl.so | grep ncclDevKernel
    # empty, only ncclSymDevKernel are present

How can I modify the NCCL build command (or NVLDFLAGS) so that the PTX for all kernels is kept in the final library?

    cuobjdump -all -ptx ...

    Error in rank 0: CUDA error: no kernel image is available for execution on the device
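
To make the object-versus-library comparison above easier to reproduce, here is a minimal sketch that lists the kernel entry names found in PTX text and reports which ones are missing from a second listing. The function names `list_ptx_entries` and `missing_entries` and the file names `obj.txt` and `lib.txt` are hypothetical, chosen for illustration. The sketch assumes the `cuobjdump --dump-ptx` output contains `.entry <name>` lines like the one shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of NCCL or the CUDA toolkit).

# Read PTX text on stdin and print the unique, sorted kernel entry names.
# Assumes entry lines look like: ".visible .entry <mangled_name>(".
list_ptx_entries() {
    grep -oE '\.entry[[:space:]]+[A-Za-z_][A-Za-z0-9_$]*' \
        | awk '{ print $2 }' \
        | LC_ALL=C sort -u
}

# Print names listed in file $1 that do not appear in file $2.
# Both files must be sorted in the C locale, as list_ptx_entries produces.
missing_entries() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Intended usage with the paths from this post (requires cuobjdump):
#   cuobjdump --dump-ptx build/obj/device/genobj/all_gather.o | list_ptx_entries > obj.txt
#   cuobjdump --dump-ptx build/lib/libnccl.so | list_ptx_entries > lib.txt
#   missing_entries obj.txt lib.txt
```

Any name printed by `missing_entries` is a kernel whose PTX exists in the object file but not in the final library, which narrows the problem down to the device-link or final link step rather than compilation.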