Description
Currently clang is able to lower __int128 add/subtract/multiply operations on nvptx and amdgpu. However, it lowers __int128 division to the compiler-rt libcall __divti3. compiler-rt currently does not support the nvptx or amdgpu targets. Even if it did, the amdgpu backend does not support ISA-level linking, so it would be unable to link compiler-rt after LLVM codegen.
failure on amdgpu: https://godbolt.org/z/4oqPoYGG9
failure on nvptx: https://godbolt.org/z/411M3x4Eh
__int128 division on x86_64 showing lowering to __divti3: https://godbolt.org/z/b793fE7E5
__int128 division with nvcc: https://godbolt.org/z/7WaM7vG9j
compiler-rt implementation of 128 bit integer division: https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/int_div_impl.inc
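For reference, the pattern at issue can be sketched as below. This is only an illustration and not necessarily the exact code behind the godbolt links; the function name `div128` is hypothetical, and the `__device__` fallback macro is an assumption added so the snippet also builds as plain host C++.

```cpp
// In CUDA/HIP, __device__ is provided by the runtime headers; define it away
// when building as ordinary host C++ (assumption made for this sketch only).
#ifndef __device__
#define __device__
#endif

// Hypothetical device function. The division below is what gets lowered to a
// call to __divti3; add, subtract and multiply on __int128 do not need a libcall.
__device__ __int128 div128(__int128 a, __int128 b) {
  return a / b;
}
```

On x86_64 this links because the host runtime library supplies __divti3; on nvptx and amdgpu there is no such library to resolve the call.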
Ideally, the nvptx and amdgpu backends should support ISA-level linking and compiler-rt. However, that might take some time.
Another option is to let LLVM lower __int128 division to instructions instead of a libcall (https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp#L4398). However, this may not be worth the effort.
Another option is to implement __divti3 as an inline function in the default clang header for CUDA/HIP. If __int128 division is found in device code, mark the function as used so that it is emitted. This seems to be a feasible solution.
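A rough sketch of what such a header function could look like is shown below. It is not the compiler-rt code: the names `sketch_udivti3` and `sketch_divti3` are hypothetical (chosen to avoid clashing with the real __divti3 on the host), and the unsigned step uses a plain shift-subtract loop rather than the optimized routine compiler-rt uses. In a CUDA/HIP header the functions would additionally need `__device__`. Only the sign handling follows the same general idea as int_div_impl.inc: divide the magnitudes, then restore the sign.

```cpp
typedef __int128 ti_int;
typedef unsigned __int128 tu_int;

// Unsigned 128-bit division by bitwise shift-subtract (restoring division).
// Assumes d != 0; like the native operator, division by zero is not handled.
static inline tu_int sketch_udivti3(tu_int n, tu_int d) {
  tu_int q = 0, r = 0;
  for (int i = 127; i >= 0; --i) {
    // Remember the bit shifted out of r so a large divisor is handled correctly.
    bool carry = (r >> 127) != 0;
    r = (r << 1) | ((n >> i) & 1);
    if (carry || r >= d) {
      r -= d;                 // wraps correctly modulo 2^128 when carry is set
      q |= (tu_int)1 << i;
    }
  }
  return q;
}

// Signed 128-bit division built on the unsigned helper.
// Relies on arithmetic right shift of negative values, which clang provides.
static inline ti_int sketch_divti3(ti_int a, ti_int b) {
  ti_int s_a = a >> 127;      // 0 if a >= 0, -1 if a < 0
  ti_int s_b = b >> 127;      // 0 if b >= 0, -1 if b < 0
  tu_int a_u = ((tu_int)a ^ (tu_int)s_a) - (tu_int)s_a;  // |a|
  tu_int b_u = ((tu_int)b ^ (tu_int)s_b) - (tu_int)s_b;  // |b|
  tu_int s = (tu_int)(s_a ^ s_b);                        // sign of the quotient
  return (ti_int)((sketch_udivti3(a_u, b_u) ^ s) - s);   // negate if s == -1
}
```

The loop always runs 128 iterations, so it is slow but branch-uniform across threads; a real header implementation would likely want the faster algorithm from compiler-rt.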
Another solution is to compile compiler-rt as a bitcode library and have clang link it through -mlink-bitcode-file. This could be a generic solution for all libcalls.