
Add Multi-Rail Support Libfabric Transport #19

Open
a-szegel wants to merge 4 commits into NVIDIA:devel from a-szegel:multi-rail
Conversation

@a-szegel
Contributor

Marking this as a draft PR because I have not finished testing yet.

a-szegel and others added 4 commits October 16, 2025 23:37
CUDA 11.3 introduced the cuFlushGPUDirectRDMAWrites API, which takes the place of the host transport enforce_cst API. NVSHMEM no longer supports CUDA 11, so these legacy APIs can be removed. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
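For reference, a minimal sketch (not the transport's actual code) of how the CUDA 11.3+ driver API can be called to flush pending GPUDirect RDMA writes in place of a transport-specific enforce_cst hook; the helper name and call site are illustrative.

```c
#include <cuda.h>
#include <stdio.h>

/* Hedged sketch: flush pending GPUDirect RDMA writes so they become
 * visible after the flush returns. Requires CUDA 11.3+ and a platform
 * that reports flush support via
 * CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_FLUSH_WRITES_OPTIONS. */
static int flush_gdr_writes(void)
{
    CUresult res = cuFlushGPUDirectRDMAWrites(
        CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TARGET_CURRENT_CTX,
        CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TO_OWNER);
    if (res != CUDA_SUCCESS) {
        fprintf(stderr, "cuFlushGPUDirectRDMAWrites failed: %d\n", (int)res);
        return -1;
    }
    return 0;
}
```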
The existing is_proxy variable is really the qp_index; rename it everywhere for consistency. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Attempt to request FI_PROGRESS_AUTO to see if the libfabric provider supports it; if it doesn't, fall back to FI_PROGRESS_MANUAL. FI_PROGRESS_AUTO means we do not need to call into the progress engine for submitted operations to complete. This lets us remove the host endpoint from the progress call: the host endpoint only needs to be progressed when the user calls nvshmem_quiet() from the host. It also allows us to set the threading model to FI_THREAD_COMPLETION, because the host only progresses the host EP and the proxy only progresses the proxy EP, which complies with FI_THREAD_COMPLETION. An edge case exists where the user calls nvshmem_quiet() on the host QP_IDX from a GPU kernel, but this is illegal because the user should not call QP APIs on QPs that were not provided to them via the QP creation APIs. This patch should offer a performance improvement because it reduces the number of EPs progressed in the critical path and allows the libfabric provider to reduce locking due to the FI_THREAD_COMPLETION threading model. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
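A minimal sketch, under stated assumptions, of the probe-and-fall-back logic this commit describes: request FI_PROGRESS_AUTO and FI_THREAD_COMPLETION in the fi_getinfo hints, and retry with FI_PROGRESS_MANUAL if no provider matches. The function name and the libfabric API version are illustrative, not the transport's actual code.

```c
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <stdio.h>

/* Hedged sketch: prefer automatic data progress, fall back to manual. */
static struct fi_info *get_provider_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    if (!hints)
        return NULL;

    hints->domain_attr->threading = FI_THREAD_COMPLETION;
    hints->domain_attr->data_progress = FI_PROGRESS_AUTO;

    ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
    if (ret) {
        /* No provider offers automatic progress; fall back to manual
         * progress, which requires calling into the progress engine. */
        hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
        ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
    }

    fi_freeinfo(hints);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %s\n", fi_strerror(-ret));
        return NULL;
    }
    return info;
}
```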
This change implements multi-rail support for the libfabric host proxy transport. The transport changes from having 1 domain with 2 EPs to having 1 host domain on NIC 1 and one proxy domain per NIC. Splitting the host EP and proxy EP into separate domains was done to keep the code simple. Every domain resource (including the AV) was bound 1-1 per EP, so this change should be a functional no-op. In the future, when the QP API is implemented on the libfabric host proxy transport, this can easily be extended to N EPs per domain. The code uses a round-robin load balancer to assign messages to NICs. One NIC is used for an entire operation call into the libfabric transport (including put-signal), but not for messages that are segmented due to size or MR boundaries. The number of NICs (domains) per PE is limited by the size of struct nvshmemt_libfabric_mem_handle_t. A new env variable, NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE, controls the max number of NICs per PE. Thank you Justin for contributing an initial implementation of multi-rail which I built on top of. Co-authored-by: Justin Chui <justchiu@amazon.com> Signed-off-by: Seth Zegelstein <szegel@amazon.com>
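A minimal sketch of the round-robin NIC selection and the NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE cap described above; the helper and variable names (nic_count, next_nic, select_nic) are illustrative, not the transport's actual identifiers.

```c
#include <stdlib.h>

/* Hedged sketch: one NIC is chosen per operation entering the transport
 * and reused for that whole operation (including put-signal). */
static int nic_count;   /* resolved at init, capped by the env variable */
static int next_nic;    /* per-PE round-robin cursor                    */

static void init_nic_count(int nics_detected)
{
    const char *env = getenv("NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE");
    int max_nics = env ? atoi(env) : nics_detected;

    if (max_nics < 1)
        max_nics = 1;
    nic_count = nics_detected < max_nics ? nics_detected : max_nics;
}

/* Pick the proxy domain/EP index (NIC) for the next operation. */
static int select_nic(void)
{
    int nic = next_nic;
    next_nic = (next_nic + 1) % nic_count;
    return nic;
}
```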
@a-szegel
Contributor Author

I was able to get all perftests running with this PR plus #36 and #26.

@a-szegel a-szegel marked this pull request as ready for review November 24, 2025 19:30