
Add Multi-Rail Support Libfabric Transport #19

Open
a-szegel wants to merge 4 commits into NVIDIA:devel from a-szegel:multi-rail
Conversation

@a-szegel
Contributor

Marking this as a draft PR because I have not finished testing yet.

a-szegel and others added 4 commits October 16, 2025 23:37
CUDA 11.3 introduced the cuFlushGPUDirectRDMAWrites API, which takes the place of the host transport enforce_cst API. NVSHMEM no longer supports CUDA 11, so these legacy APIs can be removed. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
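For reference, a minimal sketch (not the transport's actual code) of how the CUDA 11.3+ driver API can be called to flush pending GPUDirect RDMA writes in place of a transport-specific enforce_cst hook; the helper name and call site are illustrative.

```c
#include <cuda.h>
#include <stdio.h>

/* Hedged sketch: flush pending GPUDirect RDMA writes so they become
 * visible after the flush returns. Requires CUDA 11.3+ and a platform
 * that reports flush support via
 * CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_FLUSH_WRITES_OPTIONS. */
static int flush_gdr_writes(void)
{
    CUresult res = cuFlushGPUDirectRDMAWrites(
        CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TARGET_CURRENT_CTX,
        CU_FLUSH_GPU_DIRECT_RDMA_WRITES_TO_OWNER);
    if (res != CUDA_SUCCESS) {
        fprintf(stderr, "cuFlushGPUDirectRDMAWrites failed: %d\n", (int)res);
        return -1;
    }
    return 0;
}
```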
The existing is_proxy variable is really the qp_index; rename it everywhere for consistency. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Attempt to request FI_PROGRESS_AUTO to see if the libfabric provider supports it; if it doesn't, fall back to FI_PROGRESS_MANUAL. FI_PROGRESS_AUTO means we do not need to call into the progress engine for submitted operations to complete. This lets us remove the host endpoint from the progress call: the host endpoint only needs to be progressed when the user calls nvshmem_quiet() from the host. It also allows us to set the threading model to FI_THREAD_COMPLETION, because the host only progresses the host EP and the proxy only progresses the proxy EP, which complies with FI_THREAD_COMPLETION. An edge case exists where the user calls nvshmem_quiet() on the host QP_IDX from a GPU kernel, but this is illegal because the user should not call QP APIs on QPs that were not provided to them via the QP creation APIs. This patch should offer a performance improvement because it reduces the number of EPs progressed in the critical path and allows the libfabric provider to reduce locking due to the FI_THREAD_COMPLETION threading model. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
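A minimal sketch, under stated assumptions, of the probe-and-fall-back logic this commit describes: request FI_PROGRESS_AUTO and FI_THREAD_COMPLETION in the fi_getinfo hints, and retry with FI_PROGRESS_MANUAL if no provider matches. The function name and the libfabric API version are illustrative, not the transport's actual code.

```c
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <stdio.h>

/* Hedged sketch: prefer automatic data progress, fall back to manual. */
static struct fi_info *get_provider_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    if (!hints)
        return NULL;

    hints->domain_attr->threading = FI_THREAD_COMPLETION;
    hints->domain_attr->data_progress = FI_PROGRESS_AUTO;

    ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
    if (ret) {
        /* No provider offers automatic progress; fall back to manual
         * progress, which requires calling into the progress engine. */
        hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
        ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
    }

    fi_freeinfo(hints);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %s\n", fi_strerror(-ret));
        return NULL;
    }
    return info;
}
```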
This change implements multi-rail support for the libfabric host proxy transport. The transport changes from having 1 domain with 2 EPs to having 1 host domain on NIC 1 and one proxy domain per NIC. Splitting the host EP and proxy EP into separate domains was done to keep the code simple. Every domain resource (including the AV) was bound 1-1 per EP, so this change should be a functional no-op. In the future, when the QP API is implemented on the libfabric host proxy transport, this can easily be extended to N EPs per domain. The code uses a round-robin load balancer to assign messages to NICs. One NIC is used for an entire operation call into the libfabric transport (including put-signal), but not for messages that are segmented due to size or MR boundaries. The number of NICs (domains) per PE is limited by the size of struct nvshmemt_libfabric_mem_handle_t. A new env variable, NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE, controls the max number of NICs per PE. Thank you Justin for contributing an initial implementation of multi-rail which I built on top of. Co-authored-by: Justin Chui <justchiu@amazon.com> Signed-off-by: Seth Zegelstein <szegel@amazon.com>
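A minimal sketch of the round-robin NIC selection and the NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE cap described above; the helper and variable names (nic_count, next_nic, select_nic) are illustrative, not the transport's actual identifiers.

```c
#include <stdlib.h>

/* Hedged sketch: one NIC is chosen per operation entering the transport
 * and reused for that whole operation (including put-signal). */
static int nic_count;   /* resolved at init, capped by the env variable */
static int next_nic;    /* per-PE round-robin cursor                    */

static void init_nic_count(int nics_detected)
{
    const char *env = getenv("NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE");
    int max_nics = env ? atoi(env) : nics_detected;

    if (max_nics < 1)
        max_nics = 1;
    nic_count = nics_detected < max_nics ? nics_detected : max_nics;
}

/* Pick the proxy domain/EP index (NIC) for the next operation. */
static int select_nic(void)
{
    int nic = next_nic;
    next_nic = (next_nic + 1) % nic_count;
    return nic;
}
```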
@a-szegel
Contributor Author

I was able to get all perftests running with this PR plus #36 and #26.

@a-szegel a-szegel marked this pull request as ready for review November 24, 2025 19:30