[Feature Preview] Introduce one-sided API with ncclPut #1904

marksantesson · 2025-11-08T04:00:36Z

This is an Early Access release. It contains new features and APIs that may change before the official General Access release. In particular, the host one-sided API is not frozen. Performance may also be lower than in the General Access release and should not be taken as representative of it.

Host One-Sided RMA API

Introduces one-sided ncclPut, ncclSignal and ncclWaitSignal APIs. Please refer to src/nccl.h.in for API definitions.
The one-sided API supports both intra-node (NVL) and inter-node (network) operations.
The implementation of the one-sided APIs does not use any SMs (zero-SM).
Put/WaitSignal over the network can achieve ~25% latency reduction compared to send/recv.
The API signatures, functionality and final performance are subject to change for the official release.

Support Requirements

The one-sided API requires NCCL window registration of the user buffers.
To enable the one-sided API over the network, set:
- NCCL_GIN_TYPE=2
- NCCL_NET=IB
The current one-sided APIs do not work with CUDA graph capture.
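
For applications that cannot rely on the launch environment, the two settings above can also be applied from the host program. The following is a minimal sketch, assuming NCCL reads these variables when the communicator is initialized, so the call has to happen before ncclCommInitRankConfig. The helper name enable_one_sided_over_network is hypothetical and not part of NCCL; exporting the variables in the shell is equivalent.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes setenv() under strict C modes */
#endif
#include <stdlib.h>

/* Hypothetical helper (not an NCCL API): sets the two environment variables
 * listed in the support requirements. The values come from this description;
 * the assumption is that NCCL picks them up at communicator initialization.
 * Existing values are overwritten. Returns 0 on success, -1 on failure. */
static int enable_one_sided_over_network(void) {
    if (setenv("NCCL_GIN_TYPE", "2", 1) != 0) return -1;
    if (setenv("NCCL_NET", "IB", 1) != 0) return -1;
    return 0;
}
```

Call the helper once per process, early in main and before any NCCL initialization, and treat a non-zero return as a setup error.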

Example

The example below demonstrates a ping-pong communication pattern between two ranks using ncclPut and ncclWaitSignal.

    // Configure NCCL with one-sided RMA support
    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
    config.numRmaCtx = 1;  // Enable RMA with 1 context
    config.blocking = 1;
    NCCLCHECK(ncclCommInitRankConfig(&comm, nRanks, id, myRank, &config));
    int ctx = 0;  // Use context 0 for RMA operations

    // Allocate symmetric memory for RMA operations
    void *sendbuff, *recvbuff;
    NCCLCHECK(ncclMemAlloc((void**)&sendbuff, size));
    NCCLCHECK(ncclMemAlloc((void**)&recvbuff, size));

    // Register both send and receive buffers as symmetric windows for RMA operations
    ncclWindow_t sendWindow, recvWindow;
    NCCLCHECK(ncclCommWindowRegister(comm, sendbuff, args.end_size, &sendWindow, NCCL_WIN_COLL_SYMMETRIC));
    NCCLCHECK(ncclCommWindowRegister(comm, recvbuff, args.end_size, &recvWindow, NCCL_WIN_COLL_SYMMETRIC));

    // Ensure all ranks have completed window registration before proceeding
    MPICHECK(MPI_Barrier(MPI_COMM_WORLD));

    if (myRank == 0) {
      // Rank 0: wait then put
      // Wait for signal from peer
      int nsignals = 1;
      int peer = 1;
      NCCLCHECK(ncclWaitSignal(ctx, &peer, &nsignals, 1, NCCL_SIGNAL_DISTINCT, comm, stream));
      // Put data with signal to peer's receive buffer
      NCCLCHECK(ncclPut(ctx, sendbuff, nelems, ncclInt, peer, 0, recvWindow, NCCL_SIGNAL_DISTINCT, comm, stream));
    } else {
      // Rank 1: put then wait
      // Put data with signal to peer's receive buffer
      int peer = 0;
      NCCLCHECK(ncclPut(ctx, sendbuff, nelems, ncclInt, peer, 0, recvWindow, NCCL_SIGNAL_DISTINCT, comm, stream));
      // Wait for signal from peer
      int nsignals = 1;
      NCCLCHECK(ncclWaitSignal(ctx, &peer, &nsignals, 1, NCCL_SIGNAL_DISTINCT, comm, stream));
    }

    NCCLCHECK(ncclCommWindowDeregister(comm, sendWindow));
    NCCLCHECK(ncclCommWindowDeregister(comm, recvWindow));
    NCCLCHECK(ncclMemFree(sendbuff));
    NCCLCHECK(ncclMemFree(recvbuff));
    CUDACHECK(cudaStreamDestroy(stream));
    NCCLCHECK(ncclCommFinalize(comm));
    NCCLCHECK(ncclCommDestroy(comm));
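
To make the ordering in the ping-pong example easier to follow, here is a toy, host-only model of the put-with-signal handshake. It is not NCCL code and does not reflect the real implementation: toy_rank_t, toy_put, toy_try_wait_signal and toy_ping_pong are hypothetical names, and the model assumes that each put delivers one signal to the target and that a wait consumes the signals it was waiting for. A "try" variant stands in for the blocking wait so the sketch can run in a single thread.

```c
#include <stddef.h>
#include <string.h>

#define TOY_NELEMS 4

/* Hypothetical stand-in for one rank's state. */
typedef struct {
    int recv[TOY_NELEMS];  /* plays the role of the registered receive window */
    int pending_signals;   /* signals delivered but not yet consumed */
} toy_rank_t;

/* Toy put-with-signal: copy into the peer's window at offset 0, then raise
 * one signal on the peer. Returns 0 on success, -1 if the data does not fit. */
static int toy_put(toy_rank_t *peer, const int *src, size_t nelems) {
    if (nelems > TOY_NELEMS) return -1;
    memcpy(peer->recv, src, nelems * sizeof(int));
    peer->pending_signals += 1;
    return 0;
}

/* Toy wait: returns 1 and consumes nsignals if they have arrived, or 0 if a
 * real (blocking) wait would still be stalled at this point. */
static int toy_try_wait_signal(toy_rank_t *self, int nsignals) {
    if (self->pending_signals < nsignals) return 0;
    self->pending_signals -= nsignals;
    return 1;
}

/* One ping-pong round in the same order as the example: rank 0 waits then
 * puts, rank 1 puts then waits. Returns 0 if every step behaved as expected. */
static int toy_ping_pong(toy_rank_t *rank0, toy_rank_t *rank1) {
    const int ping[TOY_NELEMS] = {10, 20, 30, 40};
    int pong[TOY_NELEMS];

    memset(rank0, 0, sizeof *rank0);
    memset(rank1, 0, sizeof *rank1);

    /* Rank 0 waits first; nothing has arrived yet, so it would stall here. */
    if (toy_try_wait_signal(rank0, 1) != 0) return -1;

    /* Rank 1 puts into rank 0's window, which also signals rank 0. */
    if (toy_put(rank0, ping, TOY_NELEMS) != 0) return -1;

    /* Rank 0's wait can now complete, and the data is in its window. */
    if (toy_try_wait_signal(rank0, 1) != 1) return -1;

    /* Rank 0 replies with each received element incremented by one. */
    for (size_t i = 0; i < TOY_NELEMS; ++i) pong[i] = rank0->recv[i] + 1;
    if (toy_put(rank1, pong, TOY_NELEMS) != 0) return -1;

    /* Rank 1's wait completes once the reply has been signalled. */
    if (toy_try_wait_signal(rank1, 1) != 1) return -1;
    return 0;
}
```

In this model, if both ranks waited before putting, neither signal would ever be raised; having one rank put first is what lets the round complete.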

NCCL Put API Preview

430081a

xiaofanl-nvidia changed the title ~~NCCL Put API Preview~~ [Feature Preview] Introduce one-sided API with ncclPut Nov 8, 2025
