
@marksantesson
Collaborator

This is an Early Access release. It contains new features and APIs that may change before the official General Access release. In particular, the host one-sided API is not frozen. The release may also show performance degradation that is not representative of the General Access release.

Host One-Sided RMA API

  • Introduces one-sided ncclPut, ncclSignal and ncclWaitSignal APIs. Please refer to src/nccl.h.in for API definitions.
  • The one-sided API supports both intra-node (NVL) and inter-node (network) operations.
  • The implementation of the one-sided APIs does not use any SMs (zero-SM).
  • Put/WaitSignal over the network can achieve ~25% latency reduction compared to send/recv.
  • The API signatures, functionality and final performance are subject to change for the official release.

Support Requirements

  • The one-sided API requires NCCL window registration of the user buffers.
  • To enable the one-sided API over the network, set:
    • NCCL_GIN_TYPE=2
    • NCCL_NET=IB
  • The current one-sided APIs do not work with CUDA graph capture.
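
The two network settings above are normally exported in the launch environment. As a minimal sketch, they can also be set from the host program before the communicator is created. This assumes NCCL reads them at initialization time, as it does for its other environment variables. The helper names enable_one_sided_over_network and one_sided_env_is_set are hypothetical and not part of NCCL.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an NCCL API): sets the two variables listed in the
 * Support Requirements. Values already exported by the user are kept, because
 * the last argument to setenv (overwrite) is 0. Returns 0 on success. */
static int enable_one_sided_over_network(void) {
  if (setenv("NCCL_GIN_TYPE", "2", 0) != 0) return -1;
  if (setenv("NCCL_NET", "IB", 0) != 0) return -1;
  return 0;
}

/* Hypothetical helper: returns 1 if both variables hold the values the
 * Support Requirements ask for, 0 otherwise. */
static int one_sided_env_is_set(void) {
  const char *gin = getenv("NCCL_GIN_TYPE");
  const char *net = getenv("NCCL_NET");
  return gin != NULL && net != NULL &&
         strcmp(gin, "2") == 0 && strcmp(net, "IB") == 0;
}
```

Call enable_one_sided_over_network() on every rank before ncclCommInitRankConfig. Because existing values are not overwritten, one_sided_env_is_set() can be used afterwards to detect a conflicting setting inherited from the launch environment.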

Example

The example below demonstrates a ping-pong communication pattern between two ranks using ncclPut and ncclWaitSignal.

    // Assumes comm, id, nRanks, myRank, size, nelems and stream are set up earlier (not shown).

    // Configure NCCL with one-sided RMA support
    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
    config.numRmaCtx = 1;  // Enable RMA with 1 context
    config.blocking = 1;
    NCCLCHECK(ncclCommInitRankConfig(&comm, nRanks, id, myRank, &config));
    int ctx = 0;  // Use context 0 for RMA operations

    // Allocate symmetric memory for RMA operations
    void *sendbuff, *recvbuff;
    NCCLCHECK(ncclMemAlloc((void**)&sendbuff, size));
    NCCLCHECK(ncclMemAlloc((void**)&recvbuff, size));

    // Register both send and receive buffers as symmetric windows for RMA operations
    // (registered with the same size used for allocation)
    ncclWindow_t sendWindow, recvWindow;
    NCCLCHECK(ncclCommWindowRegister(comm, sendbuff, size, &sendWindow, NCCL_WIN_COLL_SYMMETRIC));
    NCCLCHECK(ncclCommWindowRegister(comm, recvbuff, size, &recvWindow, NCCL_WIN_COLL_SYMMETRIC));

    // Ensure all ranks have completed window registration before proceeding
    MPICHECK(MPI_Barrier(MPI_COMM_WORLD));

    if (myRank == 0) {
      // Rank 0: wait then put
      // Wait for signal from peer
      int nsignals = 1;
      int peer = 1;
      NCCLCHECK(ncclWaitSignal(ctx, &peer, &nsignals, 1, NCCL_SIGNAL_DISTINCT, comm, stream));
      // Put data with signal to peer's receive buffer
      NCCLCHECK(ncclPut(ctx, sendbuff, nelems, ncclInt, peer, 0, recvWindow, NCCL_SIGNAL_DISTINCT, comm, stream));
    } else {
      // Rank 1: put then wait
      // Put data with signal to peer's receive buffer
      int peer = 0;
      NCCLCHECK(ncclPut(ctx, sendbuff, nelems, ncclInt, peer, 0, recvWindow, NCCL_SIGNAL_DISTINCT, comm, stream));
      // Wait for signal from peer
      int nsignals = 1;
      NCCLCHECK(ncclWaitSignal(ctx, &peer, &nsignals, 1, NCCL_SIGNAL_DISTINCT, comm, stream));
    }

    // Release windows, buffers, stream and communicator
    NCCLCHECK(ncclCommWindowDeregister(comm, sendWindow));
    NCCLCHECK(ncclCommWindowDeregister(comm, recvWindow));
    NCCLCHECK(ncclMemFree(sendbuff));
    NCCLCHECK(ncclMemFree(recvbuff));
    CUDACHECK(cudaStreamDestroy(stream));
    NCCLCHECK(ncclCommFinalize(comm));
    NCCLCHECK(ncclCommDestroy(comm));

@xiaofanl-nvidia changed the title from "NCCL Put API Preview" to "[Feature Preview] Introduce one-sided API with ncclPut" on Nov 8, 2025