
Add realtime host dispatcher #4180

Open

wsttiger wants to merge 12 commits into NVIDIA:main from wsttiger:cudaq_realtime_host_dispatcher_sandbox

Conversation

@wsttiger

Summary

This PR refines the host-side dispatcher backend and adds end-to-end test coverage for the GRAPH_LAUNCH dispatch path.

Restrict host dispatcher to GRAPH_LAUNCH only — The host loop now only dispatches GRAPH_LAUNCH entries; HOST_CALL and DEVICE_CALL slots are dropped (cleared and advanced). Removes the unused dispatch_host_call path and updates comments/headers to reflect the GRAPH_LAUNCH-only design.

Add host dispatcher tests + external mailbox support — New test file test_host_dispatcher.cu with two tests:

  • Smoke test: starts the host loop via the C API, sends an RPC with an unknown function_id, and verifies the slot is silently dropped.
  • GRAPH_LAUNCH round-trip: full end-to-end test through the C API — allocates a pinned mailbox, captures an increment graph, wires the dispatcher, sends an RPC {0,1,2,3}, and asserts the graph produces {1,2,3,4} in-place.

New C API: cudaq_dispatcher_set_mailbox — Lets callers provide a caller-managed pinned (cudaHostAllocMapped) mailbox. This is required for GRAPH_LAUNCH because the graph must be captured with the device-side mailbox pointer before the dispatcher starts, and the internal allocation (plain new) is not device-visible. When no external mailbox is provided, the C API falls back to internal allocation (backward compatible).
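The mailbox requirement above can be sketched end to end. This is a minimal sketch under assumptions: only `cudaq_dispatcher_set_mailbox` is named in this PR, so its call shape here is hypothetical, and the graph capture and dispatcher wiring are elided.

```cuda
#include <cuda_runtime.h>

int main() {
  // Pinned, mapped allocation: host-visible and device-visible views of
  // the same memory, unlike a plain `new`, which the device cannot see.
  void *mailbox_host = nullptr;
  cudaHostAlloc(&mailbox_host, sizeof(void *), cudaHostAllocMapped);

  void *mailbox_dev = nullptr;
  cudaHostGetDevicePointer(&mailbox_dev, mailbox_host, 0);

  // The increment graph must be captured against mailbox_dev *before*
  // the dispatcher starts (capture elided here).

  // Hypothetical call shape; only the function name appears in this PR:
  // cudaq_dispatcher_set_mailbox(dispatcher, mailbox_host);

  cudaFreeHost(mailbox_host);
  return 0;
}
```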


copy-pr-bot bot commented Mar 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wsttiger changed the title from "Cudaq realtime host dispatcher sandbox" to "Add realtime host dispatcher" on Mar 18, 2026
bmhowe23 (Collaborator) commented Mar 18, 2026

/ok to test 6e48865

Command Bot: Processing...

@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

wsttiger (Author) commented Mar 23, 2026

/ok to test 4f3a446

Command Bot: Processing...

Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Disable GCC 13's AMX tile intrinsics via -mno-amx-tile when building with CUDA 12.x on x86_64, fixing nvcc errors for undefined __builtin_ia32_ldtilecfg/__builtin_ia32_sttilecfg in system headers. Also quote CUDA_NATIVE_ARCH export to prevent semicolons from being interpreted as shell command separators. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Move set(CMAKE_CUDA_FLAGS) after enable_language(CUDA) so that the CUDAFLAGS environment variable is properly picked up by CMake. Without this, the env var was silently ignored because a normal CMake variable shadowed the cache entry before it was initialized. In the CI, export CUDAFLAGS="-Xcompiler -mno-amx-tile" for CUDA 12.x on x86_64 to work around nvcc not supporting GCC 13's AMX tile intrinsics. Also quote the CUDA_NATIVE_ARCH export to prevent semicolons from acting as shell command separators. Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Revert the CMakeLists.txt reorder (set CMAKE_CUDA_FLAGS must stay before enable_language(CUDA)) because moving it after caused CMake to add a default CMAKE_CUDA_ARCHITECTURES (sm_52), breaking compilation of DOCA headers that require atomicCAS_block (compute capability 7.0+). Switch from CUDAFLAGS to NVCC_PREPEND_FLAGS environment variable, which is read directly by nvcc at invocation time and bypasses CMake's variable initialization entirely. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Install gcc-12/g++-12 and set CC, CXX, CUDAHOSTCXX for the CUDA 12.x x86_64 build to avoid GCC 13's AMX tile intrinsics that nvcc 12.6 cannot parse. Also keep the CUDA_NATIVE_ARCH quoting fix. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
@boschmitt (Collaborator) left a comment

Nice work @wsttiger,

We also need documentation about this feature.

} cudaq_tx_status_t;

// RPC wire-format constants — single source of truth in rpc_wire_format.h.
#include "cudaq/realtime/daemon/dispatcher/rpc_wire_format.h"
Collaborator

Seems like a weird place to #include a file. Why does it need to be here?

Author

fixed

Comment on lines +32 to +35
typedef enum {
CUDAQ_BACKEND_DEVICE_KERNEL = 0,
CUDAQ_BACKEND_HOST_LOOP = 1
} cudaq_backend_t;
Collaborator

Can we find better names for these? "backend" is already used in CUDA-Q, and means something quite different.

Would it be fair to say that these are distinguishing between having the control path on the GPU vs. the CPU? That is, CUDAQ_BACKEND_DEVICE_KERNEL indicates that the GPU is controlling the NIC, while CUDAQ_BACKEND_HOST_LOOP indicates that the CPU is controlling the NIC?

Author

Renamed:

  • cudaq_backend_t → cudaq_dispatch_path_t
  • CUDAQ_BACKEND_DEVICE_KERNEL → CUDAQ_DISPATCH_PATH_DEVICE
  • CUDAQ_BACKEND_HOST_LOOP → CUDAQ_DISPATCH_PATH_HOST
  • .backend field → .dispatch_path

Comment on lines +80 to +82
// RPC framing magic values — sourced from rpc_wire_format.h.
constexpr std::uint32_t RPC_MAGIC_REQUEST = CUDAQ_RPC_MAGIC_REQUEST;
constexpr std::uint32_t RPC_MAGIC_RESPONSE = CUDAQ_RPC_MAGIC_RESPONSE;
Collaborator

Do we need this indirection?

Author

The rpc_wire_format.h header defines the magic values as C preprocessor macros so they work in both C and CUDA device code. The constexpr wrappers here give C++ code typed constants within the cudaq::realtime namespace, which avoids macro pollution and plays nicely with templates. The macros remain the single source of truth — these just forward them into C++ land.
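The forwarding pattern described in this reply can be sketched as below; the numeric values are illustrative placeholders, not the actual wire constants.

```cuda
// --- rpc_wire_format.h (sketch): macros work in C, C++, and device code.
#define CUDAQ_RPC_MAGIC_REQUEST 0xC0DA0001u  // illustrative value
#define CUDAQ_RPC_MAGIC_RESPONSE 0xC0DA0002u // illustrative value

// --- C++ side: typed constants in the cudaq::realtime namespace; the
// macros above remain the single source of truth.
#include <cstdint>
namespace cudaq::realtime {
constexpr std::uint32_t RPC_MAGIC_REQUEST = CUDAQ_RPC_MAGIC_REQUEST;
constexpr std::uint32_t RPC_MAGIC_RESPONSE = CUDAQ_RPC_MAGIC_RESPONSE;
} // namespace cudaq::realtime
```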

Comment on lines +44 to +78
typedef struct {
void *rx_flags; ///< opaque cuda::std::atomic<uint64_t>*
void *tx_flags; ///< opaque cuda::std::atomic<uint64_t>*
uint8_t *rx_data_host;
uint8_t *rx_data_dev;
uint8_t *tx_data_host;
uint8_t *tx_data_dev;
size_t tx_stride_sz;
void **h_mailbox_bank;
size_t num_slots;
size_t slot_size;
cudaq_host_dispatch_worker_t *workers;
size_t num_workers;
/// Host-visible function table for lookup by function_id (GRAPH_LAUNCH only;
/// others dropped).
cudaq_function_entry_t *function_table;
size_t function_table_count;
void *shutdown_flag; ///< opaque cuda::std::atomic<int>*
uint64_t *stats_counter;
void *live_dispatched; ///< opaque cuda::std::atomic<uint64_t>*
void *idle_mask; ///< opaque cuda::std::atomic<uint64_t>*, 1=free 0=busy
int *inflight_slot_tags; ///< worker_id -> origin FPGA slot for tx_flags
///< routing

/// Device view of tx_flags (needed for GraphIOContext.tx_flag).
/// NULL when tx_flags is already a device-accessible pointer.
volatile uint64_t *tx_flags_dev;

/// Per-worker GraphIOContext array for separate RX/TX buffer support.
/// When non-NULL, launch_graph_worker fills a GraphIOContext per dispatch
/// and writes its device address into h_mailbox_bank[worker_id].
/// When NULL, legacy mode: raw RX slot pointer written to mailbox.
void *io_ctxs_host; ///< host view of GraphIOContext[num_workers]
void *io_ctxs_dev; ///< device view of same pinned mapped memory
} cudaq_host_dispatcher_config_t;
Collaborator

This struct is odd. We already have cudaq_dispatcher_t and cudaq_dispatcher_config_t. This one seems like a redundant flattening of both.

Author

Renamed to cudaq_host_dispatch_loop_ctx_t to clarify this isn't a user-facing config — it's an internal runtime context captured by value into the dispatch loop thread. It bundles a subset of fields from the ringbuffer, dispatcher config, and function table, plus host-specific runtime state (workers, idle_mask, io_ctxs_*, etc.) that doesn't belong in the public API structs. The device kernel path passes these as individual kernel arguments; the host path needs a struct because std::thread captures a single lambda value. Composing the public types would add unused fields and force verbose access patterns with no real benefit.
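The std::thread point in this reply can be illustrated with a minimal sketch; the struct fields and function name are illustrative, not the actual layout.

```cuda
#include <cstddef>
#include <thread>

// Stand-in for the internal runtime context (fields illustrative).
struct loop_ctx_t {
  std::size_t num_slots;
  volatile int *shutdown_flag;
};

std::thread start_host_loop(loop_ctx_t ctx) {
  // std::thread takes a single callable, so the whole context is
  // captured by value; the thread then owns its own copy.
  return std::thread([ctx] {
    while (*ctx.shutdown_flag == 0) {
      // poll slots, dispatch GRAPH_LAUNCH entries ...
    }
  });
}
```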

uint8_t *rx_data_dev;
uint8_t *tx_data_host;
uint8_t *tx_data_dev;
size_t tx_stride_sz;
Collaborator

I'm not sure if we actually need this struct (see my other comment), but if we do, are we assuming tx_stride_sz == rx_stride_sz? The cudaq_ringbuffer_t struct keeps them separate; shouldn't that be the case here too?

Author

slot_size serves as the RX stride — in a contiguous ring buffer, slot capacity equals stride. tx_stride_sz is a separate field because the GraphIOContext path supports TX slots with a different size than RX. So we're not assuming they're equal; they're intentionally tracked independently. Renamed the struct to _ctx_t to make it clearer this is an internal runtime context, not a mirror of cudaq_ringbuffer_t.
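The stride distinction in this reply amounts to the following addressing; the helper names are illustrative, not part of the PR.

```cuda
#include <cstddef>
#include <cstdint>

// RX slots are contiguous, so slot capacity doubles as the RX stride.
inline uint8_t *rx_slot(uint8_t *rx_data, size_t slot_size, size_t i) {
  return rx_data + i * slot_size;
}

// TX slots may be a different size, so their stride is tracked separately.
inline uint8_t *tx_slot(uint8_t *tx_data, size_t tx_stride_sz, size_t i) {
  return tx_data + i * tx_stride_sz;
}
```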

host_config.num_workers = num_workers;
host_config.function_table = table->entries;
host_config.function_table_count = table->count;
host_config.shutdown_flag = (void *)(uintptr_t)shutdown_flag;
Collaborator

Shouldn't host_config.shutdown_flag be backed by a cuda::std::atomic<int>? Here it is taking a volatile int *, and thus the underlying storage won't be an atomic. Perhaps this is not a big issue for a shutdown flag, but different writers and readers will be seeing this memory differently.

Author

The volatile int* is intentional here — this is a C API boundary (extern "C") so we need plain C types for ABI stability. Internally, the dispatch loop casts it to cuda::std::atomic* to get acquire semantics on the reader side. This is safe because cuda::std::atomic is lock-free and layout-compatible with int on all CUDA-supported platforms. The writer (caller) only ever does a single store of 1 to signal shutdown, so no atomic RMW or complex protocol is needed from the C side. Added a comment at the cast site to document this.
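The cast described in this reply can be sketched as follows, under the stated assumption that cuda::std::atomic<int> is lock-free and layout-compatible with int; the function name is hypothetical.

```cuda
#include <cuda/std/atomic>

// The C ABI carries a plain `volatile int *`; the reader recovers acquire
// semantics by viewing the same storage as cuda::std::atomic<int>.
inline bool should_shut_down(volatile int *shutdown_flag) {
  auto *flag = reinterpret_cast<cuda::std::atomic<int> *>(
      const_cast<int *>(shutdown_flag));
  return flag->load(cuda::std::memory_order_acquire) != 0;
}
```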

GCC 12's headers also contain intrinsic builtins (AVX512-BF16, AMX) that CUDA 12.6's nvcc cannot parse. GCC 11 has none of these issues and matches the gcc-toolset-11 used in the Dockerfile-based build. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
All GCC versions on Ubuntu 24.04 include AMX tile intrinsic headers that CUDA 12.x's nvcc cannot parse. Rather than swapping compilers, pass -mno-amx-tile to the host compiler via --compiler-options in CMakeLists.txt on x86_64 to suppress the offending builtins. Remove the now-unnecessary GCC 11 installation from CI. Signed-off-by: Scott Thornton <wsttiger@gmail.com>

The existing -mno-amx-tile workaround is insufficient with GCC 13+ because the AMX intrinsic headers use #pragma GCC target("amx-tile") to force-enable builtins that nvcc cannot parse. Pre-define the include guards for all five GCC AMX headers so they are skipped entirely during nvcc host compilation. Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Rename cudaq_backend_t to cudaq_dispatch_path_t to avoid conflicting with existing CUDA-Q "backend" terminology. Move rpc_wire_format.h include to the top of cudaq_realtime.h. Add comment documenting the volatile int* to cuda::std::atomic<int>* cast for shutdown_flag (C ABI boundary). Fail explicitly when cudaHostGetDevicePointer fails instead of silently falling back to the wrong buffer path. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The struct is an internal runtime context captured by the dispatch loop thread, not a user-facing configuration. The old name suggested it was a variant of cudaq_dispatcher_config_t, which caused confusion in review. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Embed cudaq_ringbuffer_t, cudaq_dispatcher_config_t, and cudaq_function_table_t as members instead of flattening their fields. This eliminates field duplication, makes the data provenance clear, and simplifies construction to struct copies. Host-specific runtime state (workers, idle_mask, io_ctxs, etc.) remains as direct fields. Signed-off-by: Scott Thornton <wsttiger@gmail.com>


Labels: None yet

5 participants