
Add realtime host dispatcher #4180

Open

wsttiger wants to merge 12 commits into NVIDIA:main from wsttiger:cudaq_realtime_host_dispatcher_sandbox

Conversation

@wsttiger

Summary

This PR refines the host-side dispatcher backend and adds end-to-end test coverage for the GRAPH_LAUNCH dispatch path.

Restrict host dispatcher to GRAPH_LAUNCH only — The host loop now only dispatches GRAPH_LAUNCH entries; HOST_CALL and DEVICE_CALL slots are dropped (cleared and advanced). Removes the unused dispatch_host_call path and updates comments/headers to reflect the GRAPH_LAUNCH-only design.

Add host dispatcher tests + external mailbox support — New test file test_host_dispatcher.cu with two tests:

  • Smoke test: starts the host loop via the C API, sends an RPC with an unknown function_id, and verifies the slot is silently dropped.
  • GRAPH_LAUNCH round-trip: full end-to-end test through the C API — allocates a pinned mailbox, captures an increment graph, wires the dispatcher, sends an RPC {0,1,2,3}, and asserts the graph produces {1,2,3,4} in-place.

New C API: cudaq_dispatcher_set_mailbox — Lets callers provide a caller-managed pinned (cudaHostAllocMapped) mailbox. This is required for GRAPH_LAUNCH because the graph must be captured with the device-side mailbox pointer before the dispatcher starts, and the internal allocation (plain new) is not device-visible. When no external mailbox is provided, the C API falls back to internal allocation (backward compatible).
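The mailbox requirement above can be sketched end to end. This is a minimal sketch under assumptions: only `cudaq_dispatcher_set_mailbox` is named in this PR, so its call shape here is hypothetical, and the graph capture and dispatcher wiring are elided.

```cuda
#include <cuda_runtime.h>

int main() {
  // Pinned, mapped allocation: host-visible and device-visible views of
  // the same memory, unlike a plain `new`, which the device cannot see.
  void *mailbox_host = nullptr;
  cudaHostAlloc(&mailbox_host, sizeof(void *), cudaHostAllocMapped);

  void *mailbox_dev = nullptr;
  cudaHostGetDevicePointer(&mailbox_dev, mailbox_host, 0);

  // The increment graph must be captured against mailbox_dev *before*
  // the dispatcher starts (capture elided here).

  // Hypothetical call shape; only the function name appears in this PR:
  // cudaq_dispatcher_set_mailbox(dispatcher, mailbox_host);

  cudaFreeHost(mailbox_host);
  return 0;
}
```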


copy-pr-bot bot commented Mar 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wsttiger changed the title from "Cudaq realtime host dispatcher sandbox" to "Add realtime host dispatcher" on Mar 18, 2026
bmhowe23 (Collaborator) commented Mar 18, 2026

/ok to test 6e48865

Command Bot: Processing...

@github-actions

CUDA Quantum Docs Bot: A preview of the documentation can be found here.

wsttiger (Author) commented Mar 23, 2026

/ok to test 4f3a446

Command Bot: Processing...

Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Disable GCC 13's AMX tile intrinsics via -mno-amx-tile when building with CUDA 12.x on x86_64, fixing nvcc errors for undefined __builtin_ia32_ldtilecfg/__builtin_ia32_sttilecfg in system headers. Also quote CUDA_NATIVE_ARCH export to prevent semicolons from being interpreted as shell command separators. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Move set(CMAKE_CUDA_FLAGS) after enable_language(CUDA) so that the CUDAFLAGS environment variable is properly picked up by CMake. Without this, the env var was silently ignored because a normal CMake variable shadowed the cache entry before it was initialized. In the CI, export CUDAFLAGS="-Xcompiler -mno-amx-tile" for CUDA 12.x on x86_64 to work around nvcc not supporting GCC 13's AMX tile intrinsics. Also quote the CUDA_NATIVE_ARCH export to prevent semicolons from acting as shell command separators. Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Revert the CMakeLists.txt reorder (set CMAKE_CUDA_FLAGS must stay before enable_language(CUDA)) because moving it after caused CMake to add a default CMAKE_CUDA_ARCHITECTURES (sm_52), breaking compilation of DOCA headers that require atomicCAS_block (compute capability 7.0+). Switch from CUDAFLAGS to NVCC_PREPEND_FLAGS environment variable, which is read directly by nvcc at invocation time and bypasses CMake's variable initialization entirely. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Install gcc-12/g++-12 and set CC, CXX, CUDAHOSTCXX for the CUDA 12.x x86_64 build to avoid GCC 13's AMX tile intrinsics that nvcc 12.6 cannot parse. Also keep the CUDA_NATIVE_ARCH quoting fix. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
@boschmitt (Collaborator) left a comment

Nice work @wsttiger,

We also need documentation about this feature.

} cudaq_tx_status_t;

// RPC wire-format constants — single source of truth in rpc_wire_format.h.
#include "cudaq/realtime/daemon/dispatcher/rpc_wire_format.h"
Collaborator

Seems like a weird place to #include a file. Why does it need to be here?

Author

fixed

Comment on lines +32 to +35
typedef enum {
CUDAQ_BACKEND_DEVICE_KERNEL = 0,
CUDAQ_BACKEND_HOST_LOOP = 1
} cudaq_backend_t;
Collaborator

Can we find better names for these? "backend" is already used in CUDA-Q, and means something quite different.

Would it be fair to say that these are distinguishing between having the control path on the GPU vs. the CPU? That is, CUDAQ_BACKEND_DEVICE_KERNEL indicates that the GPU is controlling the NIC, while CUDAQ_BACKEND_HOST_LOOP indicates that the CPU is controlling the NIC?

Author

Renamed:

  • cudaq_backend_t → cudaq_dispatch_path_t
  • CUDAQ_BACKEND_DEVICE_KERNEL → CUDAQ_DISPATCH_PATH_DEVICE
  • CUDAQ_BACKEND_HOST_LOOP → CUDAQ_DISPATCH_PATH_HOST
  • .backend field → .dispatch_path

Comment on lines +80 to +82
// RPC framing magic values — sourced from rpc_wire_format.h.
constexpr std::uint32_t RPC_MAGIC_REQUEST = CUDAQ_RPC_MAGIC_REQUEST;
constexpr std::uint32_t RPC_MAGIC_RESPONSE = CUDAQ_RPC_MAGIC_RESPONSE;
Collaborator

Do we need this indirection?

Author

The rpc_wire_format.h header defines the magic values as C preprocessor macros so they work in both C and CUDA device code. The constexpr wrappers here give C++ code typed constants within the cudaq::realtime namespace, which avoids macro pollution and plays nicely with templates. The macros remain the single source of truth — these just forward them into C++ land.
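The forwarding pattern described in this reply can be sketched as below; the numeric values are illustrative placeholders, not the actual wire constants.

```cuda
// --- rpc_wire_format.h (sketch): macros work in C, C++, and device code.
#define CUDAQ_RPC_MAGIC_REQUEST 0xC0DA0001u  // illustrative value
#define CUDAQ_RPC_MAGIC_RESPONSE 0xC0DA0002u // illustrative value

// --- C++ side: typed constants in the cudaq::realtime namespace; the
// macros above remain the single source of truth.
#include <cstdint>
namespace cudaq::realtime {
constexpr std::uint32_t RPC_MAGIC_REQUEST = CUDAQ_RPC_MAGIC_REQUEST;
constexpr std::uint32_t RPC_MAGIC_RESPONSE = CUDAQ_RPC_MAGIC_RESPONSE;
} // namespace cudaq::realtime
```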

Comment on lines +44 to +78
typedef struct {
void *rx_flags; ///< opaque cuda::std::atomic<uint64_t>*
void *tx_flags; ///< opaque cuda::std::atomic<uint64_t>*
uint8_t *rx_data_host;
uint8_t *rx_data_dev;
uint8_t *tx_data_host;
uint8_t *tx_data_dev;
size_t tx_stride_sz;
void **h_mailbox_bank;
size_t num_slots;
size_t slot_size;
cudaq_host_dispatch_worker_t *workers;
size_t num_workers;
/// Host-visible function table for lookup by function_id (GRAPH_LAUNCH only;
/// others dropped).
cudaq_function_entry_t *function_table;
size_t function_table_count;
void *shutdown_flag; ///< opaque cuda::std::atomic<int>*
uint64_t *stats_counter;
void *live_dispatched; ///< opaque cuda::std::atomic<uint64_t>*
void *idle_mask; ///< opaque cuda::std::atomic<uint64_t>*, 1=free 0=busy
int *inflight_slot_tags; ///< worker_id -> origin FPGA slot for tx_flags
///< routing

/// Device view of tx_flags (needed for GraphIOContext.tx_flag).
/// NULL when tx_flags is already a device-accessible pointer.
volatile uint64_t *tx_flags_dev;

/// Per-worker GraphIOContext array for separate RX/TX buffer support.
/// When non-NULL, launch_graph_worker fills a GraphIOContext per dispatch
/// and writes its device address into h_mailbox_bank[worker_id].
/// When NULL, legacy mode: raw RX slot pointer written to mailbox.
void *io_ctxs_host; ///< host view of GraphIOContext[num_workers]
void *io_ctxs_dev; ///< device view of same pinned mapped memory
} cudaq_host_dispatcher_config_t;
Collaborator

This struct is odd. We already have cudaq_dispatcher_t and cudaq_dispatcher_config_t. This one seems like a redundant flattening of both.

Author

Renamed to cudaq_host_dispatch_loop_ctx_t to clarify this isn't a user-facing config — it's an internal runtime context captured by value into the dispatch loop thread. It bundles a subset of fields from the ringbuffer, dispatcher config, and function table, plus host-specific runtime state (workers, idle_mask, io_ctxs_*, etc.) that doesn't belong in the public API structs. The device kernel path passes these as individual kernel arguments; the host path needs a struct because std::thread captures a single lambda value. Composing the public types would add unused fields and force verbose access patterns with no real benefit.
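The std::thread point in this reply can be illustrated with a minimal sketch; the struct fields and function name are illustrative, not the actual layout.

```cuda
#include <cstddef>
#include <thread>

// Stand-in for the internal runtime context (fields illustrative).
struct loop_ctx_t {
  std::size_t num_slots;
  volatile int *shutdown_flag;
};

std::thread start_host_loop(loop_ctx_t ctx) {
  // std::thread takes a single callable, so the whole context is
  // captured by value; the thread then owns its own copy.
  return std::thread([ctx] {
    while (*ctx.shutdown_flag == 0) {
      // poll slots, dispatch GRAPH_LAUNCH entries ...
    }
  });
}
```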

uint8_t *rx_data_dev;
uint8_t *tx_data_host;
uint8_t *tx_data_dev;
size_t tx_stride_sz;
Collaborator

I'm not sure if we actually need this struct (see my other comment), but if we do, are we assuming tx_stride_sz == rx_stride_sz? The cudaq_ringbuffer_t struct keeps them separate; shouldn't that be the case here too?

Author

slot_size serves as the RX stride — in a contiguous ring buffer, slot capacity equals stride. tx_stride_sz is a separate field because the GraphIOContext path supports TX slots with a different size than RX. So we're not assuming they're equal; they're intentionally tracked independently. Renamed the struct to _ctx_t to make it clearer this is an internal runtime context, not a mirror of cudaq_ringbuffer_t.
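The stride distinction in this reply amounts to the following addressing; the helper names are illustrative, not part of the PR.

```cuda
#include <cstddef>
#include <cstdint>

// RX slots are contiguous, so slot capacity doubles as the RX stride.
inline uint8_t *rx_slot(uint8_t *rx_data, size_t slot_size, size_t i) {
  return rx_data + i * slot_size;
}

// TX slots may be a different size, so their stride is tracked separately.
inline uint8_t *tx_slot(uint8_t *tx_data, size_t tx_stride_sz, size_t i) {
  return tx_data + i * tx_stride_sz;
}
```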

host_config.num_workers = num_workers;
host_config.function_table = table->entries;
host_config.function_table_count = table->count;
host_config.shutdown_flag = (void *)(uintptr_t)shutdown_flag;
Collaborator

Shouldn't host_config.shutdown_flag be backed by a cuda::std::atomic<int>? Here it is taking a volatile int *, and thus the underlying storage won't be an atomic. Perhaps this is not a big issue for a shutdown flag, but different writers and readers will be seeing this memory differently.

Author

The volatile int* is intentional here — this is a C API boundary (extern "C") so we need plain C types for ABI stability. Internally, the dispatch loop casts it to cuda::std::atomic* to get acquire semantics on the reader side. This is safe because cuda::std::atomic is lock-free and layout-compatible with int on all CUDA-supported platforms. The writer (caller) only ever does a single store of 1 to signal shutdown, so no atomic RMW or complex protocol is needed from the C side. Added a comment at the cast site to document this.
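The cast described in this reply can be sketched as follows, under the stated assumption that cuda::std::atomic<int> is lock-free and layout-compatible with int; the function name is hypothetical.

```cuda
#include <cuda/std/atomic>

// The C ABI carries a plain `volatile int *`; the reader recovers acquire
// semantics by viewing the same storage as cuda::std::atomic<int>.
inline bool should_shut_down(volatile int *shutdown_flag) {
  auto *flag = reinterpret_cast<cuda::std::atomic<int> *>(
      const_cast<int *>(shutdown_flag));
  return flag->load(cuda::std::memory_order_acquire) != 0;
}
```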

GCC 12's headers also contain intrinsic builtins (AVX512-BF16, AMX) that CUDA 12.6's nvcc cannot parse. GCC 11 has none of these issues and matches the gcc-toolset-11 used in the Dockerfile-based build. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
All GCC versions on Ubuntu 24.04 include AMX tile intrinsic headers that CUDA 12.x's nvcc cannot parse. Rather than swapping compilers, pass -mno-amx-tile to the host compiler via --compiler-options in CMakeLists.txt on x86_64 to suppress the offending builtins. Remove the now-unnecessary GCC 11 installation from CI. Signed-off-by: Scott Thornton <wsttiger@gmail.com>

The existing -mno-amx-tile workaround is insufficient with GCC 13+ because the AMX intrinsic headers use #pragma GCC target("amx-tile") to force-enable builtins that nvcc cannot parse. Pre-define the include guards for all five GCC AMX headers so they are skipped entirely during nvcc host compilation. Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Rename cudaq_backend_t to cudaq_dispatch_path_t to avoid conflicting with existing CUDA-Q "backend" terminology. Move rpc_wire_format.h include to the top of cudaq_realtime.h. Add comment documenting the volatile int* to cuda::std::atomic<int>* cast for shutdown_flag (C ABI boundary). Fail explicitly when cudaHostGetDevicePointer fails instead of silently falling back to the wrong buffer path. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The struct is an internal runtime context captured by the dispatch loop thread, not a user-facing configuration. The old name suggested it was a variant of cudaq_dispatcher_config_t, which caused confusion in review. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Embed cudaq_ringbuffer_t, cudaq_dispatcher_config_t, and cudaq_function_table_t as members instead of flattening their fields. This eliminates field duplication, makes the data provenance clear, and simplifies construction to struct copies. Host-specific runtime state (workers, idle_mask, io_ctxs, etc.) remains as direct fields. Signed-off-by: Scott Thornton <wsttiger@gmail.com>


Labels: None yet

5 participants