Conversation
Command Bot: Processing...
CUDA Quantum Docs Bot: A preview of the documentation can be found here.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Disable GCC 13's AMX tile intrinsics via -mno-amx-tile when building with CUDA 12.x on x86_64, fixing nvcc errors for undefined __builtin_ia32_ldtilecfg/__builtin_ia32_sttilecfg in system headers. Also quote CUDA_NATIVE_ARCH export to prevent semicolons from being interpreted as shell command separators. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Move set(CMAKE_CUDA_FLAGS) after enable_language(CUDA) so that the CUDAFLAGS environment variable is properly picked up by CMake. Without this, the env var was silently ignored because a normal CMake variable shadowed the cache entry before it was initialized. In the CI, export CUDAFLAGS="-Xcompiler -mno-amx-tile" for CUDA 12.x on x86_64 to work around nvcc not supporting GCC 13's AMX tile intrinsics. Also quote the CUDA_NATIVE_ARCH export to prevent semicolons from acting as shell command separators. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Revert the CMakeLists.txt reorder (set CMAKE_CUDA_FLAGS must stay before enable_language(CUDA)) because moving it after caused CMake to add a default CMAKE_CUDA_ARCHITECTURES (sm_52), breaking compilation of DOCA headers that require atomicCAS_block (compute capability 7.0+). Switch from CUDAFLAGS to NVCC_PREPEND_FLAGS environment variable, which is read directly by nvcc at invocation time and bypasses CMake's variable initialization entirely. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
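The two shell-level fixes described in this commit can be sketched as below. This is a hedged illustration, not the actual CI script: `NVCC_PREPEND_FLAGS` is a real environment variable read by nvcc itself at invocation time, while the `CUDA_NATIVE_ARCH` name and its example value are taken from this PR's context and are otherwise assumptions.

```shell
# nvcc reads NVCC_PREPEND_FLAGS directly, so this bypasses CMake's
# CUDA flag initialization entirely.
export NVCC_PREPEND_FLAGS='-Xcompiler -mno-amx-tile'

# Quote the value: unquoted, each ';' would end the command and the shell
# would try to run "80" and "90" as commands. Arch list is hypothetical.
export CUDA_NATIVE_ARCH="75;80;90"

echo "$NVCC_PREPEND_FLAGS"
echo "$CUDA_NATIVE_ARCH"
```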
Install gcc-12/g++-12 and set CC, CXX, CUDAHOSTCXX for the CUDA 12.x x86_64 build to avoid GCC 13's AMX tile intrinsics that nvcc 12.6 cannot parse. Also keep the CUDA_NATIVE_ARCH quoting fix. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
} cudaq_tx_status_t;

// RPC wire-format constants — single source of truth in rpc_wire_format.h.
#include "cudaq/realtime/daemon/dispatcher/rpc_wire_format.h"
Seems like a weird place to #include a file. Why does it need to be here?
typedef enum {
  CUDAQ_BACKEND_DEVICE_KERNEL = 0,
  CUDAQ_BACKEND_HOST_LOOP = 1
} cudaq_backend_t;
Can we find better names for these? "backend" is already used in CUDA-Q, and means something quite different.
Would it be fair to say that these are distinguishing between having the control path on the GPU vs the CPU? That is, CUDAQ_BACKEND_DEVICE_KERNEL indicates that the GPU is controlling the NIC, while CUDAQ_BACKEND_HOST_LOOP indicates that the CPU is controlling the NIC?
Renamed:
cudaq_backend_t → cudaq_dispatch_path_t
CUDAQ_BACKEND_DEVICE_KERNEL → CUDAQ_DISPATCH_PATH_DEVICE
CUDAQ_BACKEND_HOST_LOOP → CUDAQ_DISPATCH_PATH_HOST
.backend field → .dispatch_path
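The rename can be sketched as a minimal C enum; the comments reflect the GPU-vs-CPU control-path distinction raised in the review. The doc-comment wording is an assumption, not copied from the actual header.

```c
#include <assert.h>

/* Sketch of the renamed enum: the names now say which processor runs the
 * control path (i.e. which side drives the NIC), instead of overloading
 * "backend", which already means something else in CUDA-Q. */
typedef enum {
  CUDAQ_DISPATCH_PATH_DEVICE = 0, /* a GPU kernel controls the NIC */
  CUDAQ_DISPATCH_PATH_HOST = 1    /* a CPU dispatch loop controls the NIC */
} cudaq_dispatch_path_t;
```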
// RPC framing magic values — sourced from rpc_wire_format.h.
constexpr std::uint32_t RPC_MAGIC_REQUEST = CUDAQ_RPC_MAGIC_REQUEST;
constexpr std::uint32_t RPC_MAGIC_RESPONSE = CUDAQ_RPC_MAGIC_RESPONSE;
Do we need this indirection?
The rpc_wire_format.h header defines the magic values as C preprocessor macros so they work in both C and CUDA device code. The constexpr wrappers here give C++ code typed constants within the cudaq::realtime namespace, which avoids macro pollution and plays nicely with templates. The macros remain the single source of truth — these just forward them into C++ land.
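The macro-plus-constexpr pattern described here can be sketched as follows. The magic values below are placeholders, not the real wire-format constants; only the constexpr names match the PR.

```cpp
#include <cstdint>

// Placeholder values standing in for rpc_wire_format.h. Plain macros work
// in C, C++, and CUDA device code alike.
#define CUDAQ_RPC_MAGIC_REQUEST 0x52455155u
#define CUDAQ_RPC_MAGIC_RESPONSE 0x52455350u

namespace cudaq::realtime {
// Typed, namespaced constants for C++ callers. The macros stay the single
// source of truth; these merely forward them into C++ land, avoiding macro
// pollution and playing nicely with templates.
constexpr std::uint32_t RPC_MAGIC_REQUEST = CUDAQ_RPC_MAGIC_REQUEST;
constexpr std::uint32_t RPC_MAGIC_RESPONSE = CUDAQ_RPC_MAGIC_RESPONSE;
} // namespace cudaq::realtime
```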
typedef struct {
  void *rx_flags; ///< opaque cuda::std::atomic<uint64_t>*
  void *tx_flags; ///< opaque cuda::std::atomic<uint64_t>*
  uint8_t *rx_data_host;
  uint8_t *rx_data_dev;
  uint8_t *tx_data_host;
  uint8_t *tx_data_dev;
  size_t tx_stride_sz;
  void **h_mailbox_bank;
  size_t num_slots;
  size_t slot_size;
  cudaq_host_dispatch_worker_t *workers;
  size_t num_workers;
  /// Host-visible function table for lookup by function_id (GRAPH_LAUNCH only;
  /// others dropped).
  cudaq_function_entry_t *function_table;
  size_t function_table_count;
  void *shutdown_flag; ///< opaque cuda::std::atomic<int>*
  uint64_t *stats_counter;
  void *live_dispatched; ///< opaque cuda::std::atomic<uint64_t>*
  void *idle_mask; ///< opaque cuda::std::atomic<uint64_t>*, 1=free 0=busy
  int *inflight_slot_tags; ///< worker_id -> origin FPGA slot for tx_flags
                           ///< routing

  /// Device view of tx_flags (needed for GraphIOContext.tx_flag).
  /// NULL when tx_flags is already a device-accessible pointer.
  volatile uint64_t *tx_flags_dev;

  /// Per-worker GraphIOContext array for separate RX/TX buffer support.
  /// When non-NULL, launch_graph_worker fills a GraphIOContext per dispatch
  /// and writes its device address into h_mailbox_bank[worker_id].
  /// When NULL, legacy mode: raw RX slot pointer written to mailbox.
  void *io_ctxs_host; ///< host view of GraphIOContext[num_workers]
  void *io_ctxs_dev;  ///< device view of same pinned mapped memory
} cudaq_host_dispatcher_config_t;
This struct is odd. We already have cudaq_dispatcher_t and cudaq_dispatcher_config_t. This one seems like a redundant flattening of both.
Renamed to cudaq_host_dispatch_loop_ctx_t to clarify this isn't a user-facing config — it's an internal runtime context captured by value into the dispatch loop thread. It bundles a subset of fields from the ringbuffer, dispatcher config, and function table, plus host-specific runtime state (workers, idle_mask, io_ctxs_*, etc.) that doesn't belong in the public API structs. The device kernel path passes these as individual kernel arguments; the host path needs a struct because std::thread captures a single lambda value. Composing the public types would add unused fields and force verbose access patterns with no real benefit.
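The "single struct captured by value into the loop thread" point can be sketched in plain C++ with simplified types. The struct and function names below are illustrative stand-ins, not the real runtime's API.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Stand-in for cudaq_host_dispatch_loop_ctx_t: std::thread takes a single
// callable, so the loop's inputs are bundled into one context that the
// lambda captures by value.
struct dispatch_loop_ctx {
  std::atomic<int> *shutdown_flag; // shared runtime state, not owned
  int num_workers;
};

inline int run_dispatch_loop(dispatch_loop_ctx ctx) {
  int iterations = 0;
  while (!ctx.shutdown_flag->load(std::memory_order_acquire)) {
    ++iterations; // real loop: poll rx_flags, dispatch to workers, ...
  }
  return iterations;
}

inline void start_loop_demo(std::atomic<int> &stop) {
  dispatch_loop_ctx ctx{&stop, 4};
  std::thread t([ctx] { run_dispatch_loop(ctx); }); // ctx copied by value
  stop.store(1, std::memory_order_release);
  t.join();
}
```

The device kernel path has no such constraint because each field can simply be a separate kernel argument.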
uint8_t *rx_data_dev;
uint8_t *tx_data_host;
uint8_t *tx_data_dev;
size_t tx_stride_sz;
I'm not sure we actually need this struct (see the other comment), but if we do, are we assuming tx_stride_sz == rx_stride_sz? The cudaq_ringbuffer_t struct keeps them separate; shouldn't that be the case here too?
slot_size serves as the RX stride — in a contiguous ring buffer, slot capacity equals stride. tx_stride_sz is a separate field because the GraphIOContext path supports TX slots with a different size than RX. So we're not assuming they're equal; they're intentionally tracked independently. Renamed the struct to _ctx_t to make it clearer this is an internal runtime context, not a mirror of cudaq_ringbuffer_t.
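The addressing described here can be sketched as two small helpers; these are illustrative, not the runtime's actual accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// In a contiguous ring buffer, RX slot capacity equals the RX stride, so
// slot_size doubles as the stride on the receive side.
inline std::uint8_t *rx_slot(std::uint8_t *rx_data, std::size_t slot_size,
                             std::size_t i) {
  return rx_data + i * slot_size;
}

// TX slots may be a different size (GraphIOContext path), so their stride
// is tracked independently rather than reusing slot_size.
inline std::uint8_t *tx_slot(std::uint8_t *tx_data, std::size_t tx_stride_sz,
                             std::size_t i) {
  return tx_data + i * tx_stride_sz;
}
```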
host_config.num_workers = num_workers;
host_config.function_table = table->entries;
host_config.function_table_count = table->count;
host_config.shutdown_flag = (void *)(uintptr_t)shutdown_flag;
Shouldn't host_config.shutdown_flag be backed by a cuda::std::atomic<int>? Here it is taking a volatile int, so the underlying storage won't be an atomic. Perhaps this is not a big issue for a shutdown flag, but different writers and readers will see this differently.
The volatile int* is intentional here — this is a C API boundary (extern "C") so we need plain C types for ABI stability. Internally, the dispatch loop casts it to cuda::std::atomic* to get acquire semantics on the reader side. This is safe because cuda::std::atomic is lock-free and layout-compatible with int on all CUDA-supported platforms. The writer (caller) only ever does a single store of 1 to signal shutdown, so no atomic RMW or complex protocol is needed from the C side. Added a comment at the cast site to document this.
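The boundary described here can be sketched in plain C++, with std::atomic standing in for cuda::std::atomic. The function names are hypothetical; the layout-compatibility claim is checked with static_asserts rather than assumed silently.

```cpp
#include <atomic>
#include <cassert>

// C side of the ABI boundary: plain types only, and the writer performs a
// single plain store of 1, so no atomic RMW is needed here.
extern "C" void request_shutdown(volatile int *flag) { *flag = 1; }

// C++ reader side: reinterpret the same storage as an atomic to get
// acquire semantics, relying on the atomic being lock-free and
// layout-compatible with int.
inline bool shutdown_requested(volatile int *flag) {
  auto *a = reinterpret_cast<std::atomic<int> *>(const_cast<int *>(flag));
  return a->load(std::memory_order_acquire) != 0;
}

static_assert(std::atomic<int>::is_always_lock_free,
              "aliasing a plain int requires a lock-free atomic");
static_assert(sizeof(std::atomic<int>) == sizeof(int),
              "layout-compatibility assumption behind the cast");
```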
GCC 12's headers also contain intrinsic builtins (AVX512-BF16, AMX) that CUDA 12.6's nvcc cannot parse. GCC 11 has none of these issues and matches the gcc-toolset-11 used in the Dockerfile-based build. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
All GCC versions on Ubuntu 24.04 include AMX tile intrinsic headers that CUDA 12.x's nvcc cannot parse. Rather than swapping compilers, pass -mno-amx-tile to the host compiler via --compiler-options in CMakeLists.txt on x86_64 to suppress the offending builtins. Remove the now-unnecessary GCC 11 installation from CI. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The existing -mno-amx-tile workaround is insufficient with GCC 13+ because the AMX intrinsic headers use #pragma GCC target("amx-tile") to force-enable builtins that nvcc cannot parse. Pre-define the include guards for all five GCC AMX headers so they are skipped entirely during nvcc host compilation. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
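A hedged CMake sketch of this final workaround, not the exact CMakeLists.txt change. The guard macro names follow GCC's usual `_<HEADER>INTRIN_H_INCLUDED` convention for x86 intrinsic headers; verify them against the exact headers your toolchain ships before relying on this.

```cmake
if(CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64")
  # Skip GCC's AMX intrinsic headers outright: their
  # #pragma GCC target("amx-tile") re-enables builtins that nvcc
  # cannot parse, so -mno-amx-tile alone is not enough on GCC 13+.
  add_compile_definitions(
    _AMXTILEINTRIN_H_INCLUDED
    _AMXINT8INTRIN_H_INCLUDED
    _AMXBF16INTRIN_H_INCLUDED
    _AMXFP16INTRIN_H_INCLUDED
    _AMXCOMPLEXINTRIN_H_INCLUDED)
  # Belt and suspenders: also keep AMX codegen off in the host compiler.
  string(APPEND CMAKE_CUDA_FLAGS " --compiler-options -mno-amx-tile")
endif()
```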
Rename cudaq_backend_t to cudaq_dispatch_path_t to avoid conflicting with existing CUDA-Q "backend" terminology. Move rpc_wire_format.h include to the top of cudaq_realtime.h. Add comment documenting the volatile int* to cuda::std::atomic<int>* cast for shutdown_flag (C ABI boundary). Fail explicitly when cudaHostGetDevicePointer fails instead of silently falling back to the wrong buffer path. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The struct is an internal runtime context captured by the dispatch loop thread, not a user-facing configuration. The old name suggested it was a variant of cudaq_dispatcher_config_t, which caused confusion in review. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Embed cudaq_ringbuffer_t, cudaq_dispatcher_config_t, and cudaq_function_table_t as members instead of flattening their fields. This eliminates field duplication, makes the data provenance clear, and simplifies construction to struct copies. Host-specific runtime state (workers, idle_mask, io_ctxs, etc.) remains as direct fields. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
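The composition described in this commit can be sketched as below. The embedded member types are heavily simplified stand-ins, not the real definitions; only the outer struct's name and shape follow the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real public structs. */
typedef struct { size_t num_slots, slot_size; } cudaq_ringbuffer_t;
typedef struct { int dispatch_path; } cudaq_dispatcher_config_t;
typedef struct { size_t count; } cudaq_function_table_t;

/* Embedding the public structs whole (rather than copying their fields out
 * one by one) keeps provenance obvious and makes construction three struct
 * copies; host-only runtime state stays as direct fields. */
typedef struct {
  cudaq_ringbuffer_t ring;           /* embedded, not flattened */
  cudaq_dispatcher_config_t config;
  cudaq_function_table_t functions;
  size_t num_workers;                /* host-specific runtime state */
} cudaq_host_dispatch_loop_ctx_t;
```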
Summary
This PR refines the host-side dispatcher backend and adds end-to-end test coverage for the GRAPH_LAUNCH dispatch path.
- Restrict host dispatcher to GRAPH_LAUNCH only — The host loop now only dispatches `GRAPH_LAUNCH` entries; `HOST_CALL` and `DEVICE_CALL` slots are dropped (cleared and advanced). Removes the unused `dispatch_host_call` path and updates comments/headers to reflect the GRAPH_LAUNCH-only design.
- Add host dispatcher tests + external mailbox support — New test file `test_host_dispatcher.cu` with two tests: one uses an unknown `function_id` and verifies the slot is silently dropped; the other runs on `{0,1,2,3}` and asserts the graph produces `{1,2,3,4}` in-place.
- New C API: `cudaq_dispatcher_set_mailbox` — Lets callers provide a caller-managed pinned (`cudaHostAllocMapped`) mailbox. This is required for GRAPH_LAUNCH because the graph must be captured with the device-side mailbox pointer before the dispatcher starts, and the internal allocation (plain `new`) is not device-visible. When no external mailbox is provided, the C API falls back to internal allocation (backward compatible).
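The ownership rule behind the new API can be sketched in plain C++ (no CUDA calls, so the graph capture itself is only indicated by a comment). All names below are illustrative stand-ins for the real C API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for the dispatcher: it either borrows a caller-managed mailbox
// or owns an internal one, never both.
struct dispatcher {
  void **mailbox = nullptr;           // active mailbox bank
  std::unique_ptr<void *[]> internal; // owned only in fallback mode
  bool external = false;
};

// Analog of cudaq_dispatcher_set_mailbox: the caller keeps ownership and
// must keep the (pinned, mapped) buffer alive for the dispatcher's lifetime.
inline void set_mailbox(dispatcher &d, void **caller_mailbox) {
  d.mailbox = caller_mailbox;
  d.external = true;
}

inline void start(dispatcher &d, std::size_t num_workers) {
  if (!d.external) { // backward-compatible fallback: internal allocation
    d.internal = std::make_unique<void *[]>(num_workers);
    d.mailbox = d.internal.get();
  }
  // Real code would capture the CUDA graph here against the device-visible
  // mailbox pointer, which is why GRAPH_LAUNCH needs the external path.
}
```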