feat(nvproxy): support nvidia-container-runtime csv mode #12794

Draft
a7i wants to merge 1 commit into google:master from a7i:feat/nvproxy-csv-mode

@a7i a7i commented Mar 25, 2026

Summary

nvproxy previously tied host prep, nvidia-container-cli configure, and synthetic /dev/nvidia* creation to the presence of nvidia-container-runtime-hook. CSV mode (and JIT CDI) removes that hook and injects devices/mounts via the OCI spec instead, so those steps were skipped.

This change:

  • Runs host prep (nvProxyPreGoferHostSetup) whenever GPUFunctionalityRequested is true, which now includes specs that list /dev/nvidiactl in Linux.Devices.
  • Runs nvidia-container-cli configure only on the legacy hook path (GPUFunctionalityNeedsNvidiaContainerCLIConfigure).
  • Creates synthetic sentry device nodes only when the spec does not already list /dev/nvidiactl.
  • Skips prestart hooks: nvidia-cdi-hook, nvidia-ctk, nvidia-container-toolkit (same rationale as the legacy hook).
  • Updates GPU user guide to document CSV mode support.
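
The bullets above amount to a small decision table. The following is a minimal sketch of that table, not the code in this PR: Spec, Hook, Plan, planFor, and shouldSkipHook are hypothetical stand-ins for the real OCI spec types and the specutils helpers, and GPU detection is reduced to two signals (the legacy hook and /dev/nvidiactl in Linux.Devices). The sketch also assumes the legacy hook itself stays on the skip list.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Hook and Spec are simplified, hypothetical stand-ins for the OCI runtime
// spec types; only the fields needed for the decision are modeled.
type Hook struct{ Path string }

type Spec struct {
	PrestartHooks []Hook
	Devices       []string // device paths listed under Linux.Devices
}

const legacyHookName = "nvidia-container-runtime-hook"

// skippedHookNames lists the hook binaries named in the PR description,
// plus the legacy hook (assumed here to be skipped as before).
var skippedHookNames = map[string]bool{
	legacyHookName:             true,
	"nvidia-cdi-hook":          true,
	"nvidia-ctk":               true,
	"nvidia-container-toolkit": true,
}

// hasLegacyHook reports whether the spec carries the legacy prestart hook.
func hasLegacyHook(s Spec) bool {
	for _, h := range s.PrestartHooks {
		if filepath.Base(h.Path) == legacyHookName {
			return true
		}
	}
	return false
}

// listsNvidiactl reports whether the spec already injects /dev/nvidiactl.
func listsNvidiactl(s Spec) bool {
	for _, d := range s.Devices {
		if d == "/dev/nvidiactl" {
			return true
		}
	}
	return false
}

// Plan records which of the three setup steps would run.
type Plan struct {
	HostPrep         bool // host prep before the gofer starts
	CLIConfigure     bool // nvidia-container-cli configure
	SyntheticDevices bool // synthetic sentry /dev/nvidia* nodes
}

// planFor applies the rules from the bullets above to a spec.
func planFor(s Spec) Plan {
	legacy := hasLegacyHook(s)
	inSpec := listsNvidiactl(s)
	requested := legacy || inSpec // simplified GPU detection
	return Plan{
		HostPrep:         requested,
		CLIConfigure:     legacy,
		SyntheticDevices: requested && !inSpec,
	}
}

// shouldSkipHook reports whether a prestart hook is on the skip list.
func shouldSkipHook(h Hook) bool {
	return skippedHookNames[filepath.Base(h.Path)]
}

func main() {
	csv := Spec{Devices: []string{"/dev/nvidiactl", "/dev/nvidia0"}}
	legacy := Spec{PrestartHooks: []Hook{{Path: "/usr/bin/nvidia-container-runtime-hook"}}}
	fmt.Printf("csv:    %+v\n", planFor(csv))
	fmt.Printf("legacy: %+v\n", planFor(legacy))
}
```

Keeping the three decisions as separate booleans mirrors the intent of the change: a CSV/CDI spec gets host prep but neither the CLI configure step nor synthetic device nodes, while the legacy path keeps all three.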

How to test locally

Unit tests (Linux x86_64/arm64 recommended)

bazel test //runsc/specutils:specutils_test --test_output=errors

On macOS, the full gVisor build may fail on unrelated Darwin issues (O_LARGEFILE, etc.); use Linux or the project CI.

Manual GPU / CSV smoke test (Linux host with NVIDIA driver + toolkit)

  1. Build runsc with nvproxy (from repo root):

    make build TARGETS=runsc:runsc # or: bazel build //runsc:runsc
  2. Configure NVIDIA runtime (/etc/nvidia-container-runtime/config.toml):

    • Set mode = "csv" (or auto if it selects CSV on your platform, e.g. some Jetson/Tegra setups).

    • Under [nvidia-container-runtime], set runtimes so the first entry is your runsc wrapper, e.g. a script that runs:

      exec /path/to/runsc --nvproxy "$@"
  3. Run a GPU container via the NVIDIA shim (not plain runsc alone), so the spec is modified:

    sudo nvidia-container-runtime run --bundle /path/to/bundle <container-id>

    Or with Docker using NVIDIA as default runtime (see NVIDIA runtime README for csv vs --gpus).

  4. Confirm: container starts, nvidia-smi or a CUDA sample runs, and debug logs show no failure from skipped NVIDIA hooks / duplicate device setup.
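
When step 4 fails, it can help to check what the NVIDIA shim actually put into the spec. The helper below is a hypothetical sketch, not part of this PR: it assumes the modified spec is available as an OCI config.json file, and it reads only the two fields relevant here (linux.devices[].path and hooks.prestart[].path). The names ociSpec and inspect are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// ociSpec models only the parts of an OCI config.json that matter for this
// check; all other fields are ignored by the decoder.
type ociSpec struct {
	Hooks *struct {
		Prestart []struct {
			Path string `json:"path"`
		} `json:"prestart"`
	} `json:"hooks"`
	Linux *struct {
		Devices []struct {
			Path string `json:"path"`
		} `json:"devices"`
	} `json:"linux"`
}

// inspect reports whether /dev/nvidiactl is listed under linux.devices and
// returns the paths of all prestart hooks found in the spec.
func inspect(data []byte) (hasNvidiactl bool, prestart []string, err error) {
	var s ociSpec
	if err := json.Unmarshal(data, &s); err != nil {
		return false, nil, err
	}
	if s.Linux != nil {
		for _, d := range s.Linux.Devices {
			if d.Path == "/dev/nvidiactl" {
				hasNvidiactl = true
			}
		}
	}
	if s.Hooks != nil {
		for _, h := range s.Hooks.Prestart {
			prestart = append(prestart, h.Path)
		}
	}
	return hasNvidiactl, prestart, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: inspect /path/to/bundle/config.json")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	hasCtl, hooks, err := inspect(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("/dev/nvidiactl in linux.devices:", hasCtl)
	fmt.Println("prestart hooks:", hooks)
}
```

If /dev/nvidiactl is listed and nvidia-container-runtime-hook is absent from the prestart hooks, the spec is on the path this change targets.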

Risk

Low: scoped to nvproxy detection, hook skipping, and docs; behavior is unchanged for the legacy hook path.

Related

  • Plan: CSV mode aligns with upstream nvidia-container-toolkit CSV → CDI spec injection.
  • Public docs still say legacy-only until this merges and the site is republished.

Treat GPU detection and legacy hook replication separately: run host prep whenever a GPU is requested in the OCI spec, run nvidia-container-cli configure only on the legacy prestart-hook path, synthesize sentry /dev/nvidia* nodes only when the spec lacks /dev/nvidiactl, and skip the CDI-era NVIDIA prestart hooks (nvidia-cdi-hook, nvidia-ctk, nvidia-container-toolkit). This covers CSV/CDI specs that inject Linux.Devices and mounts without nvidia-container-runtime-hook.