feat(nvproxy): support nvidia-container-runtime csv mode #12794
Draft
a7i wants to merge 1 commit into google:master
Treat GPU detection and legacy hook replication separately: run host prep whenever GPU is requested from the OCI spec, run nvidia-container-cli configure only for the legacy prestart-hook path, synthesize sentry /dev/nvidia* only when spec lacks /dev/nvidiactl, and skip CDI-era NVIDIA prestart hooks (nvidia-cdi-hook, nvidia-ctk, nvidia-container-toolkit). Covers CSV/CDI specs that inject Linux.Devices and mounts without nvidia-container-runtime-hook.
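The separation described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in this PR: `Spec`, `Hook`, `Plan`, and `planGPUSetup` are hypothetical simplified stand-ins for the OCI runtime-spec types and the `specutils` helpers, GPU detection is reduced to "legacy hook present or `/dev/nvidiactl` in `Linux.Devices`", and nvproxy is assumed to be enabled.

```go
package main

import "path/filepath"

// Hook and Spec are hypothetical, simplified stand-ins for the OCI
// runtime-spec types; only the fields needed for the decision are kept.
type Hook struct{ Path string }

type Spec struct {
	PrestartHooks []Hook
	Devices       []string // device paths taken from Linux.Devices
}

// Plan records which nvproxy setup steps would run for a given spec.
type Plan struct {
	HostPrep          bool   // host prep before the gofer starts
	CLIConfigure      bool   // nvidia-container-cli configure
	SynthesizeDevices bool   // create sentry /dev/nvidia* nodes
	HooksToRun        []Hook // prestart hooks left after skipping NVIDIA ones
}

const legacyHook = "nvidia-container-runtime-hook"

// cdiEraHooks lists the hook basenames named in the PR description.
var cdiEraHooks = map[string]bool{
	"nvidia-cdi-hook":          true,
	"nvidia-ctk":               true,
	"nvidia-container-toolkit": true,
}

func hasLegacyHook(s Spec) bool {
	for _, h := range s.PrestartHooks {
		if filepath.Base(h.Path) == legacyHook {
			return true
		}
	}
	return false
}

func hasNvidiactl(s Spec) bool {
	for _, d := range s.Devices {
		if d == "/dev/nvidiactl" {
			return true
		}
	}
	return false
}

// planGPUSetup keeps GPU detection and legacy hook replication separate.
// Assumption: a GPU counts as requested if either signal is present.
func planGPUSetup(s Spec) Plan {
	legacy := hasLegacyHook(s)
	ctl := hasNvidiactl(s)
	requested := legacy || ctl

	p := Plan{
		HostPrep:          requested,         // any GPU request
		CLIConfigure:      legacy,            // legacy hook path only
		SynthesizeDevices: requested && !ctl, // spec did not inject /dev/nvidiactl
	}
	for _, h := range s.PrestartHooks {
		base := filepath.Base(h.Path)
		if base == legacyHook || cdiEraHooks[base] {
			continue // NVIDIA hooks are skipped rather than executed
		}
		p.HooksToRun = append(p.HooksToRun, h)
	}
	return p
}
```

Under these assumptions, a CSV/CDI-style spec that lists `/dev/nvidiactl` in `Devices` and carries no legacy hook gets host prep only, while a legacy-hook spec gets all three steps.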
Summary
nvproxy previously tied host prep, `nvidia-container-cli configure`, and synthetic `/dev/nvidia*` creation to the presence of `nvidia-container-runtime-hook`. CSV mode (and JIT CDI) removes that hook and injects devices/mounts via the OCI spec instead, so those steps were skipped.

This change:
- Runs host prep (`nvProxyPreGoferHostSetup`) whenever `GPUFunctionalityRequested` returns true (including when `/dev/nvidiactl` is in `Linux.Devices`).
- Runs `nvidia-container-cli configure` only on the legacy hook path (`GPUFunctionalityNeedsNvidiaContainerCLIConfigure`).
- Synthesizes sentry `/dev/nvidia*` only when the spec lacks `/dev/nvidiactl`.
- Skips the CDI-era NVIDIA prestart hooks `nvidia-cdi-hook`, `nvidia-ctk`, and `nvidia-container-toolkit` (same rationale as the legacy hook).

How to test locally
Unit tests (Linux x86_64/arm64 recommended)

    bazel test //runsc/specutils:specutils_test --test_output=errors

On macOS, the full gVisor build may fail on unrelated Darwin issues (`O_LARGEFILE`, etc.); use Linux or the project CI.

Manual GPU / CSV smoke test (Linux host with NVIDIA driver + toolkit)
1. Build `runsc` with nvproxy (from repo root):

       make build TARGETS=runsc:runsc
       # or: bazel build //runsc:runsc

2. Configure NVIDIA runtime (`/etc/nvidia-container-runtime/config.toml`):
   - Set `mode = "csv"` (or `auto` if it selects CSV on your platform, e.g. some Jetson/Tegra setups).
   - Under `[nvidia-container-runtime]`, set `runtimes` so the first entry is your `runsc` wrapper, e.g. a script that runs:
3. Run a GPU container via the NVIDIA shim (not plain `runsc` alone), so the spec is modified:

   Or with Docker using NVIDIA as default runtime (see NVIDIA runtime README for csv vs `--gpus`).
4. Confirm: container starts, `nvidia-smi` or a CUDA sample runs, and debug logs show no failure from skipped NVIDIA hooks / duplicate device setup.

Risk
Low — scoped to nvproxy detection, hook skipping, and docs; behavior unchanged for legacy hook path.
Related
`nvidia-container-toolkit` CSV → CDI spec injection.