nvproxy: Fixes performance degradation for GPU→CPU pageable transfer bandwidth#12805

Open
luiscape wants to merge 1 commit into google:master from luiscape:master

Conversation

luiscape (Contributor) commented Mar 27, 2026

Fixes #12804

Add `madvise(MADV_NOHUGEPAGE)` on the DMA region in `rmAllocOSDescriptor()`, right before the existing `MADV_POPULATE_WRITE` call.

This is a single `madvise` syscall that:

1. **Tells the kernel** to use 4 KB pages for this specific region, preemptively splitting any existing THP backing.
2. **Prevents future THP formation** in this region, so `pin_user_pages()` never encounters compound pages.
3. **Has no effect on the rest of the Sentry's memory** — application heap, stacks, and other mappings continue to benefit from THP for fast page faults.

I think this is a good approach because:

- The fix applies only to memory regions that will be passed to the nvidia driver for DMA page pinning (`NV01_MEMORY_SYSTEM_OS_DESCRIPTOR` allocations). These are the only regions where THP causes a problem, and they represent a tiny fraction of total Sentry memory.

- The fix works regardless of the host's `shmem_enabled` setting. When `shmem_enabled=never`, the `MADV_NOHUGEPAGE` is a harmless no-op (the pages are already 4 KB). When `shmem_enabled=always`, it prevents the regression. Operators don't need to choose between fast page faults and fast GPU transfers.

- The code already calls `MADV_POPULATE_WRITE` (added to avoid `mmap_lock` contention during `pin_user_pages()`). The `MADV_NOHUGEPAGE` call slots in naturally before it: first disable THP for the region, then pre-fault the 4 KB pages, then let the nvidia driver pin them without any compound page overhead.

- The kernel documentation for `MADV_NOHUGEPAGE` specifically calls out that it is useful for memory regions where huge pages cause performance regressions due to splitting overhead, which is exactly this case.

| Configuration | d2h pageable 64 MB | d2h pageable 256 MB | d2h pageable 1 GB |
|---|---|---|---|
| runc (control) | 10.65 GB/s | 10.79 GB/s | 10.82 GB/s |
| gVisor (broken, `shmem_enabled=always`) | 5.06 GB/s | 5.11 GB/s | 5.11 GB/s |
| **gVisor (fixed, `shmem_enabled=always`)** | **10.55 GB/s** | **10.53 GB/s** | **10.72 GB/s** |

The fix fully recovers the lost bandwidth while preserving THP benefits for general workloads.

luiscape (Contributor, Author) commented:

@ayushr2 would love your input on this one. We found this to be affecting many of our user workloads.

@ayushr2 ayushr2 requested a review from nixprime March 27, 2026 19:00