nvproxy: Fixes performance degradation for GPU→CPU pageable transfer bandwidth by luiscape · Pull Request #12805 · google/gvisor

luiscape · 2026-03-27T17:02:02Z

Add `madvise(MADV_NOHUGEPAGE)` on the DMA region in `rmAllocOSDescriptor()`, right before the existing `MADV_POPULATE_WRITE` call.

This is a single `madvise` syscall that:

1. **Tells the kernel** to use 4 KB pages for this specific region, preemptively splitting any existing THP backing.
2. **Prevents future THP formation** in this region, so `pin_user_pages()` never encounters compound pages.
3. **Has no effect on the rest of the Sentry's memory**: application heap, stacks, and other mappings continue to benefit from THP for fast page faults.

I think this is a good approach because:

- The fix applies only to memory regions that will be passed to the NVIDIA driver for DMA page pinning (`NV01_MEMORY_SYSTEM_OS_DESCRIPTOR` allocations). These are the only regions where THP causes a problem, and they represent a tiny fraction of total Sentry memory.
- The fix works regardless of the host's `shmem_enabled` setting. When `shmem_enabled=never`, the `MADV_NOHUGEPAGE` call is a harmless no-op (the pages are already 4 KB). When `shmem_enabled=always`, it prevents the regression. Operators don't need to choose between fast page faults and fast GPU transfers.
- The code already calls `MADV_POPULATE_WRITE` (added to avoid `mmap_lock` contention during `pin_user_pages()`). The `MADV_NOHUGEPAGE` call slots in naturally before it: first disable THP for the region, then pre-fault the 4 KB pages, then let the NVIDIA driver pin them without any compound page overhead.
- The kernel documentation for `MADV_NOHUGEPAGE` specifically calls out that it is useful for memory regions where huge pages cause performance regressions due to splitting overhead, which is exactly this case.

| Configuration | d2h pageable 64 MB | d2h pageable 256 MB | d2h pageable 1 GB |
|---|---|---|---|
| runc (control) | 10.65 GB/s | 10.79 GB/s | 10.82 GB/s |
| gVisor (broken, `shmem_enabled=always`) | 5.06 GB/s | 5.11 GB/s | 5.11 GB/s |
| **gVisor (fixed, `shmem_enabled=always`)** | **10.55 GB/s** | **10.53 GB/s** | **10.72 GB/s** |

The fix restores d2h pageable bandwidth to within about 2.5% of runc at every tested size, while preserving THP benefits for general workloads.
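
To put the table in relative terms, the short Go program below expresses each gVisor measurement as a percentage of the runc control at the same transfer size. The figures are copied from the table; `relativeBandwidth` is a name introduced here for illustration only.

```go
package main

import "fmt"

// relativeBandwidth returns measured as a percentage of baseline.
func relativeBandwidth(measured, baseline float64) float64 {
	return 100 * measured / baseline
}

func main() {
	// Bandwidth in GB/s, copied from the benchmark table.
	sizes := []string{"64 MB", "256 MB", "1 GB"}
	runc := []float64{10.65, 10.79, 10.82}
	broken := []float64{5.06, 5.11, 5.11}
	fixed := []float64{10.55, 10.53, 10.72}

	for i, size := range sizes {
		fmt.Printf("%-6s broken: %.1f%% of runc, fixed: %.1f%% of runc\n",
			size,
			relativeBandwidth(broken[i], runc[i]),
			relativeBandwidth(fixed[i], runc[i]))
	}
}
```

On these numbers the broken configuration reaches roughly 47% of runc at every size, and the fixed configuration reaches between about 97.6% and 99.1%.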


luiscape · 2026-03-27T17:42:59Z

@ayushr2 would love your input on this one. We found this to be affecting many of our user workloads.

ayushr2 requested a review from nixprime March 27, 2026 19:00

luiscape wants to merge 1 commit into google:master from luiscape:master
