nvproxy: Fixes performance degradation for GPU→CPU pageable transfer bandwidth #12805
Open
luiscape wants to merge 1 commit into google:master
Add `madvise(MADV_NOHUGEPAGE)` on the DMA region in `rmAllocOSDescriptor()`, right before the existing `MADV_POPULATE_WRITE` call.

This is a single `madvise` syscall that:

1. **Tells the kernel** to use 4 KB pages for this specific region, preemptively splitting any existing THP backing.
2. **Prevents future THP formation** in this region, so `pin_user_pages()` never encounters compound pages.
3. **Has no effect on the rest of the Sentry's memory**: application heap, stacks, and other mappings continue to benefit from THP for fast page faults.

I think this is a good approach because:

- The fix applies only to memory regions that will be passed to the nvidia driver for DMA page pinning (`NV01_MEMORY_SYSTEM_OS_DESCRIPTOR` allocations). These are the only regions where THP causes a problem, and they represent a tiny fraction of total Sentry memory.
- The fix works regardless of the host's `shmem_enabled` setting. When `shmem_enabled=never`, the `MADV_NOHUGEPAGE` call is a harmless no-op (the pages are already 4 KB). When `shmem_enabled=always`, it prevents the regression. Operators don't need to choose between fast page faults and fast GPU transfers.
- The code already calls `MADV_POPULATE_WRITE` (added to avoid `mmap_lock` contention during `pin_user_pages()`). The `MADV_NOHUGEPAGE` call slots in naturally before it: first disable THP for the region, then pre-fault the 4 KB pages, then let the nvidia driver pin them without any compound page overhead.
- The kernel documentation for `MADV_NOHUGEPAGE` specifically calls out that it is useful for memory regions where huge pages cause performance regressions due to splitting overhead, which is exactly this case.
| Configuration | d2h pageable 64 MB | d2h pageable 256 MB | d2h pageable 1 GB |
|---|---|---|---|
| runc (control) | 10.65 GB/s | 10.79 GB/s | 10.82 GB/s |
| gVisor (broken, `shmem_enabled=always`) | 5.06 GB/s | 5.11 GB/s | 5.11 GB/s |
| **gVisor (fixed, `shmem_enabled=always`)** | **10.55 GB/s** | **10.53 GB/s** | **10.72 GB/s** |

The fix recovers the lost bandwidth, bringing gVisor to within about 2.5% of the runc control at every transfer size, while preserving THP benefits for general workloads.
Contributor Author
@ayushr2 would love your input on this one. We found this to be affecting many of our user workloads.
Fixes #12804