
add mark_not_offload() interface for cpu_offload_v1 #2770

Open
lhb8125 wants to merge 2 commits into NVIDIA:main from lhb8125:hongbinl/offload_activation_mxfp8_fix

Conversation

@lhb8125
Contributor

@lhb8125 lhb8125 commented Mar 17, 2026

Description

This PR adds a mark_not_offload() interface to the cpu_offload v1 code path, so that MCore's offloading feature can reuse mark_not_offload() to exclude specific tensors from being offloaded.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
@ksivaman ksivaman requested a review from pggPL March 17, 2026 05:59
@ksivaman
Member

@lhb8125 Could you please update the PR description? @pggPL for review

@greptile-apps
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR wires up mark_not_offload() for the V1 CPU offload code path (NVTE_CPU_OFFLOAD_V1=1). Previously, the function returned early without doing anything when the V1 path was active. The change correctly adds a delegation call in cpu_offload.py and introduces a new mark_not_offload implementation in cpu_offload_v1.py.

However, the V1 implementation contains a logic bug that makes it a no-op:

  • cpu_offload.py — the delegation fix is correct and straightforward.
  • cpu_offload_v1.py — the new mark_not_offload sets _TE_do_not_offload = True on the underlying tensors. This attribute is checked exclusively in the non-V1 path (OffloadableLayerState._check_if_offload, cpu_offload.py:463). The V1 path decides whether to offload via tensor_need_offloading_checker_activations, which only checks for hasattr(tensor, "activation_offloading"). As a result, if a tensor was previously marked with mark_activation_offload, calling mark_not_offload will not prevent it from being offloaded in V1. The V1 implementation should instead remove the activation_offloading attribute (matching the opt-in semantics of the V1 offloading mechanism), similar to how mark_activation_offload sets it.
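The no-op described above can be demonstrated with a minimal sketch. This is illustrative only, not the actual Transformer Engine source: a plain object stands in for a torch.Tensor (TE stores plain attributes on tensors), and the checker reproduces the opt-in gate the summary describes.

```python
class FakeTensor:
    """Stand-in for a torch.Tensor; TE stores plain attributes on tensors."""

def tensor_need_offloading_checker_activations(tensor):
    # V1's opt-in gate as described in the review: a tensor is offloaded
    # iff the `activation_offloading` attribute is present. Nothing here
    # ever reads `_TE_do_not_offload`.
    return hasattr(tensor, "activation_offloading")

t = FakeTensor()
t.activation_offloading = True   # set by mark_activation_offload
t._TE_do_not_offload = True      # what the PR's V1 mark_not_offload sets
# The opt-out flag is never consulted, so the tensor is still offloaded:
assert tensor_need_offloading_checker_activations(t)
```

As the sketch shows, setting an attribute that the V1 decision function never reads cannot change its outcome, which is why the review rates the new function a silent no-op.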

Confidence Score: 2/5

  • Not safe to merge — the core V1 implementation sets an attribute that is never checked, making mark_not_offload a silent no-op in the V1 path.
  • The delegation in cpu_offload.py is correct, but the actual V1 implementation in cpu_offload_v1.py uses the wrong attribute (_TE_do_not_offload instead of removing activation_offloading). The V1 offloading decision is entirely gated on activation_offloading, so the new function does not achieve its stated goal.
  • transformer_engine/pytorch/cpu_offload_v1.py — specifically the new mark_not_offload function (lines 52–60).

Important Files Changed

Filename Overview
transformer_engine/pytorch/cpu_offload_v1.py Adds mark_not_offload to the V1 code path and imports prepare_for_saving/restore_from_saved, but the implementation sets _TE_do_not_offload which is never consulted by V1's tensor_need_offloading_checker — the function is effectively a no-op.
transformer_engine/pytorch/cpu_offload.py Small, correct change: delegates to v1_code_path.mark_not_offload before returning early when NVTE_CPU_OFFLOAD_V1 is set, fixing the previously silent no-op.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["mark_not_offload(*tensors) called\n(cpu_offload.py)"] --> B{NVTE_CPU_OFFLOAD_V1?}
    B -- Yes --> C["v1_code_path.mark_not_offload(*tensors)\n(cpu_offload_v1.py)"]
    B -- No --> D["prepare_for_saving(*tensors)"]
    D --> E["setattr(tensor, '_TE_do_not_offload', True)"]
    E --> F["restore_from_saved(tensor_obj, tensors)"]
    F --> G["Non-V1 push_tensor checks\n_TE_do_not_offload ✅"]
    C --> H["prepare_for_saving(*tensors)"]
    H --> I["setattr(tensor, '_TE_do_not_offload', True)"]
    I --> J["restore_from_saved(tensor_obj, tensors)"]
    J --> K["V1 tensor_need_offloading_checker\nchecks activation_offloading only ❌\n_TE_do_not_offload is never read"]
    style K fill:#ffcccc,stroke:#cc0000
    style G fill:#ccffcc,stroke:#009900

Last reviewed commit: 730d7a1

Comment on lines +52 to +60
def mark_not_offload(*tensors: torch.Tensor):
    """Marks tensors to prevent them from being offloaded."""
    tensors, tensor_obj = prepare_for_saving(*tensors)

    for tensor in tensors:
        if tensor is not None:
            setattr(tensor, "_TE_do_not_offload", True)

    restore_from_saved(tensor_obj, tensors)

P1 _TE_do_not_offload is never checked in V1 path — function is a no-op

The V1 code path decides whether to offload a tensor using tensor_need_offloading_checker_activations, which checks only for hasattr(tensor, "activation_offloading") (see cpu_offload_v1.py line 738–739). The attribute _TE_do_not_offload set here is never consulted in the V1 path, so calling this function after mark_activation_offload will not prevent offloading.

To confirm: the only place _TE_do_not_offload is read in the whole codebase is cpu_offload.py line 463 (OffloadableLayerState._check_if_offload), which belongs to the non-V1 code path.

For mark_not_offload to actually work in V1, the implementation should mirror how mark_activation_offload operates — i.e., remove (or otherwise gate on) the activation_offloading attribute:

def mark_not_offload(*tensors: torch.Tensor):
    """Marks tensors to prevent them from being offloaded."""
    for tensor in tensors:
        if tensor is None:
            continue
        if type(tensor) in [torch.Tensor, torch.nn.Parameter]:
            if hasattr(tensor, "activation_offloading"):
                del tensor.activation_offloading
        else:
            data_tensors = tensor.get_data_tensors()
            for t in data_tensors:
                if t is not None and hasattr(t, "activation_offloading"):
                    del t.activation_offloading

Alternatively, tensor_need_offloading_checker_activations could be updated to also gate on _TE_do_not_offload, but that would be a more invasive change.
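That more invasive alternative might look like the following sketch. This is hypothetical, not the actual TE source: a plain object stands in for a tensor, and the checker variant adds an opt-out test on top of the opt-in gate described above.

```python
class FakeTensor:
    """Stand-in for a torch.Tensor; TE stores plain attributes on tensors."""

def tensor_need_offloading_checker_activations(tensor):
    # Hypothetical variant of V1's checker: keep the opt-in gate on
    # `activation_offloading`, but also honor the opt-out flag that the
    # PR's mark_not_offload sets on the tensor.
    return hasattr(tensor, "activation_offloading") and not getattr(
        tensor, "_TE_do_not_offload", False
    )

t = FakeTensor()
t.activation_offloading = True
assert tensor_need_offloading_checker_activations(t)      # opted in
t._TE_do_not_offload = True
assert not tensor_need_offloading_checker_activations(t)  # opted out again
```

Under this variant the attribute the PR already sets would take effect, at the cost of touching the V1 offloading decision logic itself rather than only the marking helpers.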

@lhb8125
Contributor Author

lhb8125 commented Mar 17, 2026

/te-ci pytorch L1

