add mark_not_offload() interface for cpu_offload_v1 #2770
lhb8125 wants to merge 2 commits into NVIDIA:main from
Conversation
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary

This PR wires up `mark_not_offload()` for the cpu_offload V1 code path. However, the V1 implementation contains a logic bug that makes it a no-op:
Confidence Score: 2/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["mark_not_offload(*tensors) called\n(cpu_offload.py)"] --> B{NVTE_CPU_OFFLOAD_V1?}
    B -- Yes --> C["v1_code_path.mark_not_offload(*tensors)\n(cpu_offload_v1.py)"]
    B -- No --> D["prepare_for_saving(*tensors)"]
    D --> E["setattr(tensor, '_TE_do_not_offload', True)"]
    E --> F["restore_from_saved(tensor_obj, tensors)"]
    F --> G["Non-V1 push_tensor checks\n_TE_do_not_offload ✅"]
    C --> H["prepare_for_saving(*tensors)"]
    H --> I["setattr(tensor, '_TE_do_not_offload', True)"]
    I --> J["restore_from_saved(tensor_obj, tensors)"]
    J --> K["V1 tensor_need_offloading_checker\nchecks activation_offloading only ❌\n_TE_do_not_offload is never read"]
    style K fill:#ffcccc,stroke:#cc0000
    style G fill:#ccffcc,stroke:#009900
```

Last reviewed commit: 730d7a1
```python
def mark_not_offload(*tensors: torch.Tensor):
    """Marks tensors to prevent them from being offloaded."""
    tensors, tensor_obj = prepare_for_saving(*tensors)

    for tensor in tensors:
        if tensor is not None:
            setattr(tensor, "_TE_do_not_offload", True)

    restore_from_saved(tensor_obj, tensors)
```
_TE_do_not_offload is never checked in V1 path — function is a no-op
The V1 code path decides whether to offload a tensor using tensor_need_offloading_checker_activations, which checks only for hasattr(tensor, "activation_offloading") (see cpu_offload_v1.py, lines 738–739). The _TE_do_not_offload attribute set here is never consulted in the V1 path, so calling this function after mark_activation_offload will not prevent offloading.
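A minimal repro sketch of the review's point, using a plain-Python stand-in for a tensor (so it runs without torch); `v1_checker` mimics the hasattr test that tensor_need_offloading_checker_activations performs, and `FakeTensor` is a hypothetical stand-in, not TE code:

```python
class FakeTensor:
    """Stand-in for a torch.Tensor; accepts arbitrary attributes."""


def v1_checker(tensor):
    # Mimics tensor_need_offloading_checker_activations: only the
    # activation_offloading attribute is ever consulted.
    return hasattr(tensor, "activation_offloading")


t = FakeTensor()
t.activation_offloading = True   # as set by mark_activation_offload
t._TE_do_not_offload = True      # as set by this PR's mark_not_offload

# The V1 checker still selects the tensor for offloading: the
# _TE_do_not_offload flag is never read, so the function is a no-op.
print(v1_checker(t))  # True
```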
To confirm: the only place _TE_do_not_offload is read in the whole codebase is cpu_offload.py line 463 (OffloadableLayerState._check_if_offload), which belongs to the non-V1 code path.
For mark_not_offload to actually work in V1, the implementation should mirror how mark_activation_offload operates — i.e., remove (or otherwise gate on) the activation_offloading attribute:
```python
def mark_not_offload(*tensors: torch.Tensor):
    """Marks tensors to prevent them from being offloaded."""
    for tensor in tensors:
        if tensor is None:
            continue
        if type(tensor) in [torch.Tensor, torch.nn.Parameter]:
            if hasattr(tensor, "activation_offloading"):
                del tensor.activation_offloading
        else:
            data_tensors = tensor.get_data_tensors()
            for t in data_tensors:
                if t is not None and hasattr(t, "activation_offloading"):
                    del t.activation_offloading
```

Alternatively, tensor_need_offloading_checker_activations could be updated to also gate on _TE_do_not_offload, but that would be a more invasive change.
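A sketch of the suggested fix's effect, again with a plain-Python stand-in (`FakeTensor` and `v1_checker` are hypothetical stand-ins for a torch.Tensor and for tensor_need_offloading_checker_activations; the attribute name matches the TE source):

```python
class FakeTensor:
    """Stand-in for a torch.Tensor; accepts arbitrary attributes."""


def v1_checker(tensor):
    # Stand-in for tensor_need_offloading_checker_activations.
    return hasattr(tensor, "activation_offloading")


def mark_not_offload(*tensors):
    # Suggested V1 semantics: drop the attribute the V1 checker keys on,
    # instead of setting a flag the checker never reads.
    for tensor in tensors:
        if tensor is not None and hasattr(tensor, "activation_offloading"):
            del tensor.activation_offloading


t = FakeTensor()
t.activation_offloading = True   # as set by mark_activation_offload
mark_not_offload(t)
print(v1_checker(t))  # False: the tensor is no longer selected for offload
```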
/te-ci pytorch L1
Description
This PR adds the mark_not_offload() interface for the cpu_offload v1 version, so that MCore's offloading feature can reuse the mark_not_offload() interface to avoid offloading certain tensors.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: