
feat(vision): add Vision DP for parallel ViT computation across Ulysses SP ranks #357

Open

aoshen524 wants to merge 8 commits into alibaba:main from aoshen524:feat/vision-dp-ulysses

Conversation

aoshen524 commented Feb 16, 2026

Vision Data Parallel: Distribute ViT computation across Ulysses SP ranks

Ported from verl PR #5230, adapted for ROLL's Ulysses SP infrastructure.

Motivation

When using Ulysses Sequence Parallelism (sp_size > 1), the VisionTransformer still processes all images on every rank, wasting memory. Vision DP distributes whole images across the SP ranks, reducing ViT peak memory by roughly sp_size×.
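Distributing whole images while keeping the per-rank load balanced amounts to contiguous bin-packing over per-image patch counts. A minimal stdlib-only sketch of that idea (the function name mirrors the PR's `assign_images_to_dp_ranks` utility, but the signature, return type, and boundary tie-breaking here are assumptions, not the PR's exact implementation):

```python
def assign_images_to_dp_ranks(patch_counts, dp_size):
    """Split a list of per-image patch counts into dp_size contiguous
    chunks whose total patch loads are roughly balanced.

    Returns one (start, end) image-index range per rank; a range may be
    empty when there are fewer images than ranks.
    """
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    if not patch_counts:
        raise ValueError("patch_counts must be non-empty")
    total = sum(patch_counts)
    ranges, start, acc = [], 0, 0
    for rank in range(dp_size):
        if rank == dp_size - 1:
            end = len(patch_counts)  # last rank takes the remainder
        else:
            # Ideal cumulative load after this rank's chunk.
            boundary = total * (rank + 1) / dp_size
            end = start
            # Take the next image only if doing so moves the cumulative
            # load closer to the ideal boundary than stopping here would.
            while end < len(patch_counts) and abs(
                acc + patch_counts[end] - boundary
            ) <= abs(acc - boundary):
                acc += patch_counts[end]
                end += 1
        ranges.append((start, end))
        start = end
    return ranges
```

For example, with patch counts `[100, 10, 10, 10]` and two ranks, pure count-based chunking would give each rank two images (110 vs 20 patches), while load-based splitting assigns the large image alone to rank 0.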

Key changes

| File | Change |
| --- | --- |
| `roll/utils/context_parallel/vision_dp.py` | Core utilities: load-balanced assignment, tensor slicing, all-gather with gradient fix |
| `roll/utils/context_parallel/monkey_patch.py` | Integration with idempotency guard, clean unapply support |
| `tests/utils/test_vision_dp_on_cpu.py` | CPU-only unit tests (28 tests) |

Tests

python -m pytest tests/utils/test_vision_dp_on_cpu.py -v # 28 passed

CLAassistant commented Feb 16, 2026

CLA assistant check
All committers have signed the CLA.

@PanAndy PanAndy requested a review from chocoded February 26, 2026 09:12
aoshen524 and others added 2 commits February 26, 2026 18:24
…es SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation, reducing ViT peak memory by ~sp_size x (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities, GatherVisionEmbeddings autograd function, and model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-VL-MoE VisionTransformer classes
- Integrate into DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…issues

Address reviewer comments (same fixes as verl PR #5230 and AReaL PR #929):

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with `all_reduce(SUM)` in GatherVisionEmbeddings.backward() to aggregate partial sequence gradients before slicing. Fixes silent gradient loss when vision tokens span multiple sequence shard boundaries.
2. **Load-balanced assignment**: Replace count-based chunking with greedy contiguous bin-packing that balances total patch load across ranks.
3. **Remove unnecessary all_gather**: Pass pre-computed `all_counts` from the caller instead of doing an all_gather in forward.
4. **Idempotency guard**: Extract `_patch_vision_class()` helper with a `_vision_dp_patched` attribute check. Add `_unapply_vision_class()` to properly clear the flag on unapply.
5. **Remove Qwen3-VL-MoE dead code**: Remove unreachable qwen3_vl_moe blocks from apply/unapply (not yet in transformers vl_model_mappings).
6. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to the dp_vision_forward entry point to avoid repeated `.tolist()` GPU→CPU syncs.
7. **Tensor slicing**: Replace the Python loop + list append in prepare_local_vision_inputs with a contiguous tensor slice using cumsum.
8. **Test improvements**: Rename tests, add a load balancing test, add a gather_none_group test, use parametrize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
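The gradient routing fix can be illustrated without any distributed setup. Each rank's loss shard produces a partial gradient over the *full* gathered vision embedding; summing across ranks (what `all_reduce(SUM)` does) before slicing recovers every token's full gradient, whereas scaling one rank's partial gradient by `dp_size` silently loses contributions from tokens that span shard boundaries. A stdlib-only toy with made-up numbers (not the PR's code):

```python
# Toy: 2 ranks gathered a 4-token vision embedding. Each rank's sequence
# shard only touches some tokens, so its gradient over the full gathered
# tensor is zero elsewhere. Token 1 spans the shard boundary: both ranks
# contribute to it.
grad_on_rank = [
    [1.0, 2.0, 0.0, 0.0],  # rank 0's shard covers tokens 0-1
    [0.0, 0.5, 3.0, 4.0],  # rank 1's shard covers tokens 1-3
]
dp_size = len(grad_on_rank)

# Correct: sum the partial gradients across ranks (all_reduce(SUM)),
# then each rank slices out the rows for its own images.
summed = [sum(col) for col in zip(*grad_on_rank)]

# Broken: multiplying a single rank's partial gradient by dp_size.
# Token 1 becomes 2.0 * 2 = 4.0 instead of the true 2.0 + 0.5 = 2.5,
# and rank 1's contribution is silently dropped.
scaled_rank0 = [g * dp_size for g in grad_on_rank[0]]
```

The scaling shortcut only happens to be correct when every token's gradient lives entirely on one rank, which is exactly the assumption that breaks when vision tokens cross sequence shard boundaries.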
@aoshen524 aoshen524 force-pushed the feat/vision-dp-ulysses branch from 8227ee1 to 1b13eaf Compare February 26, 2026 09:25
aoshen524 and others added 5 commits March 3, 2026 22:47
…d contiguous guard

- Trim verbose docstrings to concise one-liners
- Delete dead store ctx.hidden_size (written in forward, never read in backward)
- Simplify hidden_size detection: self.config.out_hidden_size
- Add requires_grad_() for empty rank to participate in backward all_reduce
- Add .contiguous() guard before all_reduce (NCCL requirement)
- Reuse get_image_patch_counts in spatial_merge_size==1 path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace isinstance(tuple) check with model attribute detection (hasattr deepstack_merger_list). Empty ranks now create matching empty deepstack tensors and participate in all-gather, preventing NCCL deadlock when num_images < dp_size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `vision_dp: bool = False` to ModelArguments and gate apply_vision_dp_patch() calls in both DeepSpeedInferStrategy and DeepSpeedTrainStrategy behind it. Vision DP is now opt-in. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace `expected_patches = end_patch - start_patch` (always true by Python slicing) with an independent cross-check via `get_image_patch_counts(local_grid_thw)` in prepare_local_vision_inputs()
- Rename tests to the `test_<what>_<condition>_<expected>()` convention
- Add missing tests: embedding_counts empty, contiguous coverage, gather same-storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sync shared utility functions with verl's stricter error handling:

- get_image_patch_counts/get_image_embedding_counts: empty grid_thw raises ValueError instead of returning []
- assign_images_to_dp_ranks: validate dp_size > 0; empty patch_counts raises ValueError instead of returning empty lists
- prepare_local_vision_inputs: add dp_rank bounds check, use tensor ops for offset computation (avoiding a Python-list round-trip), add int() cast
- GatherVisionEmbeddings.forward: dp_size <= 1 raises RuntimeError, validate all_counts length, max_count == 0 raises RuntimeError
- GatherVisionEmbeddings.backward: assert dp_size > 1, add CUDA check
- dp_vision_forward: sp_size <= 1 raises RuntimeError, use GatherVisionEmbeddings.apply() directly, add detailed assert messages
- Update tests to match: empty→raises, add dp_size/dp_rank validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
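The offset computation in `prepare_local_vision_inputs` is a cumulative sum over per-image patch counts: prefix sums give each image's starting row, so a rank's images map to one contiguous slice of the flattened pixel tensor. A stdlib sketch of the idea, with `itertools.accumulate` standing in for `torch.cumsum` and plain lists standing in for tensors (the function name matches the PR; this signature and the list-based types are assumptions):

```python
from itertools import accumulate


def prepare_local_vision_inputs(pixel_values, patch_counts, img_range):
    """Slice out the contiguous patch rows for one rank's images.

    pixel_values: per-patch rows, concatenated over all images in order.
    patch_counts: number of patches per image.
    img_range:    (start, end) image indices assigned to this rank.
    """
    start_img, end_img = img_range
    # Prefix sums: offsets[i] is the patch row where image i begins.
    offsets = [0] + list(accumulate(patch_counts))
    start_patch, end_patch = offsets[start_img], offsets[end_img]
    local = pixel_values[start_patch:end_patch]
    # Independent cross-check (the commit above replaces an always-true
    # check with one recomputed from the local images' own counts).
    expected = sum(patch_counts[start_img:end_img])
    if len(local) != expected:
        raise RuntimeError("patch count mismatch in local slice")
    return local
```

Because the image assignment is contiguous, this is a single slice rather than a per-image loop with appends, which is what the "tensor slicing" item in the earlier commit refers to.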
Contributor

guoshengCS left a comment


Thanks for the good job!

```python
current_platform.apply_ulysses_patch()
set_upg_manager(ulysses_size=cp_size, rank=global_rank, world_size=world_size)
if self.worker_config.model_args.vision_dp:
    apply_vision_dp_patch()
```
Contributor

It seems vision_dp also suits the FSDP strategy in the same way, and apply_vision_dp_patch has to be called manually since it is not included in apply_ulysses_patch. Could you please support it in fsdp_strategy too?

Author

aoshen524 Mar 5, 2026

sure, done.

Call apply_vision_dp_patch() in fsdp2_strategy.py after set_upg_manager(), mirroring the existing pattern in deepspeed_strategy.py. This ensures Vision DP works correctly with FSDP2, not just DeepSpeed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@guoshengCS
Contributor

@aoshen524 Could you please provide training curves with and without Vision DP?

```python
    local_embeddings = original_forward(self, local_pixels, local_grid_thw, **kwargs)
else:
    # This rank has no images, create empty tensor with correct hidden size
    hidden_size = getattr(getattr(self, "config", None), "out_hidden_size", None)
```


```python
        (0, h), dtype=hidden_states.dtype, device=hidden_states.device
    )
    for _ in range(num_deepstack)
]
```


Why doesn't the empty-rank local_deepstack path also call requires_grad_(), similar to empty local_embeddings? Since each deepstack tensor is also passed through GatherVisionEmbeddings, it seems those empty tensors should still participate in autograd so every rank enters the same backward all_reduce. Otherwise, could empty ranks skip the custom backward for deepstack and risk a collective mismatch or hang?

