Problem
I need to configure Megatron for training a 4B model with 64K max-response-len using PPO on 4x H200 GPUs.
I'm experiencing Out of Memory (OOM) errors when training with 64K sequences, even with --use-dynamic-batch-size and --sequence-parallel enabled.
My Understanding
I set `--max-tokens-per-gpu`. I expected:
- Dynamic batch size would automatically adjust micro_batch_size based on sequence length
- This would prevent activations from exceeding GPU memory
- optimizer.step() memory usage should remain stable regardless of global batch size
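To make my mental model concrete, here is a minimal sketch of how I understand `--use-dynamic-batch-size` to behave: samples are greedily grouped into micro-batches so that the token count per micro-batch never exceeds `--max-tokens-per-gpu`. The function and variable names below are my own illustration, not the framework's actual API:

```python
def pack_dynamic_micro_batches(sample_lengths, max_tokens_per_gpu):
    """Greedily pack samples (given as token lengths) into micro-batches
    so no micro-batch exceeds the per-GPU token budget."""
    micro_batches, current, current_tokens = [], [], 0
    for idx, length in enumerate(sample_lengths):
        if length > max_tokens_per_gpu:
            raise ValueError(f"sample {idx} ({length} tokens) exceeds budget")
        if current_tokens + length > max_tokens_per_gpu:
            # Current micro-batch is full; start a new one.
            micro_batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        micro_batches.append(current)
    return micro_batches

# With a 65536-token budget, one 64K response fills a micro-batch by itself,
# while several short responses can share one.
print(pack_dynamic_micro_batches([65536, 2048, 4096, 60000], 65536))
# → [[0], [1, 2], [3]]
```

Under this model, activation memory per step is bounded by the token budget rather than by the number of samples, which is why I expected it to prevent activation OOMs.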
However, simply reducing the global batch size seems unlikely to avoid the OOM, since optimizer memory is independent of the global batch size.
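A rough back-of-envelope for why the static (batch-size-independent) memory is already large at 4B parameters. The byte counts below assume standard Megatron-style bf16 mixed-precision Adam without the distributed optimizer (bf16 params, fp32 grads, fp32 master params, fp32 Adam moments); these are my assumptions, not measured numbers:

```python
# Assumed bytes per parameter (bf16 mixed-precision Adam, no distributed optimizer):
#   bf16 params: 2, fp32 grads: 4, fp32 master params: 4, Adam m+v: 8  -> 18 B/param
PARAMS = 4e9            # 4B-parameter model
BYTES_PER_PARAM = 2 + 4 + 4 + 8
TP = 2                  # tensor parallelism shards weights/grads/optimizer states

per_gpu_gib = PARAMS * BYTES_PER_PARAM / TP / 2**30
print(f"~{per_gpu_gib:.1f} GiB per GPU for weights + grads + optimizer states")
# → ~33.5 GiB per GPU for weights + grads + optimizer states
```

None of these terms scale with the global batch size; only activation memory scales with tokens per micro-batch. And with `--colocate`, the rollout engine shares the same GPUs, so the headroom left for 64K-token activations is even smaller.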
My Config
```bash
TP_SIZE=2
PP_SIZE=1
CP_SIZE=1
EP_SIZE=1
ETP_SIZE=1
MAX_LEN=$((1024 * 64))                                # 64K
MAX_TOKENS_PER_GPU=$((($MAX_LEN / $CP_SIZE) + 1024))  # ~65K
ROLLOUT_BATCH_SIZE=16
N_SAMPLES_PER_PROMPT=4
NUM_STEPS_PER_ROLLOUT=4

--rollout-batch-size $ROLLOUT_BATCH_SIZE \
--n-samples-per-prompt $N_SAMPLES_PER_PROMPT \
--rollout-max-response-len $MAX_LEN \
--colocate \
--actor-num-gpus-per-node 2 \
--tensor-model-parallel-size 2 \
--sequence-parallel \
--pipeline-model-parallel-size 1 \
--context-parallel-size 1 \
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1 \
--use-dynamic-batch-size \
--max-tokens-per-gpu 65536 \
--transformer-impl transformer_engine \
--bf16 \
--fp8-format e4m3 \
--fp8-recipe blockwise
```

Is there any way to avoid the OOM problem during long-sequence training?
Any guidance appreciated! 🙏