
OOM when training with long sequences despite using dynamic batch size and sequence parallel #1522

@Dodojordi


Problem
I need to configure Megatron for training a 4B model with 64K max-response-len using PPO on 4x H200 GPUs.
I'm experiencing Out of Memory (OOM) errors when training with 64K sequences, even with --use-dynamic-batch-size and --sequence-parallel enabled.

My Understanding
I set --max-tokens-per-gpu accordingly. I expected:

  • Dynamic batch size would automatically adjust micro_batch_size based on sequence length
  • This would prevent activations from exceeding GPU memory
  • optimizer.step() memory usage would remain stable regardless of global batch size

However, it seems unlikely that OOM can be avoided simply by reducing the global batch size, since optimizer memory is independent of the global batch size.
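To illustrate the first expectation, here is a minimal sketch of how a dynamic batcher might pack samples under a per-GPU token budget. This is a hypothetical illustration of the expected behavior of --use-dynamic-batch-size, not the actual Megatron/slime implementation:

```python
# Hypothetical sketch of dynamic micro-batching under a per-GPU token
# budget -- NOT the real implementation, just the behavior I expect
# from --use-dynamic-batch-size / --max-tokens-per-gpu.

def pack_micro_batches(seq_lens, max_tokens_per_gpu):
    """Greedily group samples so each micro batch stays under the budget."""
    batches, current, current_tokens = [], [], 0
    for n in seq_lens:
        if n > max_tokens_per_gpu:
            raise ValueError(f"single sequence of {n} tokens exceeds the budget")
        if current and current_tokens + n > max_tokens_per_gpu:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

# With a 65536-token budget, a 64K response fills a micro batch by itself,
# so micro_batch_size effectively drops to 1 for the longest samples.
print(pack_micro_batches([65536, 2048, 2048, 65536], 65536))
# -> [[65536], [2048, 2048], [65536]]
```

So even with dynamic batching working as expected, the longest samples still put a full 64K sequence of activations on each rank.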

My Config

```shell
TP_SIZE=2
PP_SIZE=1
CP_SIZE=1
EP_SIZE=1
ETP_SIZE=1
MAX_LEN=$((1024 * 64))                                # 64K
MAX_TOKENS_PER_GPU=$((($MAX_LEN / $CP_SIZE) + 1024))  # ~65K
ROLLOUT_BATCH_SIZE=16
N_SAMPLES_PER_PROMPT=4
NUM_STEPS_PER_ROLLOUT=4

--rollout-batch-size $ROLLOUT_BATCH_SIZE \
--n-samples-per-prompt $N_SAMPLES_PER_PROMPT \
--rollout-max-response-len $MAX_LEN \
--colocate \
--actor-num-gpus-per-node 2 \
--tensor-model-parallel-size 2 \
--sequence-parallel \
--pipeline-model-parallel-size 1 \
--context-parallel-size 1 \
--recompute-granularity full \
--recompute-method uniform \
--recompute-num-layers 1 \
--use-dynamic-batch-size \
--max-tokens-per-gpu 65536 \
--transformer-impl transformer_engine \
--bf16 \
--fp8-format e4m3 \
--fp8-recipe blockwise
```
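For reference, a back-of-envelope check of the numbers in my config (note that the shell arithmetic yields 66560, while the flag I actually pass is 65536). The activation estimate below uses an assumed model shape (a 4B model with 36 layers and hidden size 2560) and is only illustrative:

```python
# Back-of-envelope check of the config above.
MAX_LEN = 1024 * 64                     # 64K max response length
CP_SIZE = 1                             # context parallel degree
TP_SIZE = 2                             # tensor parallel degree

max_tokens_per_gpu = MAX_LEN // CP_SIZE + 1024
print(max_tokens_per_gpu)               # -> 66560 (but the flag passes 65536)

# With --recompute-granularity full, roughly only each layer's input
# activations are retained: ~ num_layers * seq_len * hidden * 2 bytes (bf16),
# split across TP ranks when --sequence-parallel is on.
# ASSUMED model shape: 36 layers, hidden size 2560.
num_layers, hidden = 36, 2560
act_bytes = num_layers * MAX_LEN * hidden * 2 // TP_SIZE
print(f"{act_bytes / 2**30:.2f} GiB")   # -> 5.62 GiB of retained layer inputs
```

Even under these rough assumptions, retained layer inputs alone are several GiB per rank for a single 64K sample, before logits, optimizer state, and the colocated rollout engine are counted.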

Is there any way to avoid the OOM problem during long sequence training?
Any guidance appreciated! 🙏
