zhangbo9674 commented Nov 4, 2025

Before submitting

  • Lint code. If there are lint issues, please format the code first.
    # Install and register `pre-commit` in the project folder
    pip install pre-commit && pre-commit install

    # Process previous code files separately
    pre-commit run --file XXXX.py

  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

Bug fixes

PR changes

Models

Description

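Fixes the failure of DSV3 post-pretrain to converge on warm start. The main changes are:

  • Fix rope_scaling not being applied correctly in the Attention network code. This bug caused a high initial grad norm: see modeling.py
  • Fix FastCrossEntropyFunction not computing in FP32 and not correctly enabling numeric_stable. This bug strongly affected grad norm stability late in training: see modeling.py
  • Fix MoeGate not correctly normalizing gates when computing seq_aux_loss. This bug caused the loss to rise on warm start: see moe_gate.py
  • Fix MoE expert gradients not being clipped correctly. This bug caused a high overall grad norm: see moe_hybrid_parallel_optimizer.py and trainer_callback.py
  • Set parameters as specified in the paper: num_nextn_predict_lambda=0.1, aux_loss_free_gamma=0.0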
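
For readers unfamiliar with the cross-entropy item above, here is a minimal sketch of what a numerically stable cross-entropy looks like. It is not the code in modeling.py; the function names `stable_cross_entropy` and `naive_cross_entropy` are hypothetical, and plain Python floats (64-bit) stand in for the FP32 upcast that a real kernel would apply to bf16/fp16 logits.

```python
import math


def naive_cross_entropy(logits, label):
    """Textbook softmax cross-entropy for one token; overflows on large logits."""
    denom = sum(math.exp(x) for x in logits)
    return -math.log(math.exp(logits[label]) / denom)


def stable_cross_entropy(logits, label):
    """Same loss via the log-sum-exp trick.

    Assumption: in a low-precision training setup, `logits` would first be
    upcast to FP32 here; Python floats are already wider than that.
    """
    m = max(logits)
    # After subtracting the max, every exponent is <= 0, so exp() cannot overflow.
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]
```

On small logits the two functions agree. On logits around 1000, `naive_cross_entropy` raises `OverflowError`, while `stable_cross_entropy` still returns a finite loss.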

Convergence after the fix is shown below:
image
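
The gate-normalization item in the list above can be illustrated with a small sketch of a sequence-wise auxiliary loss in the form given in the DeepSeek-V3 paper: `alpha * sum_i f_i * P_i`, where `P_i` averages each token's affinity scores after they are normalized to sum to 1. This is an assumption-laden illustration, not the code in moe_gate.py; the name `seq_aux_loss` and its arguments are hypothetical.

```python
def seq_aux_loss(scores, top_k, alpha):
    """Sequence-wise balance loss for ONE sequence.

    scores: list of T rows, each holding N non-negative expert affinities
            (for example sigmoid outputs, which do not sum to 1 on their own).
    top_k:  number of routed experts selected per token.
    alpha:  loss coefficient.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    f = [0.0] * num_experts  # scaled fraction of tokens routed to each expert
    p = [0.0] * num_experts  # mean normalized affinity per expert
    for row in scores:
        total = sum(row)
        for i, s in enumerate(row):
            # Normalize per token so the affinities sum to 1 before averaging.
            p[i] += (s / total) / num_tokens
        chosen = sorted(range(num_experts), key=lambda i: row[i], reverse=True)[:top_k]
        for i in chosen:
            f[i] += num_experts / (top_k * num_tokens)
    return alpha * sum(fi * pi for fi, pi in zip(f, p))
```

Because of the per-token normalization, the loss does not change when all scores are multiplied by a constant, and it equals `alpha` when the normalized affinities are uniform. Without the normalization, the loss would scale with the raw magnitude of the gate scores.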

paddle-bot bot commented Nov 4, 2025

Thanks for your contribution!

From00 merged commit 42afde5 into PaddlePaddle:dsv3_dev Nov 4, 2025
2 of 5 checks passed