Fix the non-convergence in DSV3 post-pretrain #11161

zhangbo9674 · 2025-11-04T03:29:21Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install
# Process previous code files separately
pre-commit run --file XXXX.py

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

Bug fixes

PR changes

Models

Description

Fix the non-convergence of DSV3 post-pretrain when warm-starting from a checkpoint. The main changes are:

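Fix rope_scaling not being applied correctly in the Attention network code. This issue caused a high initial grad norm: see modeling.py
Fix FastCrossEntropyFunction not computing in FP32 and not correctly enabling numeric_stable. This issue strongly affects grad norm stability late in training: see modeling.py
Fix MoeGate not normalizing gates correctly when computing seq_aux_loss. This issue caused the loss to rise after a warm start: see moe_gate.py
Fix MoE expert gradients not being clipped correctly. This issue caused a high overall grad norm: see moe_hybrid_parallel_optimizer.py and trainer_callback.py
Set the parameters according to the paper: num_nextn_predict_lambda=0.1, aux_loss_free_gamma=0.0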
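To illustrate the cross-entropy item above, here is a minimal NumPy sketch of computing the loss in FP32 with the usual max-subtraction trick. It is not the PaddleNLP implementation of FastCrossEntropyFunction: the function name `stable_cross_entropy_fp32` is hypothetical, and float16 is used only as a stand-in for whatever low-precision dtype the logits arrive in.

```python
import numpy as np


def stable_cross_entropy_fp32(logits, labels):
    """Per-token cross-entropy, computed in float32 (hypothetical helper)."""
    # Upcast first: exp/sum on low-precision logits loses accuracy.
    logits32 = np.asarray(logits).astype(np.float32)
    # Subtract the row-wise max so exp() cannot overflow.
    shifted = logits32 - logits32.max(axis=-1, keepdims=True)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=-1))
    # Pick the shifted logit of the target class for each row.
    idx = np.asarray(labels)[:, None]
    picked = np.take_along_axis(shifted, idx, axis=-1)[:, 0]
    # loss = logsumexp(logits) - logits[label]; the max cancels out.
    return log_sum_exp - picked


# Illustrative inputs: the first row would overflow a naive float32 exp().
logits = np.array([[1000.0, 0.0, -1000.0], [2.0, 1.0, 0.1]], dtype=np.float16)
labels = np.array([0, 1])
loss = stable_cross_entropy_fp32(logits, labels)
```

The first row has a logit of 1000, so `exp` applied directly would overflow even in float32; after the shift the largest exponent is 0 and the result stays finite.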

Convergence after the fix is shown below:

paddle-bot · 2025-11-04T06:01:39Z

Thanks for your contribution!

fix loss bug

e329411

From00 merged commit 42afde5 into PaddlePaddle:dsv3_dev Nov 4, 2025
2 of 5 checks passed
