@lshpku lshpku commented Aug 21, 2025

PR types

Bug fixes

PR changes

Models

Description

Make pp_stream wait for attn_backward_dx. This fixes the slow loss decrease seen when overlap_p2p_comm is enabled.

The figure below shows the wait relationships before and after the fix.
[Image 1]
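To illustrate the kind of ordering this wait enforces, here is a toy model in plain Python. It is only an analogy, not Paddle code and not the code in this PR: two threads stand in for the compute stream and pp_stream, and a `threading.Event` stands in for a device event recorded after attn_backward_dx. The function name `run_schedule` and the flag `wait_for_dx` are made up for this sketch.

```python
import threading
import time


def run_schedule(wait_for_dx: bool) -> list:
    """Toy model: two Python threads play the role of two device streams."""
    log = []
    log_lock = threading.Lock()
    dx_done = threading.Event()  # stands in for an event recorded on the compute stream

    def compute_stream():
        time.sleep(0.05)  # pretend the backward kernel takes a while
        with log_lock:
            log.append("attn_backward_dx")
        dx_done.set()  # analogous to recording the event once dx is finished

    def pp_stream():
        if wait_for_dx:
            dx_done.wait()  # analogous to pp_stream waiting on that event
        with log_lock:
            log.append("PP(F)")

    threads = [
        threading.Thread(target=compute_stream),
        threading.Thread(target=pp_stream),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    # With the wait, PP(F) can only be logged after attn_backward_dx.
    print(run_schedule(wait_for_dx=True))
```

Without the wait, the relative order of the two entries depends on timing, which is the situation the added wait rules out. The cost is the same as in the real change: PP(F) starts later than it otherwise could.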

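To be honest, I don't know why adding this wait works. I only narrowed the problem down to PP(F) by bisection, then tried adding a wait, after which the loss behaved normally. My guess is that it is related to cross-stream memory allocation: through unit tests I found that Paddle's cross-stream memory allocation is unsafe in some cases. The model does not appear to contain any unsafe usage, but that is hard to rule out, so it is better to be conservative.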
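The description does not say what exactly was bisected over, so the following is only a generic sketch of that kind of search. It assumes a list of candidate overlap sites and a callback `loss_is_normal(enabled)` that runs training with overlap enabled only for the given sites and reports whether the loss converges normally. It also assumes exactly one site is responsible. All names except PP(F) are hypothetical.

```python
def find_culprit(sites, loss_is_normal):
    """Bisect a list of candidate sites down to the one that breaks convergence.

    loss_is_normal(enabled) is a hypothetical callback: it should train with
    overlap enabled only for `enabled` and return True if the loss looks normal.
    Assumes exactly one site in `sites` is responsible.
    """
    candidates = list(sites)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if loss_is_normal(first_half):
            # The first half is clean, so the culprit is in the second half.
            candidates = candidates[mid:]
        else:
            candidates = first_half
    return candidates[0]


if __name__ == "__main__":
    # Hypothetical site names; only "PP(F)" comes from the PR description.
    sites = ["PP(F)", "PP(B)", "site_c", "site_d"]
    # Stand-in oracle for a real training run: the loss is bad whenever PP(F) overlaps.
    print(find_culprit(sites, lambda enabled: "PP(F)" not in enabled))
```

Each call to the callback corresponds to one short training run (for example the 200-step run used in this PR), so the number of runs grows with the logarithm of the number of candidate sites.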

This has some performance impact because PP(F) is pushed later. This PR still needs improvement.

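Under normal conditions, with a single-node configuration (29 Decoder + 1 MTP), the loss should drop to 7.3 after 200 steps. Before this PR, with overlap_p2p_comm enabled, the loss only dropped to 8.7. Now it drops to 7.3 whether or not overlap_p2p_comm is enabled.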

paddle-bot bot commented Aug 21, 2025

Thanks for your contribution!

@lshpku lshpku force-pushed the fix-pp-event-wait branch from b78af3d to b6e9841 Compare August 22, 2025 06:56
@lshpku lshpku force-pushed the fix-pp-event-wait branch from b6e9841 to 1b1e63a Compare August 22, 2025 06:57

This Pull Request is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Oct 22, 2025