[ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for AiterFlashAttentionBackend #29887
+234 −54
Purpose
This PR adds the assembly paged attention kernel from aiter to AiterFlashAttentionBackend. We verified this implementation with Qwen3-30B-A3B-FP8 on MI308 and observed a throughput gain of at least about 20% and a clear reduction in TPOT latency.
This PR also adds a flag, USING_SHUFFLE_LAYOUT, that controls whether assembly paged attention is enabled. It is set to False by default to avoid unexpected behavior, and it will be removed once assembly paged attention has been fully verified by users.
Test Plan
gsm8k for accuracy
vllm bench for performance
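For comparing a baseline run against a run with the new kernel enabled, the two metrics named in this description (throughput gain and TPOT) can be derived as in the rough sketch below. The helper names are hypothetical and not part of vLLM or aiter, and the numbers in the usage lines are placeholders for illustration only, not measurements from this PR. TPOT is taken here as the usual time per output token after the first token.

```python
def tpot_ms(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time per output token (ms), excluding the first token.

    Hypothetical helper: assumes TPOT = (end-to-end latency - TTFT) / (tokens - 1).
    """
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_s - ttft_s) / (output_tokens - 1) * 1000.0


def relative_gain(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, e.g. 0.2 means +20%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


# Placeholder values for illustration only (not results from this PR).
baseline_tok_per_s = 1000.0
candidate_tok_per_s = 1200.0
gain = relative_gain(baseline_tok_per_s, candidate_tok_per_s)
print(f"throughput gain: {gain:.1%}")
print(f"TPOT: {tpot_ms(2.1, 0.1, 101):.2f} ms")
```

Running the same benchmark configuration once with USING_SHUFFLE_LAYOUT left at False and once with it enabled, then feeding both results through helpers like these, keeps the comparison on identical terms.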
Test Result
accuracy result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.