[Kernel]Support W4A8 Grouped GEMM on Hopper #29691
Conversation
Code Review
This pull request adds support for W4A8 grouped GEMM on Hopper GPUs, a significant feature for running quantized Mixture-of-Experts models efficiently. The changes span C++ CUDA kernels, Python bindings, and integration into the model execution layers. The implementation looks solid and comes with new tests for the functionality. I've identified a couple of critical issues related to data-type checks that could cause runtime failures for supported configurations; addressing these will improve the robustness of the new kernel.
dsikka left a comment
The CT integration looks clean to me! Do we have a test model we can add?
@dsikka thanks! We can use https://huggingface.co/czhu-cohere/Qwen3-30B-A3B-quantized.w4a8
LucasWilkinson left a comment
LGTM! Thanks; amazing work!
```cpp
    cutlass::gemm::GroupProblemShape<Shape<int, int, int>>;  // <M,N,K> per group
using MmaType = cutlass::float_e4m3_t;
using QuantType = cutlass::int4b_t;
```
any chance this could be extended to mxfp4 too? would be nice if we could make this compatible with gpt-oss (could be done in a future PR)
mxfp4 is an e8m0 scaling factor for every 32 elements? I think there is a group size limitation of 128 here though because the activation is 8 bits
Purpose
As the title says; the benefit of W4A8 is that it can use FP8 tensor cores while still maintaining the low memory footprint of W4A16 (with negligible quality loss). In addition, there is no Machete-like implementation in vLLM for W4A16 grouped GEMM, so the compute gains should be even larger compared to the current Marlin kernels.
The CUTLASS kernel implementation follows example 69, which uses a LUT-based method for fast INT4 -> FP8 conversion. As with dense W4A8, we also add a per-channel/per-token epilogue.
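As a rough illustration of the two ideas above, here is a minimal PyTorch reference sketch (not the CUTLASS kernel): a 16-entry LUT that maps signed int4 nibbles to their FP8-representable values, followed by a per-channel/per-token scaling epilogue. All shapes and names are illustrative.

```python
import torch

# Nibble index 0..7 -> values 0..7, index 8..15 -> values -8..-1; every entry is
# exactly representable in e4m3, so the real kernel can store raw e4m3 bytes.
INT4_LUT = torch.tensor([v if v < 8 else v - 16 for v in range(16)], dtype=torch.float32)

def dequant_int4(nibbles: torch.Tensor) -> torch.Tensor:
    """nibbles: integer tensor with values in [0, 15] (unpacked int4 weights)."""
    return INT4_LUT[nibbles.long()]

def w4a8_gemm_reference(a_fp8, w_nibbles, a_scale, w_scale):
    """a_fp8: [M, K] fp8 activations, w_nibbles: [K, N] unpacked int4 weights,
    a_scale: [M] per-token scales, w_scale: [N] per-channel scales."""
    acc = a_fp8.float() @ dequant_int4(w_nibbles)      # fp32 accumulation for reference
    return acc * a_scale[:, None] * w_scale[None, :]   # per-token x per-channel epilogue

a = torch.randn(8, 128).to(torch.float8_e4m3fn)        # simulated fp8 activations
w = torch.randint(0, 16, (128, 64))                    # simulated unpacked int4 weights
out = w4a8_gemm_reference(a, w, torch.rand(8), torch.rand(64))
```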
We have uploaded a W4A8 quantized variant of Qwen3-30B-A3B as an e2e sanity check.
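For reference, a minimal way to smoke-test the uploaded checkpoint end to end with vLLM's offline API; the quantization scheme is picked up from the checkpoint's compressed-tensors config, and the prompt/sampling settings below are only illustrative.

```python
from vllm import LLM, SamplingParams

# Hedged example: load the W4A8 checkpoint and generate a short completion.
llm = LLM(model="czhu-cohere/Qwen3-30B-A3B-quantized.w4a8")
outputs = llm.generate(["What is 2 + 2?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```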
C++ changes
- `csrc/quantization/cutlass_w4a8/w4a8_utils.cu`
- `csrc/quantization/cutlass_w4a8/get_group_starts.cuh`
- `csrc/quantization/cutlass_w4a8/w4a8_grouped_mm_entry.cu`: in the weight pre-processing op (`encode_and_reorder_int4b`) we construct the layout object and serialize it to a torch tensor so that we can pass it into the grouped GEMM at runtime. This avoids having to reconstruct the layout at runtime, which would incur significant overhead when the number of experts is large. A `static_assert` and `layout_width` should guarantee that the layout can be serialized to the expected torch tensor dtype/size. (See the sketch after this list.)
- `csrc/quantization/w8a8/cutlass/moe/moe_data.cu`: `get_cutlass_moe_mm_problem_sizes` is coupled with `SwapAB`, so I added an argument that lets the caller explicitly specify whether `SwapAB` is true or false (for the RS GEMM it is always true, since the operand to be dequantized, B, needs to be on the LHS).
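A minimal sketch of the load-time flow described above. The op name `encode_and_reorder_int4b` comes from this PR, but the `torch.ops` binding path, call signature, and return values shown here are assumptions for illustration only.

```python
import torch

def preprocess_expert_weights(packed_int4_weights: list[torch.Tensor]):
    """Run once per expert at weight-loading time, not on every forward pass."""
    reordered, layouts = [], []
    for w in packed_int4_weights:
        # Hypothetical binding: encode/reorder the packed int4 weight and get the
        # CUTLASS layout back serialized as a plain torch tensor, so the grouped
        # GEMM can consume it at runtime without rebuilding the layout object.
        w_reordered, layout_tensor = torch.ops._C.encode_and_reorder_int4b(w)
        reordered.append(w_reordered)
        layouts.append(layout_tensor)
    return reordered, layouts
```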
Python changes

- `vllm/model_executor/layers/fused_moe/config.py`
- `vllm/model_executor/layers/fused_moe/modular_kernel.py`: `w1_scale` is used for the group-wise scales.
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py`:
  - `FusedMoeWeightScaleSupported.GROUP.value` and `FusedMoeWeightScaleSupported.CHANNEL.value` are used to load the group and channel scales respectively.
  - The weights are reordered for cutlass with the `reorder_tensor` op explained above (in practice that means for small MoEs like Qwen 30B you may not be able to do TP2).
  - `s_strides1/2`, which store the strides for the group scales, have shape `[num_experts, 2]` and dtype `int64`, since that is what the kernel expects (see the sketch after this list).
  - `b_strides1/2` are returned by the reordering op and saved to pass in at runtime.
- `vllm/model_executor/layers/fused_moe/cutlass_moe.py`: `SwapAB` is always true.
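A small sketch of the stride bookkeeping mentioned above. The container shape/dtype (`[num_experts, 2]`, `int64`) follows the PR text; which two strides are stored per expert is an assumption made only for illustration.

```python
import torch

num_experts, n, k, group_size = 4, 512, 1024, 128

# Illustrative per-expert group-wise weight scales (one scale per 128-wide group).
w1_group_scales = torch.randn(num_experts, k // group_size, n)

# One int64 stride pair per expert, matching the layout the kernel expects.
s_strides1 = torch.empty((num_experts, 2), dtype=torch.int64)
for e in range(num_experts):
    scale = w1_group_scales[e]
    s_strides1[e, 0] = scale.stride(0)  # assumed: stride along the group dimension
    s_strides1[e, 1] = scale.stride(1)  # assumed: stride along the output-channel dim
```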
Limitations

Have not implemented/checked compatibility with EP options other than the default.
Test Plan
- kernel correctness test: `tests/kernels/quantization/test_cutlass_w4a8_moe.py`
- e2e eval: lm_eval gsm8k, comparing the Qwen3-30B-A3B W4A16 and W4A8 variants (a sketch of this eval follows the list)
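A sketch of how the gsm8k comparison could be run with lm_eval's Python API and the vLLM backend; the exact `model_args` and batching settings here are illustrative assumptions.

```python
import lm_eval

# Hedged example: evaluate the W4A8 variant on gsm8k via the vLLM backend;
# repeat with the W4A16 checkpoint to compare.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=czhu-cohere/Qwen3-30B-A3B-quantized.w4a8",
    tasks=["gsm8k"],
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```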
Test Result
- `tests/kernels/quantization/test_cutlass_w4a8_moe.py`: pass
- lm_eval
- 8k prefill for Qwen3-30B-A3B comparing W4A8 and W4A16

Note that the expert sizes for Qwen3-30B-A3B are quite small and it seems hard to approach peak FP8 TFLOPs with these shapes under the current schedule; larger experts can get higher FLOPs. We leave investigation of this to future work.