
Conversation

@jinzhen-lin (Contributor) commented Dec 2, 2025

This PR adds Marlin kernel support for the Turing architecture (sm75), e.g. the 2080 Ti and T4.

  • The Turing architecture lacks the cp.async instruction, but we can still load from global memory and write to shared memory with ordinary synchronous instructions (see the sketch after this list).
  • The Turing architecture does not have the m16n8k16 MMA instruction, but it does have m16n8k8. Issuing that instruction twice and accumulating into the same output fragment gives the same result as a single m16n8k16 (also sketched below).
  • On Turing, MMA throughput with an FP32 accumulator is only half of that with an FP16 accumulator (this holds not only for Turing but also for architectures such as Ada Lovelace). If the final result does not overflow, the probability of an intermediate result overflowing is also very low, so for now the Turing implementation uses an FP16 accumulator while the other architectures are left unchanged.
  • Supported weight formats: GPTQ, AWQ, FP8, NVFP4 (MXFP4 is not supported since it requires BF16 activations).
  • Supported activations: FP16, INT8 (INT8 only for GPTQ and AWQ).
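
Below is a minimal sketch of the two Turing fallbacks described above. It is not the code from this PR; the helper names (`cp_async4_or_fallback`, `mma_m16n8k16_f16acc`) are illustrative, and the k-splitting argument follows the standard PTX register mapping for the m16n8k16 fragments.

```cuda
#include <cstdint>

// 16-byte global -> shared copy: cp.async on sm_80+, a plain synchronous
// load/store pair on sm_75 where cp.async does not exist.
__device__ inline void cp_async4_or_fallback(void* smem_ptr,
                                             const void* glob_ptr) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :
               : "r"(smem), "l"(glob_ptr));
#else
  // Turing fallback: one 128-bit load from global memory followed by a
  // 128-bit store to shared memory.
  *reinterpret_cast<int4*>(smem_ptr) = *reinterpret_cast<const int4*>(glob_ptr);
#endif
}

// One m16n8k16 FP16 MMA emulated with two m16n8k8 MMAs on sm_75.
// In the m16n8k16 fragment layout, {a[0], a[1]} and b[0] cover k = 0..7,
// while {a[2], a[3]} and b[1] cover k = 8..15, so two k=8 MMAs that
// accumulate into the same D fragment produce the same result.
// The FP16 accumulator (2 x 32-bit registers) matches the choice of FP16
// accumulation on Turing described above.
__device__ inline void mma_m16n8k16_f16acc(const uint32_t* a,  // 4 regs
                                           const uint32_t* b,  // 2 regs
                                           uint32_t* d) {      // 2 regs
  asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3}, {%4}, {%0,%1};\n"
      : "+r"(d[0]), "+r"(d[1])
      : "r"(a[0]), "r"(a[1]), "r"(b[0]));
  asm volatile(
      "mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3}, {%4}, {%0,%1};\n"
      : "+r"(d[0]), "+r"(d[1])
      : "r"(a[2]), "r"(a[3]), "r"(b[1]));
}
```

On sm_80+ the cp.async path would additionally need the usual commit-group / wait-group fencing, which is omitted here for brevity.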

Kernel Benchmark

2080 Ti + Dense Marlin + GPTQ Channelwise

| n | k | m | torch-fp16 | marlin-w4a16 | marlin-w4a8-int8 |
|------|------|------|------------|--------------|------------------|
| 1024 | 2048 | 1 | 0.0116937 | 0.00575462 | 0.00530092 |
| 1024 | 2048 | 2 | 0.0141614 | 0.00586464 | 0.00534524 |
| 1024 | 2048 | 4 | 0.0132746 | 0.00596898 | 0.00544628 |
| 1024 | 2048 | 8 | 0.0133064 | 0.00612971 | 0.0055609 |
| 1024 | 2048 | 16 | 0.013491 | 0.00690344 | 0.00596592 |
| 1024 | 2048 | 32 | 0.0136423 | 0.00926577 | 0.00758568 |
| 1024 | 2048 | 64 | 0.0158824 | 0.0149638 | 0.0111064 |
| 1024 | 2048 | 128 | 0.023759 | 0.0209582 | 0.0143538 |
| 1024 | 2048 | 256 | 0.0349035 | 0.0301925 | 0.0181232 |
| 1024 | 2048 | 512 | 0.0500498 | 0.044179 | 0.0295197 |
| 1024 | 2048 | 1024 | 0.0899547 | 0.077805 | 0.0546717 |
| 1024 | 2048 | 2048 | 0.157824 | 0.14364 | 0.099661 |
| 1024 | 2048 | 4096 | 0.313458 | 0.269902 | 0.183839 |
| 2048 | 1024 | 1 | 0.00999344 | 0.00581668 | 0.00541013 |
| 2048 | 1024 | 2 | 0.00840535 | 0.00590156 | 0.00544462 |
| 2048 | 1024 | 4 | 0.00863134 | 0.00599312 | 0.0055447 |
| 2048 | 1024 | 8 | 0.0113632 | 0.00618215 | 0.00561357 |
| 2048 | 1024 | 16 | 0.0115773 | 0.00673722 | 0.0057703 |
| 2048 | 1024 | 32 | 0.0136734 | 0.00927547 | 0.00743136 |
| 2048 | 1024 | 64 | 0.0149048 | 0.0150578 | 0.0111013 |
| 2048 | 1024 | 128 | 0.0169164 | 0.0176558 | 0.0113795 |
| 2048 | 1024 | 256 | 0.0266904 | 0.0271966 | 0.0202067 |
| 2048 | 1024 | 512 | 0.0745618 | 0.0369416 | 0.0243453 |
| 2048 | 1024 | 1024 | 0.0841283 | 0.082084 | 0.0573731 |
| 2048 | 1024 | 2048 | 0.163343 | 0.1494 | 0.103495 |
| 2048 | 1024 | 4096 | 0.324556 | 0.282609 | 0.196996 |
| 2048 | 2048 | 1 | 0.0176079 | 0.00765261 | 0.00705818 |
| 2048 | 2048 | 2 | 0.0225935 | 0.00773011 | 0.00711238 |
| 2048 | 2048 | 4 | 0.0230041 | 0.00791787 | 0.00727277 |
| 2048 | 2048 | 8 | 0.0236289 | 0.00824997 | 0.00741823 |
| 2048 | 2048 | 16 | 0.0245314 | 0.00918505 | 0.00762321 |
| 2048 | 2048 | 32 | 0.0350614 | 0.0127286 | 0.00993129 |
| 2048 | 2048 | 64 | 0.0352187 | 0.0214241 | 0.0145516 |
| 2048 | 2048 | 128 | 0.0361422 | 0.0293447 | 0.0186189 |
| 2048 | 2048 | 256 | 0.0483972 | 0.0441569 | 0.0305642 |
| 2048 | 2048 | 512 | 0.144697 | 0.0781706 | 0.0554056 |
| 2048 | 2048 | 1024 | 0.162762 | 0.142649 | 0.099364 |
| 2048 | 2048 | 2048 | 0.31577 | 0.268887 | 0.182841 |
| 2048 | 2048 | 4096 | 0.626884 | 0.521153 | 0.349923 |
| 2048 | 4096 | 1 | 0.0321256 | 0.0115125 | 0.0106321 |
| 2048 | 4096 | 2 | 0.0370204 | 0.0115893 | 0.0106488 |
| 2048 | 4096 | 4 | 0.0372456 | 0.0118732 | 0.0108092 |
| 2048 | 4096 | 8 | 0.0377987 | 0.0122913 | 0.0110519 |
| 2048 | 4096 | 16 | 0.0389271 | 0.0140648 | 0.0114702 |
| 2048 | 4096 | 32 | 0.0470796 | 0.0197545 | 0.0145479 |
| 2048 | 4096 | 64 | 0.048178 | 0.0340334 | 0.0218937 |
| 2048 | 4096 | 128 | 0.0619403 | 0.0586362 | 0.0369304 |
| 2048 | 4096 | 256 | 0.0891697 | 0.077394 | 0.0550759 |
| 2048 | 4096 | 512 | 0.164789 | 0.142405 | 0.0994122 |
| 2048 | 4096 | 1024 | 0.319634 | 0.261681 | 0.177465 |
| 2048 | 4096 | 2048 | 0.622543 | 0.510484 | 0.343938 |
| 2048 | 4096 | 4096 | 1.23747 | 0.99095 | 0.667521 |
| 4096 | 2048 | 1 | 0.0319543 | 0.0097417 | 0.00963335 |
| 4096 | 2048 | 2 | 0.0408578 | 0.00977983 | 0.00962789 |
| 4096 | 2048 | 4 | 0.0415483 | 0.00994312 | 0.0096959 |
| 4096 | 2048 | 8 | 0.0427313 | 0.0102296 | 0.00988781 |
| 4096 | 2048 | 16 | 0.0444555 | 0.0122868 | 0.010294 |
| 4096 | 2048 | 32 | 0.0423349 | 0.0184839 | 0.0131554 |
| 4096 | 2048 | 64 | 0.0434447 | 0.0300758 | 0.0191552 |
| 4096 | 2048 | 128 | 0.0754317 | 0.0453082 | 0.0334332 |
| 4096 | 2048 | 256 | 0.0863373 | 0.0801641 | 0.0581882 |
| 4096 | 2048 | 512 | 0.161727 | 0.14526 | 0.101503 |
| 4096 | 2048 | 1024 | 0.32583 | 0.273086 | 0.186877 |
| 4096 | 2048 | 2048 | 0.63812 | 0.525983 | 0.357648 |
| 4096 | 2048 | 4096 | 1.27389 | 1.04019 | 0.703852 |
| 4096 | 4096 | 1 | 0.0612238 | 0.020549 | 0.0211881 |
| 4096 | 4096 | 2 | 0.0695407 | 0.0206079 | 0.0212637 |
| 4096 | 4096 | 4 | 0.070176 | 0.0206765 | 0.02131 |
| 4096 | 4096 | 8 | 0.071402 | 0.0208103 | 0.0212639 |
| 4096 | 4096 | 16 | 0.0730782 | 0.0230811 | 0.0217029 |
| 4096 | 4096 | 32 | 0.0719593 | 0.0376631 | 0.0293375 |
| 4096 | 4096 | 64 | 0.0736471 | 0.0592898 | 0.0393748 |
| 4096 | 4096 | 128 | 0.147562 | 0.0775972 | 0.0557396 |
| 4096 | 4096 | 256 | 0.168878 | 0.146612 | 0.107362 |
| 4096 | 4096 | 512 | 0.317432 | 0.262749 | 0.181462 |
| 4096 | 4096 | 1024 | 0.642687 | 0.510883 | 0.345919 |
| 4096 | 4096 | 2048 | 1.2584 | 1.00235 | 0.676912 |
| 4096 | 4096 | 4096 | 2.51882 | 1.99368 | 1.3387 |
| 2048 | 8192 | 1 | 0.16137 | 0.026266 | 0.0245548 |
| 2048 | 8192 | 2 | 0.0662309 | 0.0263196 | 0.024618 |
| 2048 | 8192 | 4 | 0.0665458 | 0.0265283 | 0.0248257 |
| 2048 | 8192 | 8 | 0.0671225 | 0.0270891 | 0.0250747 |
| 2048 | 8192 | 16 | 0.0683316 | 0.0298897 | 0.0256567 |
| 2048 | 8192 | 32 | 0.0769621 | 0.0401793 | 0.0310113 |
| 2048 | 8192 | 64 | 0.0788945 | 0.0649515 | 0.0442475 |
| 2048 | 8192 | 128 | 0.113392 | 0.10748 | 0.0693615 |
| 2048 | 8192 | 256 | 0.173364 | 0.159826 | 0.10942 |
| 2048 | 8192 | 512 | 0.323973 | 0.299647 | 0.200056 |
| 2048 | 8192 | 1024 | 0.64264 | 0.53803 | 0.361474 |
| 2048 | 8192 | 2048 | 1.25534 | 1.02195 | 0.68934 |
| 2048 | 8192 | 4096 | 2.50339 | 1.96898 | 1.31185 |
| 8192 | 2048 | 1 | 0.0633854 | 0.0198738 | 0.0192932 |
| 8192 | 2048 | 2 | 0.0635814 | 0.0199413 | 0.0193456 |
| 8192 | 2048 | 4 | 0.0637044 | 0.0200488 | 0.0194411 |
| 8192 | 2048 | 8 | 0.0638777 | 0.0203066 | 0.0196915 |
| 8192 | 2048 | 16 | 0.0642607 | 0.0218322 | 0.0200444 |
| 8192 | 2048 | 32 | 0.0689704 | 0.0307074 | 0.0273373 |
| 8192 | 2048 | 64 | 0.0709058 | 0.0468018 | 0.0360965 |
| 8192 | 2048 | 128 | 0.0898186 | 0.0811262 | 0.058757 |
| 8192 | 2048 | 256 | 0.166978 | 0.149034 | 0.105919 |
| 8192 | 2048 | 512 | 0.326734 | 0.277388 | 0.192024 |
| 8192 | 2048 | 1024 | 0.668465 | 0.536812 | 0.368867 |
| 8192 | 2048 | 2048 | 1.29818 | 1.07355 | 0.738619 |
| 8192 | 2048 | 4096 | 2.58765 | 2.14702 | 1.4801 |
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for the Turing architecture (sm75) to the Marlin kernels, including both dense and MoE variants. The changes involve adding architecture-specific compilation paths in CMake, providing synchronous implementations for cp_async on older architectures, and using m16n8k8 MMA instructions to emulate m16n8k16. The changes look mostly correct and well-structured. However, I've found a few critical issues: a likely debugging leftover in a preprocessor directive that would cause performance regressions on newer GPUs, and the removal of static_asserts that could hide potential shared memory corruption bugs. There is also a minor correctness issue in a CMake file. Please address these points.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


