Fix mma tensorize error #18528
base: main
Conversation
…core and incorrect c index
Summary of Changes
Hello @Asuka0630, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves critical issues affecting MMA tensorization within the TVM framework. Specifically, it corrects an overestimation of shared memory usage and rectifies errors in tensor core accumulation, ensuring that tensorized operations produce accurate results and use memory efficiently. These fixes are crucial for reliable behavior when using MMA with MultiLevelTilingTensorCore or direct tensorization.
Code Review
This pull request addresses two critical bugs in MMA tensorization. The first bug, an overestimation of shared memory, is fixed by preventing the threadIdx.x dimension from being included in the cache size calculation for warp-level write caches. The second bug, which caused incorrect accumulation results, is resolved by correcting an offset calculation in the get_index_C function. Both fixes appear correct and are well-motivated. My review includes a couple of minor suggestions to improve code comments for future maintainability.
  bi = i // 8
  bj = j // 8
- return (bi // 2) * 2 * stride_b + bi % 2 + bj * 2
+ return ((bi // 2) * 2 * stride_b + bi % 2 + bj * 2) * 2
This change correctly fixes the offset calculation. For better maintainability, please consider adding a comment explaining the multiplication by 2. For example, explaining how it relates to the register layout of the C fragment for mma.sync where each element might occupy two addressable units for the ptx_mma intrinsic.
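For illustration, here is a trimmed Python sketch of how such a comment could read in context. The helper name and signature are illustrative only (the real logic lives inside TVM's get_index_C for the MMA intrinsics), and the comment wording follows the review suggestion above rather than the PR's actual text:

def index_C(i, j, stride_b):
    # Locate which 8x8 sub-tile of the C fragment (i, j) falls into.
    bi = i // 8
    bj = j // 8
    # Per the review suggestion above: the trailing "* 2" reflects the register layout
    # of the C fragment for mma.sync, where each element may occupy two addressable
    # units for the ptx_mma intrinsic. Without this factor the offset is off by two,
    # and accumulation lands in the wrong fragment slots.
    return ((bi // 2) * 2 * stride_b + bi % 2 + bj * 2) * 2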
// writing C_reindex_m16n8k8_matrixC_shared_dyn is warp execution
continue;
The logic to skip threadIdx.x for write caches is correct. The added comment is helpful, but could be slightly more descriptive to aid future maintenance. Consider expanding it to explain why warp execution implies skipping this dimension, to prevent overestimation of shared memory.
// For write caches that are part of a warp-level execution (e.g., storing MMA results),
// the threadIdx.x dimension should not contribute to the cache size calculation, as all
// threads in the warp access the same memory region. Skipping it prevents overestimation
// of shared memory.
continue;
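To make the effect of the skip concrete, here is a minimal standalone Python sketch of the idea. This is not the TVM C++ code around InsertCacheStage; the loop/extent representation and function name are simplified assumptions for illustration only:

def cache_extent(loop_extents, is_warp_level_write_cache):
    # Product of the loop extents that determine the staging buffer's size.
    size = 1
    for name, extent in loop_extents:
        # For a warp-executed write cache (e.g. storing MMA accumulator fragments),
        # all 32 lanes cooperate on the same staging region, so threadIdx.x must not
        # be counted as an extra data dimension of the cache.
        if is_warp_level_write_cache and name == "threadIdx.x":
            continue
        size *= extent
    return size

With the skip in place, a threadIdx.x loop of extent 32 no longer multiplies the estimated cache size by 32.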
Dear reviewers
Why
When forcing the use of MMA with MultiLevelTilingTensorCore or directly applying tensorization via the script below, the required shared memory size is significantly overestimated compared to the actual usage; at the same time, the accumulated result of MMA is also incorrect. These issues stem from two root causes:
1. In MmaToGlobal::Rewrite, an extra threadIdx.x dimension is introduced when calling InsertCacheStage, which confuses the memory analysis and leads to inflated shared memory estimates (roughly quantified in the sketch below).
2. In get_mma_sync_intrin, the offset computation for fragment C in get_index_C is incorrect, resulting in erroneous accumulation results.

This PR addresses both issues to ensure accurate shared memory estimation and correct tensor core accumulation behavior.
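To put a rough number on the first root cause, consider a hypothetical warp-level write cache for a 16x8 fp32 accumulator tile (sizes chosen purely for illustration, not taken from the PR):

tile_bytes = 16 * 8 * 4            # one warp's fp32 accumulator tile: 512 bytes actually needed
warp_size = 32                     # extent of the spurious threadIdx.x dimension
inflated_bytes = tile_bytes * warp_size
print(tile_bytes, inflated_bytes)  # 512 16384 -> a 32x overestimate of shared memory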
How
This PR includes the following fixes:
1. Skip adding the extra threadIdx.x dimension in InsertCacheStage when it is not required, to prevent spurious shared memory overestimation and repeated stores.
2. Correct the offset calculation in get_index_C to ensure accurate accumulation results during tensor core execution.

Result
The above script produces results that match those of PyTorch.
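A minimal sketch of the kind of check implied by this statement is below; run_tensorized_matmul is a hypothetical stand-in for the tensorized TVM kernel built by the (omitted) reproduction script, not something added by this PR:

import numpy as np
import torch

def check_against_pytorch(run_tensorized_matmul, m=1024, n=1024, k=1024):
    # Random fp16 inputs, as in a typical m16n8k8 MMA workload.
    a = torch.randn(m, k, dtype=torch.float16)
    b = torch.randn(k, n, dtype=torch.float16)
    # PyTorch reference with fp32 accumulation.
    ref = (a.float() @ b.float()).numpy()
    # Output of the tensorized kernel (hypothetical callable, see above).
    out = run_tensorized_matmul(a.numpy(), b.numpy())
    np.testing.assert_allclose(out, ref, rtol=1e-2, atol=1e-2)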
Env
NVIDIA A100-SXM4-80GB