Description
Describe what you are looking for
On-the-fly transposition is prohibitively expensive on some ISAs, like SME. As a result, our symmetric operations, which are logically cheaper, end up much slower than the packed ones, despite requiring less memory traffic. We should consider pre-packing both inputs before passing them to our multiplication APIs, but that introduces issues for some ISAs, like AMX, where the swizzling order of the A and B arguments is fundamentally different.
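A minimal sketch of the layout problem, written in plain Python for readability. It assumes AMX-style BF16 tiles: the A operand is a plain row-major M×K tile, while the B operand uses a VNNI-style layout that interleaves pairs of consecutive K elements, stored as `[K/2][N][2]`. The names `pack_a`, `pack_b_vnni`, `tile_matmul`, and `symmetric_product` are hypothetical and not part of this project's API. Real kernels would also handle tiling, padding, and BF16 conversion.

```python
# Hypothetical sketch: why one pre-packed buffer cannot feed both AMX-style operands.
# Assumed layouts: A tile is row-major [M][K]; B tile is VNNI-style [K/2][N][2],
# so B[k][n] lands at index ((k // 2) * N + n) * 2 + k % 2.


def pack_a(a, m, k):
    """Pack an m x k matrix as an A-operand tile (plain row-major copy)."""
    return [a[i][p] for i in range(m) for p in range(k)]


def pack_b_vnni(b, k, n):
    """Pack a k x n matrix as a B-operand tile, interleaving pairs of K rows."""
    assert k % 2 == 0, "VNNI-style pair packing needs an even K"
    out = [0] * (k * n)
    for p in range(k):
        for j in range(n):
            out[((p // 2) * n + j) * 2 + p % 2] = b[p][j]
    return out


def tile_matmul(packed_a, packed_b, m, k, n):
    """Reference C = A @ B that reads both operands from their packed layouts."""
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += packed_a[i * k + p] * packed_b[((p // 2) * n + j) * 2 + p % 2]
            c[i][j] = acc
    return c


def symmetric_product(x):
    """C = X @ X^T, pre-packing X twice: once as A, once (transposed) as B."""
    m, k = len(x), len(x[0])
    # Transpose once up front instead of inside the inner loop.
    xt = [[x[i][p] for i in range(m)] for p in range(k)]
    return tile_matmul(pack_a(x, m, k), pack_b_vnni(xt, k, m), m, k, m)
```

For `x = [[1, 2, 3, 4], [5, 6, 7, 8]]`, `pack_a(x, 2, 4)` yields `[1, 2, 3, 4, 5, 6, 7, 8]`, while `pack_b_vnni` of its transpose yields `[1, 2, 5, 6, 3, 4, 7, 8]`. Under this assumed layout, pre-packing a symmetric product means either storing the operand twice or supporting a second packing mode for the same data.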
Can you contribute to the implementation?
- I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct