Feature: Require "pre-packing" for symmetric GEMM-like kernels

Describe what you are looking for

On-the-fly transposition is prohibitively expensive on some ISAs, like SME. This makes our logically cheaper symmetric operations much slower than the packed ones, despite requiring less memory traffic. We should consider pre-packing both inputs before passing them to our multiplication APIs, but that introduces issues for some ISAs, like AMX, where the swizzling order of the A and B arguments is fundamentally different.
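To make the idea concrete, here is a minimal sketch of a "pack once, reuse for every tile pair" symmetric kernel. It is illustrative only: `pack_panels`, `symmetric_dots_packed`, and the depth-major panel layout are hypothetical names and choices, not this project's API and not the actual SME or AMX swizzle. It assumes the symmetric operation is the Gram matrix of one input against itself.

```python
def pack_panels(rows, tile):
    """Reorder a row-major matrix into depth-major panels of `tile` rows each.

    Hypothetical layout: panel[k][r] holds element k of row r in the panel,
    so each panel is stored transposed and can be streamed along the depth.
    """
    panels = []
    for start in range(0, len(rows), tile):
        block = rows[start:start + tile]
        panels.append([list(column) for column in zip(*block)])
    return panels


def symmetric_dots_packed(rows, tile=2):
    """Compute rows x rows^T, transposing (packing) the input only once."""
    n = len(rows)
    panels = pack_panels(rows, tile)  # single packing pass, reused below
    result = [[0.0] * n for _ in range(n)]
    for pi, panel_i in enumerate(panels):
        for pj in range(pi, len(panels)):  # upper-triangular tile pairs only
            panel_j = panels[pj]
            for k in range(len(panel_i)):  # walk the shared depth dimension
                for ri, a in enumerate(panel_i[k]):
                    for rj, b in enumerate(panel_j[k]):
                        result[pi * tile + ri][pj * tile + rj] += a * b
    # Mirror the upper triangle into the lower one.
    for i in range(n):
        for j in range(i):
            result[i][j] = result[j][i]
    return result
```

In this sketch the same packed panels serve as both the left and the right operand. That is exactly the assumption that breaks down on ISAs where the A and B arguments need different swizzling, in which case two packed copies (or a second packing routine) would be required.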

Can you contribute to the implementation?

I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

Feature: Require "pre-packing" for `symmetric` GEMM-like kernels #323