Feature: Require "pre-packing" for symmetric GEMM-like kernels #323

@ashvardanian

Description

Describe what you are looking for

On-the-fly transposition is prohibitively expensive on some ISAs, such as SME. As a result, our logically cheaper symmetric operations end up much slower than the variants that pack an input once, despite requiring less memory traffic. We should consider requiring both inputs to be pre-packed before they are passed to our multiplication APIs, but this introduces issues for some ISAs like AMX, where the swizzling order of the A and B arguments is fundamentally different.
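
To make the trade-off concrete, here is a minimal sketch in plain Python of a symmetric product `C = A · Aᵀ` computed from two pre-packed copies of the same matrix. Everything in it is an assumption for illustration: the function names, the `left_width` and `right_width` parameters, and the layouts themselves are hypothetical and do not reflect this project's API or any ISA's exact tile format. The two sides are modeled as the same depth-grouped layout with different group widths, which is a simplification of how the A and B swizzles can differ.

```python
from typing import List

Matrix = List[List[float]]


def pack_depth_groups(a: Matrix, width: int) -> List[float]:
    """Hypothetical packing: split the depth (column) axis into groups of
    `width` elements and store each group for every row contiguously."""
    rows, depth = len(a), len(a[0])
    assert depth % width == 0, "sketch assumes depth is a multiple of width"
    packed: List[float] = []
    for start in range(0, depth, width):
        for row in range(rows):
            packed.extend(a[row][start:start + width])
    return packed


def packed_index(row: int, d: int, rows: int, width: int) -> int:
    """Offset of element (row, d) inside a buffer built by pack_depth_groups."""
    return (d // width) * rows * width + row * width + d % width


def symmetric_from_packed(left: List[float], right: List[float], rows: int,
                          depth: int, left_width: int,
                          right_width: int) -> Matrix:
    """Compute C = A * A^T from two differently packed copies of A.
    No transposition happens here: both operands are read in packed order."""
    c = [[0.0] * rows for _ in range(rows)]
    for i in range(rows):
        for j in range(rows):
            acc = 0.0
            for d in range(depth):
                acc += (left[packed_index(i, d, rows, left_width)]
                        * right[packed_index(j, d, rows, right_width)])
            c[i][j] = acc
    return c


a = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]

# The same input packed twice, once per operand role (widths are made up).
left = pack_depth_groups(a, width=4)
right = pack_depth_groups(a, width=2)
result = symmetric_from_packed(left, right, rows=3, depth=4,
                               left_width=4, right_width=2)
```

In this sketch the symmetric case still needs two packed buffers for one logical input, because the left and right layouts disagree. That is the cost the request would have to weigh against transposing on the fly.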

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
