Description
Describe what you are looking for
Power ISA v3.1 (POWER10/POWER11) has two SIMD tiers: VSX (128-bit, 64 registers) and MMA (512-bit accumulators, 4x4 outer-product tiles). MMA is architecturally similar to AMX but supports a richer type menu (FP16, BF16, FP32, INT8, INT4). VSX has native BF16<->FP32 conversion (xvcvbf16spn / xvcvspbf16), avoiding manual upcast. POWER11 (shipped July 2025) doubles SIMD throughput and quadruples MMA throughput per processor vs POWER10, same ISA. A future Dense Math Facility (DMF) will widen accumulators to 1024-bit DMRs, but toolchain support is still early.
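To make the "no manual upcast" point concrete, here is a portable scalar model of what the native conversion does per lane. Widening BF16 to FP32 is exact, since BF16 is just the top 16 bits of an FP32 value; backends without the native instruction have to do this shift-and-reinterpret themselves for every lane. The function name is illustrative, not from the library:

```c
#include <stdint.h>
#include <string.h>

/* Portable per-lane model of the BF16 -> FP32 widening that
 * xvcvbf16spn does natively across a whole VSX register. The
 * widening direction is exact: BF16 is the high 16 bits of FP32. */
static inline float bf16_to_f32(uint16_t bf16_bits) {
    uint32_t f32_bits = (uint32_t)bf16_bits << 16;
    float result;
    memcpy(&result, &f32_bits, sizeof result); /* bit reinterpret */
    return result;
}
```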
Two separate backends: powervsx for single-pair operations using 128-bit vectors, powermma for batched operations using the 4x4 accumulator tiles.
dot/ and dots/
powervsx: nk_dot_f32_powervsx uses vec_madd (4 f32 lanes). nk_dot_f64_powervsx at 2 lanes. nk_dot_bf16_powervsx loads bf16, converts via xvcvbf16spn to f32, then vec_madd — no bit manipulation needed. Integer dot products use vec_mule / vec_mulo for widening i8->i16->i32 accumulation. Complex variants use vec_xor for sign-flip.
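The f32 path above can be sketched as a portable scalar model: four independent partial sums standing in for the four lanes of one 128-bit VSX register, each updated with a fused multiply-add (what `vec_madd` does per lane), reduced once after the loop. The function name is illustrative, not the library's:

```c
#include <stddef.h>

/* Scalar model of the 4-lane vec_madd dot-product pattern:
 * acc[l] plays the role of lane l of a vector accumulator. */
static float dot_f32_4lane(const float *a, const float *b, size_t n) {
    float acc[4] = {0.f, 0.f, 0.f, 0.f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t l = 0; l < 4; ++l)
            acc[l] += a[i + l] * b[i + l]; /* models one vec_madd */
    float sum = acc[0] + acc[1] + acc[2] + acc[3]; /* lane reduction */
    for (; i < n; ++i)
        sum += a[i] * b[i]; /* scalar tail */
    return sum;
}
```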
powermma: The primary value. nk_dot_f16x4_powermma loads two 128-bit FP16 vectors, calls xvf16ger2pp to update a 4x4 f32 accumulator, extracts via xxmfacc after the depth loop. Computes 16 dot products simultaneously — natural fit for the dots/ state machine. BF16 uses xvbf16ger2pp, INT8 uses xvi8ger4spp (rank-4, 16 i8 elements per update), INT4 uses xvi4ger8pp (32 nibbles per update). These are the highest-throughput kernels for batched quantized operations on POWER.
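A portable model of a single `xvf16ger2pp` step may help here (an assumption based on the ISA description of the rank-2 GER instructions, not the library's code): each input vector holds four (even, odd) element pairs, and the 4x4 f32 accumulator receives a rank-2 outer-product update. Looping this over the depth dimension and then extracting the tile (`xxmfacc` in hardware) yields the 16 dot products at once:

```c
/* Scalar model of one rank-2 MMA update: x and y each carry four
 * element pairs; acc[i][j] accumulates the i-th row dot the j-th
 * column, two depth elements per call. */
static void mma_f16ger2pp_model(float acc[4][4],
                                const float x[4][2],
                                const float y[4][2]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            acc[i][j] += x[i][0] * y[j][0] + x[i][1] * y[j][1];
}
```

The INT8 (`xvi8ger4spp`, rank-4) and INT4 (`xvi4ger8pp`, rank-8) forms follow the same shape with more depth elements folded into each update.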
spatial/ and spatials/
powervsx: nk_euclidean_f32_powervsx uses vec_sub + vec_madd. The BF16 path benefits from the native conversion: no manual upcast, unlike LASX or pre-Genoa x86. Cosine uses a triple-accumulator scheme.
powermma: Batched distance matrices in spatials/ are where MMA shines most. Squared Euclidean distance expands as ||a_i - b_j||^2 = ||a_i||^2 - 2*(a_i . b_j) + ||b_j||^2; the cross-term is a rank-k outer product, exactly what xvf32gerpp / xvf16ger2pp computes. Pre-compute per-row/column squared norms, run MMA for the cross-term tile, and combine in the finalize step. Same approach as the existing sapphireamx backend. BF16 and INT8 distance matrices get MMA acceleration via xvbf16ger2pp and xvi8ger4spp.
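The decomposition above can be modeled in portable C for one 4x4 distance tile: precompute squared norms, accumulate the cross-term a_i·b_j as a depth-wise series of rank-1 outer products (what the GER instructions do per step, several depth elements at a time), then combine in a finalize pass. All names are illustrative:

```c
#include <stddef.h>

/* a and b each hold 4 rows of k floats, row-major.
 * out[i][j] = squared Euclidean distance between a's row i and b's row j. */
static void sqeuclidean_4x4_tile(float out[4][4], const float *a,
                                 const float *b, size_t k) {
    float norm_a[4] = {0}, norm_b[4] = {0}, cross[4][4] = {{0}};
    for (size_t d = 0; d < k; ++d) {            /* depth loop */
        for (int i = 0; i < 4; ++i) norm_a[i] += a[i * k + d] * a[i * k + d];
        for (int j = 0; j < 4; ++j) norm_b[j] += b[j * k + d] * b[j * k + d];
        for (int i = 0; i < 4; ++i)             /* rank-1 outer product, */
            for (int j = 0; j < 4; ++j)         /* models one GER update */
                cross[i][j] += a[i * k + d] * b[j * k + d];
    }
    for (int i = 0; i < 4; ++i)                 /* finalize */
        for (int j = 0; j < 4; ++j)
            out[i][j] = norm_a[i] - 2.f * cross[i][j] + norm_b[j];
}
```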
set/ and sets/
powervsx only (MMA not useful here). nk_hamming_u1_powervsx uses vec_xor + vpopcntb (native popcount per byte). Batched sets/ variants replicate accumulators.
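The Hamming kernel reduces to XOR plus population count; a portable sketch of the same pattern 64 bits at a time (the VSX version does it 128 bits at a time with vec_xor and vpopcntb, then sums the per-byte counts) looks like this. `__builtin_popcountll` is GCC/Clang-specific; the function name is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Hamming distance over packed bit vectors: XOR finds differing
 * bits, popcount tallies them per word. */
static size_t hamming_u1(const uint64_t *a, const uint64_t *b, size_t words) {
    size_t distance = 0;
    for (size_t i = 0; i < words; ++i)
        distance += (size_t)__builtin_popcountll(a[i] ^ b[i]);
    return distance;
}
```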
Can you contribute to the implementation?
- I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct