Feature: IBM Power backend #318

@ashvardanian

Description

Describe what you are looking for

Power ISA v3.1 (POWER10/POWER11) has two SIMD tiers: VSX (128-bit vectors, 64 registers) and MMA (512-bit accumulators, 4x4 outer-product tiles). MMA is architecturally similar to Intel AMX but supports a richer set of input types (FP16, BF16, FP32, INT8, INT4). VSX has native BF16<->FP32 conversion (xvcvbf16spn / xvcvspbf16), avoiding the manual upcast needed elsewhere. POWER11 (shipped July 2025) doubles SIMD throughput and quadruples MMA throughput per processor relative to POWER10, with the same ISA. A future Dense Math Facility (DMF) will widen accumulators to 1024-bit DMRs, but toolchain support for it is still early.

Two separate backends: powervsx for single-pair operations using 128-bit vectors, powermma for batched operations using the 4x4 accumulator tiles.

dot/ and dots/

powervsx: nk_dot_f32_powervsx uses vec_madd (4 f32 lanes); nk_dot_f64_powervsx runs at 2 lanes. nk_dot_bf16_powervsx loads bf16, widens via xvcvbf16spn to f32, then applies vec_madd — no bit manipulation needed. Integer dot products use vec_mule / vec_mulo for widening i8->i16->i32 accumulation. Complex variants use vec_xor for the sign flip.

powermma: This is where the primary value lies. nk_dot_f16x4_powermma loads two 128-bit FP16 vectors, calls xvf16ger2pp to update a 4x4 f32 accumulator, and extracts via xxmfacc after the depth loop. Each tile computes 16 dot products simultaneously — a natural fit for the dots/ state machine. BF16 uses xvbf16ger2pp, INT8 uses xvi8ger4spp (rank-4, 16 i8 elements per update), and INT4 uses xvi4ger8pp (32 nibbles per update). These are the highest-throughput kernels for batched quantized operations on POWER.

spatial/ and spatials/

powervsx: nk_euclidean_f32_powervsx uses vec_sub + vec_madd. The BF16 path benefits from native conversion — no manual upcast, unlike LASX or pre-Genoa x86. Cosine uses a triple-accumulator scheme: the dot product and both squared norms accumulate in one pass.

powermma: Batched distance matrices in spatials/ are where MMA shines most. Euclidean expands as ||a_i - b_j||^2 = ||a_i||^2 - 2*(a_i . b_j) + ||b_j||^2 — the cross-term is a rank-k outer product, exactly what xvf32gerpp / xvf16ger2pp computes. Pre-compute per-row/column squared norms, run MMA for the cross-term tile, and combine in the finalize step. This is the same approach as the existing sapphireamx backend. BF16 and INT8 distance matrices get MMA acceleration via xvbf16ger2pp and xvi8ger4spp.

set/ and sets/

powervsx only (MMA is not useful here). nk_hamming_u1_powervsx uses vec_xor + vpopcntb (native per-byte popcount). Batched sets/ variants replicate the accumulators.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
