Description
Describe what you are looking for
Power ISA v3.1 (POWER10/POWER11) has two SIMD tiers: VSX (128-bit, 64 registers) and MMA (512-bit accumulators, 4x4 outer-product tiles). MMA is architecturally similar to AMX but supports a richer type menu (FP16, BF16, FP32, INT8, INT4). VSX has native BF16<->FP32 conversion (xvcvbf16spn / xvcvspbf16), avoiding manual upcast. POWER11 (shipped July 2025) doubles SIMD throughput and quadruples MMA throughput per processor vs POWER10, same ISA. A future Dense Math Facility (DMF) will widen accumulators to 1024-bit DMRs, but toolchain support is still early.
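To make the "no manual upcast" point concrete, here is a portable scalar model of what the native conversion does per lane. Widening BF16 to FP32 is exact, since BF16 is just the top 16 bits of an FP32 value; backends without the native instruction have to do this shift-and-reinterpret themselves for every lane. The function name is illustrative, not from the library:

```c
#include <stdint.h>
#include <string.h>

/* Portable per-lane model of the BF16 -> FP32 widening that
 * xvcvbf16spn does natively across a whole VSX register. The
 * widening direction is exact: BF16 is the high 16 bits of FP32. */
static inline float bf16_to_f32(uint16_t bf16_bits) {
    uint32_t f32_bits = (uint32_t)bf16_bits << 16;
    float result;
    memcpy(&result, &f32_bits, sizeof result); /* bit reinterpret */
    return result;
}
```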
Two separate backends: powervsx for single-pair operations using 128-bit vectors, powermma for batched operations using the 4x4 accumulator tiles.
dot/ and dots/
powervsx: nk_dot_f32_powervsx uses vec_madd (4 f32 lanes). nk_dot_f64_powervsx at 2 lanes. nk_dot_bf16_powervsx loads bf16, converts via xvcvbf16spn to f32, then vec_madd — no bit manipulation needed. Integer dot products use vec_mule / vec_mulo for widening i8->i16->i32 accumulation. Complex variants use vec_xor for sign-flip.
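The f32 path above can be sketched as a portable scalar model: four independent partial sums standing in for the four lanes of one 128-bit VSX register, each updated with a fused multiply-add (what `vec_madd` does per lane), reduced once after the loop. The function name is illustrative, not the library's:

```c
#include <stddef.h>

/* Scalar model of the 4-lane vec_madd dot-product pattern:
 * acc[l] plays the role of lane l of a vector accumulator. */
static float dot_f32_4lane(const float *a, const float *b, size_t n) {
    float acc[4] = {0.f, 0.f, 0.f, 0.f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t l = 0; l < 4; ++l)
            acc[l] += a[i + l] * b[i + l]; /* models one vec_madd */
    float sum = acc[0] + acc[1] + acc[2] + acc[3]; /* lane reduction */
    for (; i < n; ++i)
        sum += a[i] * b[i]; /* scalar tail */
    return sum;
}
```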
powermma: The primary value. nk_dot_f16x4_powermma loads two 128-bit FP16 vectors, calls xvf16ger2pp to update a 4x4 f32 accumulator, extracts via xxmfacc after the depth loop. Computes 16 dot products simultaneously — natural fit for the dots/ state machine. BF16 uses xvbf16ger2pp, INT8 uses xvi8ger4spp (rank-4, 16 i8 elements per update), INT4 uses xvi4ger8pp (32 nibbles per update). These are the highest-throughput kernels for batched quantized operations on POWER.
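A portable model of a single `xvf16ger2pp` step may help here (an assumption based on the ISA description of the rank-2 GER instructions, not the library's code): each input vector holds four (even, odd) element pairs, and the 4x4 f32 accumulator receives a rank-2 outer-product update. Looping this over the depth dimension and then extracting the tile (`xxmfacc` in hardware) yields the 16 dot products at once:

```c
/* Scalar model of one rank-2 MMA update: x and y each carry four
 * element pairs; acc[i][j] accumulates the i-th row dot the j-th
 * column, two depth elements per call. */
static void mma_f16ger2pp_model(float acc[4][4],
                                const float x[4][2],
                                const float y[4][2]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            acc[i][j] += x[i][0] * y[j][0] + x[i][1] * y[j][1];
}
```

The INT8 (`xvi8ger4spp`, rank-4) and INT4 (`xvi4ger8pp`, rank-8) forms follow the same shape with more depth elements folded into each update.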
spatial/ and spatials/
powervsx: nk_euclidean_f32_powervsx uses vec_sub + vec_madd. The BF16 path benefits from the native conversion: no manual upcast, unlike LASX or pre-Genoa x86. Cosine uses a triple-accumulator scheme.
powermma: Batched distance matrices in spatials/ are where MMA shines most. Squared Euclidean distance expands as ||a_i - b_j||^2 = ||a_i||^2 - 2*(a_i . b_j) + ||b_j||^2; the cross-term is a rank-k outer product, exactly what xvf32gerpp / xvf16ger2pp computes. Pre-compute per-row/column squared norms, run MMA for the cross-term tile, and combine in the finalize step. Same approach as the existing sapphireamx backend. BF16 and INT8 distance matrices get MMA acceleration via xvbf16ger2pp and xvi8ger4spp.
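The decomposition above can be modeled in portable C for one 4x4 distance tile: precompute squared norms, accumulate the cross-term a_i·b_j as a depth-wise series of rank-1 outer products (what the GER instructions do per step, several depth elements at a time), then combine in a finalize pass. All names are illustrative:

```c
#include <stddef.h>

/* a and b each hold 4 rows of k floats, row-major.
 * out[i][j] = squared Euclidean distance between a's row i and b's row j. */
static void sqeuclidean_4x4_tile(float out[4][4], const float *a,
                                 const float *b, size_t k) {
    float norm_a[4] = {0}, norm_b[4] = {0}, cross[4][4] = {{0}};
    for (size_t d = 0; d < k; ++d) {            /* depth loop */
        for (int i = 0; i < 4; ++i) norm_a[i] += a[i * k + d] * a[i * k + d];
        for (int j = 0; j < 4; ++j) norm_b[j] += b[j * k + d] * b[j * k + d];
        for (int i = 0; i < 4; ++i)             /* rank-1 outer product, */
            for (int j = 0; j < 4; ++j)         /* models one GER update */
                cross[i][j] += a[i * k + d] * b[j * k + d];
    }
    for (int i = 0; i < 4; ++i)                 /* finalize */
        for (int j = 0; j < 4; ++j)
            out[i][j] = norm_a[i] - 2.f * cross[i][j] + norm_b[j];
}
```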
set/ and sets/
powervsx only (MMA not useful here). nk_hamming_u1_powervsx uses vec_xor + vpopcntb (native popcount per byte). Batched sets/ variants replicate accumulators.
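The Hamming kernel reduces to XOR plus population count; a portable sketch of the same pattern 64 bits at a time (the VSX version does it 128 bits at a time with vec_xor and vpopcntb, then sums the per-byte counts) looks like this. `__builtin_popcountll` is GCC/Clang-specific; the function name is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Hamming distance over packed bit vectors: XOR finds differing
 * bits, popcount tallies them per word. */
static size_t hamming_u1(const uint64_t *a, const uint64_t *b, size_t words) {
    size_t distance = 0;
    for (size_t i = 0; i < words; ++i)
        distance += (size_t)__builtin_popcountll(a[i] ^ b[i]);
    return distance;
}
```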
Can you contribute to the implementation?
- I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct