AMDGPU: Select VGPR MFMAs by default #159493
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

AGPRs are undesirable since they are only usable by a handful of instructions like loads, stores, and mfmas; everything else requires copies to/from VGPRs. Using the AGPR form should be a measure of last resort if we must use more than 256 VGPRs.

Patch is 1.30 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159493.diff

28 Files Affected:
```diff
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index 908d856d386f5..0077c6915c520 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -37,7 +37,7 @@ static cl::opt<bool> MFMAVGPRForm(
     "amdgpu-mfma-vgpr-form", cl::Hidden,
     cl::desc("Whether to force use VGPR for Opc and Dest of MFMA. If "
              "unspecified, default to compiler heuristics"),
-    cl::init(false));
+    cl::init(true));
 
 const GCNTargetMachine &getTM(const GCNSubtarget *STI) {
   const SITargetLowering *TLI = STI->getTargetLowering();
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
index 5720b882f4e73..2493065806794 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.mfma.gfx90a.ll
@@ -15,59 +15,42 @@ define amdgpu_kernel void @test_mfma_f32_32x32x4bf16_1k(ptr addrspace(1) %arg) #
 ; GCN:       ; %bb.0: ; %bb
 ; GCN-NEXT:    s_load_dwordx2 s[34:35], s[4:5], 0x24
 ; GCN-NEXT:    s_mov_b64 s[36:37], 1
-; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[36:37], s[36:37] op_sel:[0,1]
-; GCN-NEXT:    s_mov_b32 s38, 2
-; GCN-NEXT:    s_mov_b32 s39, s37
+; GCN-NEXT:    v_pk_mov_b32 v[32:33], s[36:37], s[36:37] op_sel:[0,1]
+; GCN-NEXT:    s_mov_b32 s36, 2
+; GCN-NEXT:    v_pk_mov_b32 v[34:35], s[36:37], s[36:37] op_sel:[0,1]
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 ; GCN-NEXT:    s_load_dwordx16 s[0:15], s[34:35], 0x0
 ; GCN-NEXT:    s_load_dwordx16 s[16:31], s[34:35], 0x40
-; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[38:39], s[38:39] op_sel:[0,1]
 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    v_accvgpr_write_b32 a0, s0
-; GCN-NEXT:    v_accvgpr_write_b32 a16, s16
-; GCN-NEXT:    v_accvgpr_write_b32 a1, s1
-; GCN-NEXT:    v_accvgpr_write_b32 a2, s2
-; GCN-NEXT:    v_accvgpr_write_b32 a3, s3
-; GCN-NEXT:    v_accvgpr_write_b32 a4, s4
-; GCN-NEXT:    v_accvgpr_write_b32 a5, s5
-; GCN-NEXT:    v_accvgpr_write_b32 a6, s6
-; GCN-NEXT:    v_accvgpr_write_b32 a7, s7
-; GCN-NEXT:    v_accvgpr_write_b32 a8, s8
-; GCN-NEXT:    v_accvgpr_write_b32 a9, s9
-; GCN-NEXT:    v_accvgpr_write_b32 a10, s10
-; GCN-NEXT:    v_accvgpr_write_b32 a11, s11
-; GCN-NEXT:    v_accvgpr_write_b32 a12, s12
-; GCN-NEXT:    v_accvgpr_write_b32 a13, s13
-; GCN-NEXT:    v_accvgpr_write_b32 a14, s14
-; GCN-NEXT:    v_accvgpr_write_b32 a15, s15
-; GCN-NEXT:    v_accvgpr_write_b32 a17, s17
-; GCN-NEXT:    v_accvgpr_write_b32 a18, s18
-; GCN-NEXT:    v_accvgpr_write_b32 a19, s19
-; GCN-NEXT:    v_accvgpr_write_b32 a20, s20
-; GCN-NEXT:    v_accvgpr_write_b32 a21, s21
-; GCN-NEXT:    v_accvgpr_write_b32 a22, s22
-; GCN-NEXT:    v_accvgpr_write_b32 a23, s23
-; GCN-NEXT:    v_accvgpr_write_b32 a24, s24
-; GCN-NEXT:    v_accvgpr_write_b32 a25, s25
-; GCN-NEXT:    v_accvgpr_write_b32 a26, s26
-; GCN-NEXT:    v_accvgpr_write_b32 a27, s27
-; GCN-NEXT:    v_accvgpr_write_b32 a28, s28
-; GCN-NEXT:    v_accvgpr_write_b32 a29, s29
-; GCN-NEXT:    v_accvgpr_write_b32 a30, s30
-; GCN-NEXT:    v_accvgpr_write_b32 a31, s31
+; GCN-NEXT:    v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[2:3], s[2:3], s[2:3] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[4:5], s[4:5], s[4:5] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[6:7], s[6:7], s[6:7] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[8:9], s[8:9], s[8:9] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[10:11], s[10:11], s[10:11] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[12:13], s[12:13], s[12:13] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[14:15], s[14:15], s[14:15] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[16:17], s[16:17], s[16:17] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[18:19], s[18:19], s[18:19] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[20:21], s[20:21], s[20:21] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[22:23], s[22:23], s[22:23] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[24:25], s[24:25], s[24:25] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[26:27], s[26:27], s[26:27] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[28:29], s[28:29], s[28:29] op_sel:[0,1]
+; GCN-NEXT:    v_pk_mov_b32 v[30:31], s[30:31], s[30:31] op_sel:[0,1]
 ; GCN-NEXT:    s_nop 1
-; GCN-NEXT:    v_mfma_f32_32x32x4bf16_1k a[0:31], v[0:1], v[2:3], a[0:31] cbsz:1 abid:2 blgp:3
-; GCN-NEXT:    v_mov_b32_e32 v0, 0
+; GCN-NEXT:    v_mfma_f32_32x32x4bf16_1k v[0:31], v[32:33], v[34:35], v[0:31] cbsz:1 abid:2 blgp:3
+; GCN-NEXT:    v_mov_b32_e32 v32, 0
 ; GCN-NEXT:    s_nop 15
 ; GCN-NEXT:    s_nop 1
-; GCN-NEXT:    global_store_dwordx4 v0, a[0:3], s[34:35]
-; GCN-NEXT:    global_store_dwordx4 v0, a[4:7], s[34:35] offset:16
-; GCN-NEXT:    global_store_dwordx4 v0, a[8:11], s[34:35] offset:32
-; GCN-NEXT:    global_store_dwordx4 v0, a[12:15], s[34:35] offset:48
-; GCN-NEXT:    global_store_dwordx4 v0, a[16:19], s[34:35] offset:64
-; GCN-NEXT:    global_store_dwordx4 v0, a[20:23], s[34:35] offset:80
-; GCN-NEXT:    global_store_dwordx4 v0, a[24:27], s[34:35] offset:96
-; GCN-NEXT:    global_store_dwordx4 v0, a[28:31], s[34:35] offset:112
+; GCN-NEXT:    global_store_dwordx4 v32, v[0:3], s[34:35]
+; GCN-NEXT:    global_store_dwordx4 v32, v[4:7], s[34:35] offset:16
+; GCN-NEXT:    global_store_dwordx4 v32, v[8:11], s[34:35] offset:32
+; GCN-NEXT:    global_store_dwordx4 v32, v[12:15], s[34:35] offset:48
+; GCN-NEXT:    global_store_dwordx4 v32, v[16:19], s[34:35] offset:64
+; GCN-NEXT:    global_store_dwordx4 v32, v[20:23], s[34:35] offset:80
+; GCN-NEXT:    global_store_dwordx4 v32, v[24:27], s[34:35] offset:96
+; GCN-NEXT:    global_store_dwordx4 v32, v[28:31], s[34:35] offset:112
 ; GCN-NEXT:    s_endpgm
 bb:
   %in.1 = load <32 x float>, ptr addrspace(1) %arg
... [truncated]
```
jmmartinez left a comment
It seems that CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.bf16.ll failed on the CI.
Everything else looks good to me.
AGPRs are undesirable since they are only usable by a handful instructions like loads, stores and mfmas and everything else requires copies to/from VGPRs. Using the AGPR form should be a measure of last resort if we must use more than 256 VGPRs.
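As a minimal illustration of the selection change, the kernel below is a sketch adapted from the test_mfma_f64_4x4x4f64 test in this diff (the kernel and file names are illustrative, not from the patch); with the new default the MFMA result is selected into a VGPR pair instead of a[0:1]:

```llvm
; Sketch: compile with, e.g.
;   llc -mtriple=amdgcn -mcpu=gfx90a mfma-demo.ll
; and compare against -amdgpu-mfma-vgpr-form=0 (the old AGPR-form default).
declare double @llvm.amdgcn.mfma.f64.4x4x4f64(double, double, double, i32, i32, i32)

define amdgpu_kernel void @mfma_vgpr_demo(ptr addrspace(1) %out, double %a, double %b) {
bb:
  %mai = tail call double @llvm.amdgcn.mfma.f64.4x4x4f64(double %a, double %b, double 0.0, i32 0, i32 0, i32 0)
  store double %mai, ptr addrspace(1) %out
  ret void
}
```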
9be83ee to ec59e3c Compare

```diff
     cl::desc("Whether to force use VGPR for Opc and Dest of MFMA. If "
              "unspecified, default to compiler heuristics"),
-    cl::location(SIMachineFunctionInfo::MFMAVGPRForm), cl::init(false),
+    cl::location(SIMachineFunctionInfo::MFMAVGPRForm), cl::init(true),
```
I don't understand why we need the flag at all anymore.
Previously, we had a rudimentary heuristic:

```cpp
(ST.getMaxNumVGPRs(F) <= ST.getAddressableNumArchVGPRs() &&
```

However, after 476a6ea the heuristic was deleted. So this flag now just overrides the amdgpu-agpr-alloc attribute?
Eventually the flag should be removed. For now, some tests are relying on it, and it's a chicken bit.

> we had a rudimentary heuristic

I wouldn't call it a heuristic. This system was always just trying to avoid register allocation failures in cases where there won't be AGPRs available. The original implementation crudely tried to detect AGPR usage by checking partially constructed MIR; over time that became the more reliable IR attribute check.

> So this flag now just overrides the amdgpu-agpr-alloc attribute?

No. amdgpu-agpr-alloc only controls the register budget. All of the predicates were just trying to avoid creating AGPR uses in a function that wouldn't have AGPRs available for the instruction.
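For context, the budget control discussed here is an IR function attribute. A hypothetical fragment (the function name is illustrative; the attribute is the one named in this thread):

```llvm
; "amdgpu-agpr-alloc"="0" requests that no AGPRs be allocated for this
; function, so nothing here can use the AGPR form regardless of selection.
define amdgpu_kernel void @no_agprs() #0 {
entry:
  ret void
}

attributes #0 = { "amdgpu-agpr-alloc"="0" }
```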
But the system was always trying to use the VGPR form. Any time we were emitting AGPR MFMAs was essentially an edge case or bug.
```
; GFX90A-NEXT: global_store_dwordx4 v1, a[12:15], s[34:35] offset:48
; GFX90A-NEXT: global_store_dwordx4 v1, a[0:3], s[34:35]
; GFX90A-NEXT: global_store_dwordx4 v1, a[4:7], s[34:35] offset:16
; GFX90A-NEXT: global_store_dwordx4 v33, v[24:27], s[34:35] offset:96
```
Not suggesting we need to change anything here, but I think these changes are misleading, since the test file does not use amdgpu-agpr-alloc (which the attributor will attach).
