Conversation

@fhahn
Contributor

@fhahn fhahn commented Aug 1, 2025

Back-ports additional tests (eb9febb4a6b0, dc697de12792), refactoring (43c9c14577db) and a functional change (18f1369297f4) in a single PR.

#114990 allowed more aggressive tail duplication for computed-gotos in both pre- and post-regalloc tail duplication.

In some cases, performing tail-duplication too early can lead to worse results, especially if we duplicate blocks with a number of phi nodes.

This is causing a ~3% performance regression in some workloads using Python 3.12.

This patch updates TailDup to delay aggressive tail-duplication for computed gotos to after register allocation.

This means we can keep the non-duplicated version for a bit longer throughout the backend, which should reduce compile time as well as allow a number of optimizations and simplifications to trigger before drastically expanding the CFG.

For the case in #106846, I get the same performance with and without this patch on Skylake.

PR: #150911
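For readers unfamiliar with the pattern under discussion, below is a minimal computed-goto interpreter sketch (GNU C++ labels-as-values extension, supported by clang and gcc; hypothetical code, not taken from CPython or the tests in this patch). Each handler ends in the same indirect jump through a dispatch table; the mid-end typically factors these identical indirect branches into a single dispatch block, and the aggressive tail duplication discussed here copies ("unfactors") that block back into every handler.

    #include <cstdio>

    // Minimal, hypothetical computed-goto interpreter. &&label takes the address
    // of a label and `goto *p` jumps to it, which lowers to LLVM's `indirectbr`.
    static int run(const unsigned char *pc) {
      static void *const dispatch[] = {&&op_halt, &&op_inc, &&op_dec};
      int acc = 0;
      // Every handler ends with the same indirect jump through the dispatch table.
      goto *dispatch[*pc++];
    op_inc:
      ++acc;
      goto *dispatch[*pc++];
    op_dec:
      --acc;
      goto *dispatch[*pc++];
    op_halt:
      return acc;
    }

    int main() {
      const unsigned char prog[] = {1, 1, 2, 0}; // inc, inc, dec, halt
      std::printf("%d\n", run(prog));            // prints 1
      return 0;
    }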

@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Florian Hahn (fhahn)

Changes

Back-ports additional tests (eb9febb4a6b0, dc697de12792), refactoring (43c9c14577db) and a functional change (18f1369297f4) in a single PR.

#114990 allowed more aggressive tail duplication for computed-gotos in both pre- and post-regalloc tail duplication.

In some cases, performing tail-duplication too early can lead to worse results, especially if we duplicate blocks with a number of phi nodes.

This is causing a ~3% performance regression in some workloads using Python 3.12.

This patch updates TailDup to delay aggressive tail-duplication for computed gotos to after register allocation.

This means we can keep the non-duplicated version for a bit longer throughout the backend, which should reduce compile time as well as allow a number of optimizations and simplifications to trigger before drastically expanding the CFG.

For the case in #106846, I get the same performance with and without this patch on Skylake.

PR: #150911


Patch is 22.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151680.diff

5 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineBasicBlock.h (+5-4)
  • (modified) llvm/lib/CodeGen/TailDuplicator.cpp (+11-7)
  • (added) llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll (+143)
  • (renamed) llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir (+23-21)
  • (added) llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir (+128)
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 938d71dd030e8..9e3d9196cc184 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -323,10 +323,11 @@ class MachineBasicBlock
   const MachineFunction *getParent() const { return xParent; }
   MachineFunction *getParent() { return xParent; }
 
-  /// Returns true if the original IR terminator is an `indirectbr`. This
-  /// typically corresponds to a `goto` in C, rather than jump tables.
-  bool terminatorIsComputedGoto() const {
-    return back().isIndirectBranch() &&
+  /// Returns true if the original IR terminator is an `indirectbr` with
+  /// successor blocks. This typically corresponds to a `goto` in C, rather than
+  /// jump tables.
+  bool terminatorIsComputedGotoWithSuccessors() const {
+    return back().isIndirectBranch() && !succ_empty() &&
           llvm::all_of(successors(), [](const MachineBasicBlock *Succ) {
             return Succ->isIRBlockAddressTaken();
           });
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index a88c57fdc165a..5d720fbbf1c61 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -604,12 +604,21 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
   bool HasComputedGoto = false;
   if (!TailBB.empty()) {
     HasIndirectbr = TailBB.back().isIndirectBranch();
-    HasComputedGoto = TailBB.terminatorIsComputedGoto();
+    HasComputedGoto = TailBB.terminatorIsComputedGotoWithSuccessors();
   }
 
   if (HasIndirectbr && PreRegAlloc)
     MaxDuplicateCount = TailDupIndirectBranchSize;
 
+  // Allow higher limits when the block has computed-gotos and running after
+  // register allocation. NB. This basically unfactors computed gotos that were
+  // factored early on in the compilation process to speed up edge based data
+  // flow. If we do not unfactor them again, it can seriously pessimize code
+  // with many computed jumps in the source code, such as interpreters.
+  // Therefore we do not restrict the computed gotos.
+  if (HasComputedGoto && !PreRegAlloc)
+    MaxDuplicateCount = std::max(MaxDuplicateCount, 10u);
+
   // Check the instructions in the block to determine whether tail-duplication
   // is invalid or unlikely to be profitable.
   unsigned InstrCount = 0;
@@ -663,12 +672,7 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
   // Duplicating a BB which has both multiple predecessors and successors will
   // may cause huge amount of PHI nodes. If we want to remove this limitation,
   // we have to address https://github.com/llvm/llvm-project/issues/78578.
-  // NB. This basically unfactors computed gotos that were factored early on in
-  // the compilation process to speed up edge based data flow. If we do not
-  // unfactor them again, it can seriously pessimize code with many computed
-  // jumps in the source code, such as interpreters. Therefore we do not
-  // restrict the computed gotos.
-  if (!HasComputedGoto && TailBB.pred_size() > TailDupPredSize &&
+  if (PreRegAlloc && TailBB.pred_size() > TailDupPredSize &&
       TailBB.succ_size() > TailDupSuccSize) {
     // If TailBB or any of its successors contains a phi, we may have to add a
     // large number of additional phis with additional incoming values.
diff --git a/llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll b/llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll new file mode 100644 index 0000000000000..381904f776604 --- /dev/null +++ b/llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll @@ -0,0 +1,143 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 +; RUN: llc -tail-dup-pred-size=2 -tail-dup-succ-size=2 -o - %s | FileCheck %s + +target triple = "arm64-apple-macosx13.0.0" + +@opcode.targets = local_unnamed_addr constant [6 x ptr] [ptr blockaddress(@test_interp, %op1.bb), ptr blockaddress(@test_interp, %op6.bb), ptr blockaddress(@test_interp, %loop.header), ptr blockaddress(@test_interp, %op2.bb), ptr blockaddress(@test_interp, %op4.bb), ptr blockaddress(@test_interp, %op5.bb)] + +define void @test_interp(ptr %frame, ptr %dst) { +; CHECK-LABEL: test_interp: +; CHECK: ; %bb.0: ; %entry +; CHECK-NEXT: stp x24, x23, [sp, #-64]! ; 16-byte Folded Spill +; CHECK-NEXT: stp x22, x21, [sp, #16] ; 16-byte Folded Spill +; CHECK-NEXT: stp x20, x19, [sp, #32] ; 16-byte Folded Spill +; CHECK-NEXT: stp x29, x30, [sp, #48] ; 16-byte Folded Spill +; CHECK-NEXT: .cfi_def_cfa_offset 64 +; CHECK-NEXT: .cfi_offset w30, -8 +; CHECK-NEXT: .cfi_offset w29, -16 +; CHECK-NEXT: .cfi_offset w19, -24 +; CHECK-NEXT: .cfi_offset w20, -32 +; CHECK-NEXT: .cfi_offset w21, -40 +; CHECK-NEXT: .cfi_offset w22, -48 +; CHECK-NEXT: .cfi_offset w23, -56 +; CHECK-NEXT: .cfi_offset w24, -64 +; CHECK-NEXT: Lloh0: +; CHECK-NEXT: adrp x21, _opcode.targets@PAGE +; CHECK-NEXT: Lloh1: +; CHECK-NEXT: add x21, x21, _opcode.targets@PAGEOFF +; CHECK-NEXT: mov x24, xzr +; CHECK-NEXT: add x8, x21, xzr, lsl #3 +; CHECK-NEXT: mov x19, x1 +; CHECK-NEXT: mov x20, x0 +; CHECK-NEXT: mov x23, xzr +; CHECK-NEXT: mov w22, #1 ; =0x1 +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp0: ; Block address taken +; CHECK-NEXT: LBB0_1: ; %loop.header +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: mov x20, xzr +; CHECK-NEXT: mov x23, xzr +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp1: ; Block address taken +; CHECK-NEXT: LBB0_2: ; %op1.bb +; CHECK-NEXT: str xzr, [x19] +; CHECK-NEXT: Ltmp2: ; Block address taken +; CHECK-NEXT: LBB0_3: ; %op6.bb +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: ldr x0, [x20, #-8]! 
+; CHECK-NEXT: ldr x8, [x0, #8] +; CHECK-NEXT: str x22, [x0] +; CHECK-NEXT: ldr x8, [x8, #48] +; CHECK-NEXT: blr x8 +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp3: ; Block address taken +; CHECK-NEXT: LBB0_4: ; %op2.bb +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: mov x20, xzr +; CHECK-NEXT: str x23, [x19] +; CHECK-NEXT: mov x23, xzr +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp4: ; Block address taken +; CHECK-NEXT: LBB0_5: ; %op4.bb +; CHECK-NEXT: Ltmp5: ; Block address taken +; CHECK-NEXT: LBB0_6: ; %op5.bb +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: str x23, [x19] +; CHECK-NEXT: ldur x8, [x23, #12] +; CHECK-NEXT: ldur x9, [x20, #-8] +; CHECK-NEXT: add x23, x23, #20 +; CHECK-NEXT: stp x8, x9, [x20, #-8] +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: add x20, x20, #8 +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: .loh AdrpAdd Lloh0, Lloh1 +entry: + br label %loop.header + +loop.header: + %iv = phi i64 [ 0, %entry ], [ %iv.next, %op1.bb ], [ %iv.next, %op2.bb ], [ %iv.next, %op4.bb ], [ %iv.next, %op5.bb ], [ %iv.next, %op6.bb ], [ %iv.next, %loop.header ] + %stack.pointer = phi ptr [ %frame, %entry ], [ %stack.8, %op1.bb ], [ null, %op2.bb ], [ %stack.next, %op4.bb ], [ %stack.next.2, %op5.bb ], [ %stack.4, %op6.bb ], [ null, %loop.header ] + %next.instr = phi ptr [ null, %entry ], [ %next.instr, %op1.bb ], [ null, %op2.bb ], [ %next.instr.20, %op4.bb ], [ %next.instr.21, %op5.bb ], [ %next.instr, %op6.bb ], [ null, %loop.header ] + %iv.next = add i64 %iv, 1 + %next_op = getelementptr [6 x ptr], ptr @opcode.targets, i64 0, i64 %iv + indirectbr ptr %next_op, [label %op1.bb, label %op6.bb, label %loop.header, label %op2.bb, label %op4.bb, label %op5.bb] + +op1.bb: + store ptr null, ptr %dst, align 8 + %stack.8 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.0 = load ptr, ptr %stack.8, align 8 + store i64 1, ptr %l.0, align 8 + %gep.0 = getelementptr i8, ptr %l.0, i64 8 + %l.1 = load ptr, ptr %gep.0, align 8 + %gep.1 = getelementptr i8, ptr %l.1, i64 48 + %l.2 = load ptr, ptr %gep.1, align 8 + tail call void %l.2(ptr nonnull %l.0) + br label %loop.header + +op2.bb: + store ptr %next.instr, ptr %dst, align 8 + br label %loop.header + +op4.bb: + store ptr %next.instr, ptr %dst, align 8 + %next.instr.20 = getelementptr i8, ptr %next.instr, i64 20 + %stack.2 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.3 = load ptr, ptr %stack.2, align 8 + %next.instr.12 = getelementptr i8, ptr %next.instr, i64 12 + %next.instr.12.val = load ptr, ptr %next.instr.12, align 2 + store ptr %next.instr.12.val, ptr %stack.2, align 8 + store ptr %l.3, ptr %stack.pointer, align 8 + %stack.next = getelementptr i8, ptr %stack.pointer, i64 8 + br label %loop.header + +op5.bb: + store ptr %next.instr, ptr %dst, align 8 + %next.instr.21 = getelementptr i8, ptr %next.instr, i64 20 + %stack.3 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.4 = load ptr, ptr %stack.3, align 8 + %next.instr.2 = getelementptr i8, ptr %next.instr, i64 12 + %next.instr.2.val = load ptr, ptr %next.instr.2, align 2 + store ptr %next.instr.2.val, ptr %stack.3, align 8 + store ptr %l.4, ptr %stack.pointer, align 8 + %stack.next.2 = getelementptr i8, ptr %stack.pointer, i64 8 + br label %loop.header + +op6.bb: + %stack.4 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.5 = load ptr, ptr %stack.4, align 8 + store i64 1, ptr %l.5, 
align 8 + %gep.5 = getelementptr i8, ptr %l.5, i64 8 + %l.6 = load ptr, ptr %gep.5, align 8 + %gep.6 = getelementptr i8, ptr %l.6, i64 48 + %l.7 = load ptr, ptr %gep.6, align 8 + tail call void %l.7(ptr nonnull %l.5) + br label %loop.header +} diff --git a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir similarity index 93% rename from llvm/test/CodeGen/X86/tail-dup-computed-goto.mir rename to llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir index 17de405928d37..0f2896463a8af 100644 --- a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir +++ b/llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir @@ -1,6 +1,8 @@ # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5 # RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=early-tailduplication -tail-dup-pred-size=1 -tail-dup-succ-size=1 %s -o - | FileCheck %s -# Check that only the computed goto is not be restrict by tail-dup-pred-size and tail-dup-succ-size. +# +# Check that only the computed goto and others are restricted by tail-dup-pred-size and tail-dup-succ-size. +# --- | @computed_goto.dispatch = constant [5 x ptr] [ptr null, ptr blockaddress(@computed_goto, %bb1), ptr blockaddress(@computed_goto, %bb2), ptr blockaddress(@computed_goto, %bb3), ptr blockaddress(@computed_goto, %bb4)] declare i64 @f0() @@ -30,54 +32,54 @@ tracksRegLiveness: true body: | ; CHECK-LABEL: name: computed_goto ; CHECK: bb.0: - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.1.bb1 (ir-block-address-taken %ir-block.bb1): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64_nosp = COPY [[COPY2]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.2.bb2 (ir-block-address-taken %ir-block.bb2): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: 
%bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gr64_nosp = COPY [[COPY4]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY4]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.3.bb3 (ir-block-address-taken %ir-block.bb3): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY6:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY7:%[0-9]+]]:gr64_nosp = COPY [[COPY6]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY6]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.4.bb4 (ir-block-address-taken %ir-block.bb4): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY8:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY9:%[0-9]+]]:gr64_nosp = COPY [[COPY8]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY8]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.5: + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: [[PHI:%[0-9]+]]:gr64_nosp = PHI [[COPY]], %bb.0, [[COPY4]], %bb.4, [[COPY3]], %bb.3, [[COPY2]], %bb.2, [[COPY1]], %bb.1 + ; CHECK-NEXT: JMP64m $noreg, 8, [[PHI]], @computed_goto.dispatch, $noreg bb.0: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax diff --git a/llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir new file mode 100644 index 0000000000000..e272e7ee3cb0f --- /dev/null +++ b/llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir 
@@ -0,0 +1,128 @@ +# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5 +# RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=tailduplication -tail-dup-pred-size=1 -tail-dup-succ-size=1 %s -o - | FileCheck %s +# +# Check that only the computed gotos are duplicated aggressively. +# +--- | + @computed_goto.dispatch = constant [5 x ptr] [ptr null, ptr blockaddress(@computed_goto, %bb1), ptr blockaddress(@computed_goto, %bb2), ptr blockaddress(@computed_goto, %bb3), ptr blockaddress(@computed_goto, %bb4)] + declare i64 @f0() + declare i64 @f1() + declare i64 @f2() + declare i64 @f3() + declare i64 @f4() + declare i64 @f5() + define void @computed_goto() { + start: + ret void + bb1: + ret void + bb2: + ret void + bb3: + ret void + bb4: + ret void + } + define void @jump_table() { ret void } + define void @jump_table_pic() { ret void } +... +--- +name: computed_goto +alignment: 1 +exposesReturnsTwice: false +legalized: false +regBankSelected: false +selected: false +failedISel: false +tracksRegLiveness: true +body: | + ; CHECK-LABEL: name: computed_goto + ; CHECK: bb.0: + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]] + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]] + ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.1.bb1 (ir-block-address-taken %ir-block.bb1): + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY3]] + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]] + ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.2.bb2 (ir-block-address-taken %ir-block.bb2): + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY4]] + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]] + ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.3.bb3 (ir-block-address-taken %ir-block.bb3): + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gr64 = COPY $rax + ;... [truncated] 
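To summarize how the duplication limits in the TailDuplicator.cpp hunk interact after this change, here is a condensed stand-alone sketch (names mirror the diff above; DefaultLimit and the parameter values stand in for the existing option-driven defaults and are not part of this patch):

    #include <algorithm>

    // Condensed sketch of the limit selection in shouldTailDuplicate after this
    // patch: pre-RA, blocks ending in indirect branches get the modest
    // TailDupIndirectBranchSize limit; post-RA, computed-goto blocks may be
    // duplicated more aggressively.
    static unsigned selectMaxDuplicateCount(bool HasIndirectbr, bool HasComputedGoto,
                                            bool PreRegAlloc, unsigned DefaultLimit,
                                            unsigned TailDupIndirectBranchSize) {
      unsigned MaxDuplicateCount = DefaultLimit;
      if (HasIndirectbr && PreRegAlloc)
        MaxDuplicateCount = TailDupIndirectBranchSize;
      if (HasComputedGoto && !PreRegAlloc)
        MaxDuplicateCount = std::max(MaxDuplicateCount, 10u);
      return MaxDuplicateCount;
    }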
@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-backend-x86

@fhahn
Contributor Author

fhahn commented Aug 1, 2025

For context, this fixes a regression in the Python interpreter on AArch64 compared to 20.1.0

Member

@dianqk dianqk left a comment


I expect a double-check for #106846 before the merge. Sorry I'm too busy to do this.

@github-project-automation github-project-automation bot moved this from Needs Triage to Needs Merge in LLVM Release Status Aug 1, 2025
@fhahn
Contributor Author

fhahn commented Aug 5, 2025

I expect a double-check for #106846 before the merge. Sorry I'm too busy to do this.

Sounds good! My understanding from @mikulas-patocka's response in #106846 is that the current patch didn't regress their use case.

For the reproducer shared there, I am also not seeing any regression on X86 Skylake (the only X86 system I have access to).

@tru
Collaborator

tru commented Aug 5, 2025

I will hold off on merging this until we have confirmation that we are good in that other ticket.

@tru
Collaborator

tru commented Aug 8, 2025

Seems like we are still waiting for confirmation on this one?

@fhahn
Contributor Author

fhahn commented Aug 8, 2025

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.

Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

@mikulas-patocka

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.

Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

There was a regression: the pointless move of %r14 to %r15 and back, and the load of %rsi from the stack, that I mentioned in #106846 are actually a regression that was recently added to git. With older clang-22 from Debian Sid (version 1:22~++20250731080150+be449d6b6587-1~exp1), these pointless instructions are not generated.

@fhahn
Contributor Author

fhahn commented Aug 9, 2025

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.
Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

There was a regression: the pointless move of %r14 to %r15 and back, and the load of %rsi from the stack, that I mentioned in #106846 are actually a regression that was recently added to git. With older clang-22 from Debian Sid (version 1:22~++20250731080150+be449d6b6587-1~exp1), these pointless instructions are not generated.

Right, but was there a runtime regression? After all, without the patch, Python on ARM64 macOS regresses by 2-3% for end-to-end workloads.

@mikulas-patocka

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.
Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

There was a regression: the pointless move of %r14 to %r15 and back, and the load of %rsi from the stack, that I mentioned in #106846 are actually a regression that was recently added to git. With older clang-22 from Debian Sid (version 1:22~++20250731080150+be449d6b6587-1~exp1), these pointless instructions are not generated.

Right, but was there a runtime regression? After all, without the patch, Python on ARM64 macOS regresses by 2-3% for end-to-end workloads.

There was a 9% runtime regression on a loop that just adds numbers (so it stresses this code path significantly). There was no measurable regression on Ajla self-compilation (the difference is just 0.8%, which may be jitter).

@fhahn
Contributor Author

fhahn commented Aug 11, 2025

There was a 9% runtime regression on a loop that just adds numbers (so it stresses this code path significantly). There was no measurable regression on Ajla self-compilation (the difference is just 0.8%, which may be jitter).

Thanks for clarifying! The regression seems like an X86 backend issue, not directly related to tail duplication for computed gotos. Given that there is no (or only a very small) end-to-end regression, I think the change should be included in the release, or an argument should be made for reverting/adjusting #150911 (the version of the PR that landed on main).

@tru
Collaborator

tru commented Aug 11, 2025

@RKSimon @phoebewang Since this is related to the X86 backend, what do you think about this PR, or should we maybe revert #150911?

@phoebewang
Contributor

Compared with immediate failures, performance regressions are hard to find and locate. I think that's the reason why we don't backport performance-related patches. So I suggest not backporting it, especially since there are known regression patterns. It's not persuasive that a single application doesn't regress. OTOH, I think it's OK to keep it in main. We will know if there's a large effect as time passes.

@fhahn
Contributor Author

fhahn commented Aug 14, 2025

Compared with immediate failures, performance regressions are hard to find and locate. I think that's the reason why we don't backport performance-related patches.

We can usually back-port patches to fix performance regressions, which this patch does (a 3% regression for Python workloads vs 20.1.0). For example, the patch that caused the regression was picked for 20.1.1, IIRC.

@fhahn
Contributor Author

fhahn commented Aug 19, 2025

Ping :)

If the X86 side of things is a blocker, we could also make it apply only to AArch64 or Apple platforms?

@phoebewang
Contributor

Ping :)

If the X86 side of things is a blocker, we could also make it apply only to AArch64 or Apple platforms?

Yeah, that sounds good to me.

…lvm#150911)

Back-ports additional tests (eb9febb4a6b0, dc697de12792), refactoring (43c9c14577db) and a functional change (18f1369297f4) in a single PR.

llvm#114990 allowed more aggressive tail duplication for computed-gotos in both pre- and post-regalloc tail duplication.

In some cases, performing tail-duplication too early can lead to worse results, especially if we duplicate blocks with a number of phi nodes.

This is causing a ~3% performance regression in some workloads using Python 3.12.

This patch updates TailDup to delay aggressive tail-duplication for computed gotos to after register allocation.

This means we can keep the non-duplicated version for a bit longer throughout the backend, which should reduce compile time as well as allow a number of optimizations and simplifications to trigger before drastically expanding the CFG.

For the case in llvm#106846, I get the same performance with and without this patch on Skylake.

PR: llvm#150911
@fhahn fhahn force-pushed the tail-dup-computed-goto-21.x branch from 0a2665f to c587c24 on August 20, 2025 at 10:07
@fhahn
Contributor Author

fhahn commented Aug 20, 2025

Ping :)
If the X86 side of things is a blocker, we could also make it apply only to AArch64 or Apple platforms?

Yeah, that sounds good to me.

Sounds good, I updated the PR to restrict this to just Apple platforms for now, for which we did careful perf evaluations.
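For illustration only, such a gate could look roughly like the following (a hypothetical sketch; the actual updated PR may key the restriction differently, for example on a different predicate or in a different place):

    #include "llvm/TargetParser/Triple.h"
    #include <algorithm>

    // Hypothetical sketch: apply the aggressive post-RA limit only on Apple
    // (Darwin) targets. Names follow the TailDuplicator.cpp hunk above; this is
    // not necessarily how the backported change implements the restriction.
    static unsigned applyComputedGotoLimit(unsigned MaxDuplicateCount,
                                           bool HasComputedGoto, bool PreRegAlloc,
                                           const llvm::Triple &TT) {
      if (HasComputedGoto && !PreRegAlloc && TT.isOSDarwin())
        return std::max(MaxDuplicateCount, 10u);
      return MaxDuplicateCount;
    }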

Contributor

@phoebewang phoebewang left a comment


LGTM.

@tru tru merged commit c587c24 into llvm:release/21.x Aug 21, 2025
13 checks passed
@github-project-automation github-project-automation bot moved this from Needs Merge to Done in LLVM Release Status Aug 21, 2025
@github-actions

@fhahn (or anyone else): if you would like to add a note about this fix in the release notes (completely optional), please reply to this comment with a one- or two-sentence description of the fix. When you are done, please add the release:note label to this PR.

@fhahn fhahn deleted the tail-dup-computed-goto-21.x branch August 21, 2025 14:25