Conversation

@fhahn
Contributor

@fhahn fhahn commented Aug 1, 2025

Back-ports additional tests (eb9febb4a6b0, dc697de12792), refactoring (43c9c14577db) and a functional change (18f1369297f4) in a single PR.

#114990 allowed more aggressive tail duplication for computed-gotos in both pre- and post-regalloc tail duplication.

In some cases, performing tail-duplication too early can lead to worse results, especially if we duplicate blocks with a number of phi nodes.

This is causing a ~3% performance regression in some workloads using Python 3.12.

This patch updates TailDup to delay aggressive tail-duplication for computed gotos to after register allocation.

This means we can keep the non-duplicated version for a bit longer throughout the backend, which should reduce compile time as well as allow a number of optimizations and simplifications to trigger before drastically expanding the CFG.

For the case in #106846, I get the same performance with and without this patch on Skylake.

PR: #150911
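For readers unfamiliar with the pattern under discussion, below is a minimal computed-goto interpreter sketch (GNU C++ labels-as-values extension, supported by clang and gcc; hypothetical code, not taken from CPython or the tests in this patch). Each handler ends in the same indirect jump through a dispatch table; the mid-end typically factors these identical indirect branches into a single dispatch block, and the aggressive tail duplication discussed here copies ("unfactors") that block back into every handler.

    #include <cstdio>

    // Minimal, hypothetical computed-goto interpreter. &&label takes the address
    // of a label and `goto *p` jumps to it, which lowers to LLVM's `indirectbr`.
    static int run(const unsigned char *pc) {
      static void *const dispatch[] = {&&op_halt, &&op_inc, &&op_dec};
      int acc = 0;
      // Every handler ends with the same indirect jump through the dispatch table.
      goto *dispatch[*pc++];
    op_inc:
      ++acc;
      goto *dispatch[*pc++];
    op_dec:
      --acc;
      goto *dispatch[*pc++];
    op_halt:
      return acc;
    }

    int main() {
      const unsigned char prog[] = {1, 1, 2, 0}; // inc, inc, dec, halt
      std::printf("%d\n", run(prog));            // prints 1
      return 0;
    }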

@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Florian Hahn (fhahn)

Changes

Back-ports additional tests (eb9febb4a6b0, dc697de12792), refactoring (43c9c14577db) and a functional change (18f1369297f4) in a single PR.

#114990 allowed more aggressive tail duplication for computed-gotos in both pre- and post-regalloc tail duplication.

In some cases, performing tail-duplication too early can lead to worse results, especially if we duplicate blocks with a number of phi nodes.

This is causing a ~3% performance regression in some workloads using Python 3.12.

This patch updates TailDup to delay aggressive tail-duplication for computed gotos to after register allocation.

This means we can keep the non-duplicated version for a bit longer throughout the backend, which should reduce compile time as well as allow a number of optimizations and simplifications to trigger before drastically expanding the CFG.

For the case in #106846, I get the same performance with and without this patch on Skylake.

PR: #150911


Patch is 22.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151680.diff

5 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineBasicBlock.h (+5-4)
  • (modified) llvm/lib/CodeGen/TailDuplicator.cpp (+11-7)
  • (added) llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll (+143)
  • (renamed) llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir (+23-21)
  • (added) llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir (+128)
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 938d71dd030e8..9e3d9196cc184 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -323,10 +323,11 @@ class MachineBasicBlock
   const MachineFunction *getParent() const { return xParent; }
   MachineFunction *getParent() { return xParent; }
 
-  /// Returns true if the original IR terminator is an `indirectbr`. This
-  /// typically corresponds to a `goto` in C, rather than jump tables.
-  bool terminatorIsComputedGoto() const {
-    return back().isIndirectBranch() &&
+  /// Returns true if the original IR terminator is an `indirectbr` with
+  /// successor blocks. This typically corresponds to a `goto` in C, rather than
+  /// jump tables.
+  bool terminatorIsComputedGotoWithSuccessors() const {
+    return back().isIndirectBranch() && !succ_empty() &&
           llvm::all_of(successors(), [](const MachineBasicBlock *Succ) {
             return Succ->isIRBlockAddressTaken();
           });
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index a88c57fdc165a..5d720fbbf1c61 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -604,12 +604,21 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
   bool HasComputedGoto = false;
   if (!TailBB.empty()) {
     HasIndirectbr = TailBB.back().isIndirectBranch();
-    HasComputedGoto = TailBB.terminatorIsComputedGoto();
+    HasComputedGoto = TailBB.terminatorIsComputedGotoWithSuccessors();
   }
 
   if (HasIndirectbr && PreRegAlloc)
     MaxDuplicateCount = TailDupIndirectBranchSize;
 
+  // Allow higher limits when the block has computed-gotos and running after
+  // register allocation. NB. This basically unfactors computed gotos that were
+  // factored early on in the compilation process to speed up edge based data
+  // flow. If we do not unfactor them again, it can seriously pessimize code
+  // with many computed jumps in the source code, such as interpreters.
+  // Therefore we do not restrict the computed gotos.
+  if (HasComputedGoto && !PreRegAlloc)
+    MaxDuplicateCount = std::max(MaxDuplicateCount, 10u);
+
   // Check the instructions in the block to determine whether tail-duplication
   // is invalid or unlikely to be profitable.
   unsigned InstrCount = 0;
@@ -663,12 +672,7 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
   // Duplicating a BB which has both multiple predecessors and successors will
   // may cause huge amount of PHI nodes. If we want to remove this limitation,
   // we have to address https://github.com/llvm/llvm-project/issues/78578.
-  // NB. This basically unfactors computed gotos that were factored early on in
-  // the compilation process to speed up edge based data flow. If we do not
-  // unfactor them again, it can seriously pessimize code with many computed
-  // jumps in the source code, such as interpreters. Therefore we do not
-  // restrict the computed gotos.
-  if (!HasComputedGoto && TailBB.pred_size() > TailDupPredSize &&
+  if (PreRegAlloc && TailBB.pred_size() > TailDupPredSize &&
       TailBB.succ_size() > TailDupSuccSize) {
     // If TailBB or any of its successors contains a phi, we may have to add a
     // large number of additional phis with additional incoming values.
diff --git a/llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll b/llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll new file mode 100644 index 0000000000000..381904f776604 --- /dev/null +++ b/llvm/test/CodeGen/AArch64/late-taildup-computed-goto.ll @@ -0,0 +1,143 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 +; RUN: llc -tail-dup-pred-size=2 -tail-dup-succ-size=2 -o - %s | FileCheck %s + +target triple = "arm64-apple-macosx13.0.0" + +@opcode.targets = local_unnamed_addr constant [6 x ptr] [ptr blockaddress(@test_interp, %op1.bb), ptr blockaddress(@test_interp, %op6.bb), ptr blockaddress(@test_interp, %loop.header), ptr blockaddress(@test_interp, %op2.bb), ptr blockaddress(@test_interp, %op4.bb), ptr blockaddress(@test_interp, %op5.bb)] + +define void @test_interp(ptr %frame, ptr %dst) { +; CHECK-LABEL: test_interp: +; CHECK: ; %bb.0: ; %entry +; CHECK-NEXT: stp x24, x23, [sp, #-64]! ; 16-byte Folded Spill +; CHECK-NEXT: stp x22, x21, [sp, #16] ; 16-byte Folded Spill +; CHECK-NEXT: stp x20, x19, [sp, #32] ; 16-byte Folded Spill +; CHECK-NEXT: stp x29, x30, [sp, #48] ; 16-byte Folded Spill +; CHECK-NEXT: .cfi_def_cfa_offset 64 +; CHECK-NEXT: .cfi_offset w30, -8 +; CHECK-NEXT: .cfi_offset w29, -16 +; CHECK-NEXT: .cfi_offset w19, -24 +; CHECK-NEXT: .cfi_offset w20, -32 +; CHECK-NEXT: .cfi_offset w21, -40 +; CHECK-NEXT: .cfi_offset w22, -48 +; CHECK-NEXT: .cfi_offset w23, -56 +; CHECK-NEXT: .cfi_offset w24, -64 +; CHECK-NEXT: Lloh0: +; CHECK-NEXT: adrp x21, _opcode.targets@PAGE +; CHECK-NEXT: Lloh1: +; CHECK-NEXT: add x21, x21, _opcode.targets@PAGEOFF +; CHECK-NEXT: mov x24, xzr +; CHECK-NEXT: add x8, x21, xzr, lsl #3 +; CHECK-NEXT: mov x19, x1 +; CHECK-NEXT: mov x20, x0 +; CHECK-NEXT: mov x23, xzr +; CHECK-NEXT: mov w22, #1 ; =0x1 +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp0: ; Block address taken +; CHECK-NEXT: LBB0_1: ; %loop.header +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: mov x20, xzr +; CHECK-NEXT: mov x23, xzr +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp1: ; Block address taken +; CHECK-NEXT: LBB0_2: ; %op1.bb +; CHECK-NEXT: str xzr, [x19] +; CHECK-NEXT: Ltmp2: ; Block address taken +; CHECK-NEXT: LBB0_3: ; %op6.bb +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: ldr x0, [x20, #-8]! 
+; CHECK-NEXT: ldr x8, [x0, #8] +; CHECK-NEXT: str x22, [x0] +; CHECK-NEXT: ldr x8, [x8, #48] +; CHECK-NEXT: blr x8 +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp3: ; Block address taken +; CHECK-NEXT: LBB0_4: ; %op2.bb +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: mov x20, xzr +; CHECK-NEXT: str x23, [x19] +; CHECK-NEXT: mov x23, xzr +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: Ltmp4: ; Block address taken +; CHECK-NEXT: LBB0_5: ; %op4.bb +; CHECK-NEXT: Ltmp5: ; Block address taken +; CHECK-NEXT: LBB0_6: ; %op5.bb +; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1 +; CHECK-NEXT: str x23, [x19] +; CHECK-NEXT: ldur x8, [x23, #12] +; CHECK-NEXT: ldur x9, [x20, #-8] +; CHECK-NEXT: add x23, x23, #20 +; CHECK-NEXT: stp x8, x9, [x20, #-8] +; CHECK-NEXT: add x8, x21, x24, lsl #3 +; CHECK-NEXT: add x20, x20, #8 +; CHECK-NEXT: add x24, x24, #1 +; CHECK-NEXT: br x8 +; CHECK-NEXT: .loh AdrpAdd Lloh0, Lloh1 +entry: + br label %loop.header + +loop.header: + %iv = phi i64 [ 0, %entry ], [ %iv.next, %op1.bb ], [ %iv.next, %op2.bb ], [ %iv.next, %op4.bb ], [ %iv.next, %op5.bb ], [ %iv.next, %op6.bb ], [ %iv.next, %loop.header ] + %stack.pointer = phi ptr [ %frame, %entry ], [ %stack.8, %op1.bb ], [ null, %op2.bb ], [ %stack.next, %op4.bb ], [ %stack.next.2, %op5.bb ], [ %stack.4, %op6.bb ], [ null, %loop.header ] + %next.instr = phi ptr [ null, %entry ], [ %next.instr, %op1.bb ], [ null, %op2.bb ], [ %next.instr.20, %op4.bb ], [ %next.instr.21, %op5.bb ], [ %next.instr, %op6.bb ], [ null, %loop.header ] + %iv.next = add i64 %iv, 1 + %next_op = getelementptr [6 x ptr], ptr @opcode.targets, i64 0, i64 %iv + indirectbr ptr %next_op, [label %op1.bb, label %op6.bb, label %loop.header, label %op2.bb, label %op4.bb, label %op5.bb] + +op1.bb: + store ptr null, ptr %dst, align 8 + %stack.8 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.0 = load ptr, ptr %stack.8, align 8 + store i64 1, ptr %l.0, align 8 + %gep.0 = getelementptr i8, ptr %l.0, i64 8 + %l.1 = load ptr, ptr %gep.0, align 8 + %gep.1 = getelementptr i8, ptr %l.1, i64 48 + %l.2 = load ptr, ptr %gep.1, align 8 + tail call void %l.2(ptr nonnull %l.0) + br label %loop.header + +op2.bb: + store ptr %next.instr, ptr %dst, align 8 + br label %loop.header + +op4.bb: + store ptr %next.instr, ptr %dst, align 8 + %next.instr.20 = getelementptr i8, ptr %next.instr, i64 20 + %stack.2 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.3 = load ptr, ptr %stack.2, align 8 + %next.instr.12 = getelementptr i8, ptr %next.instr, i64 12 + %next.instr.12.val = load ptr, ptr %next.instr.12, align 2 + store ptr %next.instr.12.val, ptr %stack.2, align 8 + store ptr %l.3, ptr %stack.pointer, align 8 + %stack.next = getelementptr i8, ptr %stack.pointer, i64 8 + br label %loop.header + +op5.bb: + store ptr %next.instr, ptr %dst, align 8 + %next.instr.21 = getelementptr i8, ptr %next.instr, i64 20 + %stack.3 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.4 = load ptr, ptr %stack.3, align 8 + %next.instr.2 = getelementptr i8, ptr %next.instr, i64 12 + %next.instr.2.val = load ptr, ptr %next.instr.2, align 2 + store ptr %next.instr.2.val, ptr %stack.3, align 8 + store ptr %l.4, ptr %stack.pointer, align 8 + %stack.next.2 = getelementptr i8, ptr %stack.pointer, i64 8 + br label %loop.header + +op6.bb: + %stack.4 = getelementptr i8, ptr %stack.pointer, i64 -8 + %l.5 = load ptr, ptr %stack.4, align 8 + store i64 1, ptr %l.5, 
align 8 + %gep.5 = getelementptr i8, ptr %l.5, i64 8 + %l.6 = load ptr, ptr %gep.5, align 8 + %gep.6 = getelementptr i8, ptr %l.6, i64 48 + %l.7 = load ptr, ptr %gep.6, align 8 + tail call void %l.7(ptr nonnull %l.5) + br label %loop.header +} diff --git a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir similarity index 93% rename from llvm/test/CodeGen/X86/tail-dup-computed-goto.mir rename to llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir index 17de405928d37..0f2896463a8af 100644 --- a/llvm/test/CodeGen/X86/tail-dup-computed-goto.mir +++ b/llvm/test/CodeGen/X86/early-tail-dup-computed-goto.mir @@ -1,6 +1,8 @@ # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5 # RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=early-tailduplication -tail-dup-pred-size=1 -tail-dup-succ-size=1 %s -o - | FileCheck %s -# Check that only the computed goto is not be restrict by tail-dup-pred-size and tail-dup-succ-size. +# +# Check that only the computed goto and others are restricted by tail-dup-pred-size and tail-dup-succ-size. +# --- | @computed_goto.dispatch = constant [5 x ptr] [ptr null, ptr blockaddress(@computed_goto, %bb1), ptr blockaddress(@computed_goto, %bb2), ptr blockaddress(@computed_goto, %bb3), ptr blockaddress(@computed_goto, %bb4)] declare i64 @f0() @@ -30,54 +32,54 @@ tracksRegLiveness: true body: | ; CHECK-LABEL: name: computed_goto ; CHECK: bb.0: - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.1.bb1 (ir-block-address-taken %ir-block.bb1): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64_nosp = COPY [[COPY2]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.2.bb2 (ir-block-address-taken %ir-block.bb2): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: 
%bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gr64_nosp = COPY [[COPY4]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY4]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.3.bb3 (ir-block-address-taken %ir-block.bb3): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY6:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY7:%[0-9]+]]:gr64_nosp = COPY [[COPY6]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY6]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: JMP_1 %bb.5 ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: bb.4.bb4 (ir-block-address-taken %ir-block.bb4): - ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: successors: %bb.5(0x80000000) ; CHECK-NEXT: {{ $}} ; CHECK-NEXT: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f4, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax ; CHECK-NEXT: ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp - ; CHECK-NEXT: [[COPY8:%[0-9]+]]:gr64_nosp = COPY $rax - ; CHECK-NEXT: [[COPY9:%[0-9]+]]:gr64_nosp = COPY [[COPY8]] - ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY8]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.5: + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: [[PHI:%[0-9]+]]:gr64_nosp = PHI [[COPY]], %bb.0, [[COPY4]], %bb.4, [[COPY3]], %bb.3, [[COPY2]], %bb.2, [[COPY1]], %bb.1 + ; CHECK-NEXT: JMP64m $noreg, 8, [[PHI]], @computed_goto.dispatch, $noreg bb.0: ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax diff --git a/llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir b/llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir new file mode 100644 index 0000000000000..e272e7ee3cb0f --- /dev/null +++ b/llvm/test/CodeGen/X86/late-tail-dup-computed-goto.mir 
@@ -0,0 +1,128 @@ +# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 5 +# RUN: llc -mtriple=x86_64-unknown-linux-gnu -run-pass=tailduplication -tail-dup-pred-size=1 -tail-dup-succ-size=1 %s -o - | FileCheck %s +# +# Check that only the computed gotos are duplicated aggressively. +# +--- | + @computed_goto.dispatch = constant [5 x ptr] [ptr null, ptr blockaddress(@computed_goto, %bb1), ptr blockaddress(@computed_goto, %bb2), ptr blockaddress(@computed_goto, %bb3), ptr blockaddress(@computed_goto, %bb4)] + declare i64 @f0() + declare i64 @f1() + declare i64 @f2() + declare i64 @f3() + declare i64 @f4() + declare i64 @f5() + define void @computed_goto() { + start: + ret void + bb1: + ret void + bb2: + ret void + bb3: + ret void + bb4: + ret void + } + define void @jump_table() { ret void } + define void @jump_table_pic() { ret void } +... +--- +name: computed_goto +alignment: 1 +exposesReturnsTwice: false +legalized: false +regBankSelected: false +selected: false +failedISel: false +tracksRegLiveness: true +body: | + ; CHECK-LABEL: name: computed_goto + ; CHECK: bb.0: + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f0, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY]] + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]] + ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.1.bb1 (ir-block-address-taken %ir-block.bb1): + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f1, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY3:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY3]] + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]] + ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.2.bb2 (ir-block-address-taken %ir-block.bb2): + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f2, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY4:%[0-9]+]]:gr64 = COPY $rax + ; CHECK-NEXT: [[COPY1:%[0-9]+]]:gr64_nosp = COPY [[COPY4]] + ; CHECK-NEXT: [[COPY2:%[0-9]+]]:gr64_nosp = COPY [[COPY1]] + ; CHECK-NEXT: JMP64m $noreg, 8, [[COPY2]], @computed_goto.dispatch, $noreg + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: bb.3.bb3 (ir-block-address-taken %ir-block.bb3): + ; CHECK-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000) + ; CHECK-NEXT: {{ $}} + ; CHECK-NEXT: CALL64pcrel32 target-flags(x86-plt) @f3, csr_64, implicit $rsp, implicit $ssp, implicit-def $rsp, implicit-def $ssp, implicit-def $rax + ; CHECK-NEXT: [[COPY5:%[0-9]+]]:gr64 = COPY $rax + ;... [truncated] 
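To summarize how the duplication limits in the TailDuplicator.cpp hunk interact after this change, here is a condensed stand-alone sketch (names mirror the diff above; DefaultLimit and the parameter values stand in for the existing option-driven defaults and are not part of this patch):

    #include <algorithm>

    // Condensed sketch of the limit selection in shouldTailDuplicate after this
    // patch: pre-RA, blocks ending in indirect branches get the modest
    // TailDupIndirectBranchSize limit; post-RA, computed-goto blocks may be
    // duplicated more aggressively.
    static unsigned selectMaxDuplicateCount(bool HasIndirectbr, bool HasComputedGoto,
                                            bool PreRegAlloc, unsigned DefaultLimit,
                                            unsigned TailDupIndirectBranchSize) {
      unsigned MaxDuplicateCount = DefaultLimit;
      if (HasIndirectbr && PreRegAlloc)
        MaxDuplicateCount = TailDupIndirectBranchSize;
      if (HasComputedGoto && !PreRegAlloc)
        MaxDuplicateCount = std::max(MaxDuplicateCount, 10u);
      return MaxDuplicateCount;
    }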
@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-backend-x86

@fhahn
Contributor Author

fhahn commented Aug 1, 2025

For context, this fixes a regression in the Python interpreter on AArch64 compared to 20.1.0

Member

@dianqk dianqk left a comment


I expect a double-check for #106846 before the merge. Sorry I'm too busy to do this.

@github-project-automation github-project-automation bot moved this from Needs Triage to Needs Merge in LLVM Release Status Aug 1, 2025
@fhahn
Contributor Author

fhahn commented Aug 5, 2025

I expect a double-check for #106846 before the merge. Sorry I'm too busy to do this.

Sounds good! My understanding from @mikulas-patocka's response in #106846 is that the current patch didn't regress their use case.

For the reproducer shared there, I am also not seeing any regression on X86 Skylake (the only X86 system I have access to).

@tru
Collaborator

tru commented Aug 5, 2025

I will hold off on merging this until we have confirmation that we are good in that other ticket.

@tru
Collaborator

tru commented Aug 8, 2025

Seems like we are still waiting for confirmation on this one?

@fhahn
Contributor Author

fhahn commented Aug 8, 2025

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.

Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

@mikulas-patocka

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.

Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

There was a regression: the pointless move of %r14 to %r15 and back, and the load of %rsi from the stack, that I mentioned in #106846 are actually a regression that was recently added to git. With older clang-22 from Debian Sid (version 1:22~++20250731080150+be449d6b6587-1~exp1), these pointless instructions are not generated.

@fhahn
Contributor Author

fhahn commented Aug 9, 2025

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.
Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

There was a regression: the pointless move of %r14 to %r15 and back, and the load of %rsi from the stack, that I mentioned in #106846 are actually a regression that was recently added to git. With older clang-22 from Debian Sid (version 1:22~++20250731080150+be449d6b6587-1~exp1), these pointless instructions are not generated.

Right, but was there a runtime regression? After all, without the patch, Python on ARM64 macOS regresses by 2-3% for end-to-end workloads.

@mikulas-patocka

Seems like we are still waiting for confirmation on this one?

My understanding from @mikulas-patocka in #106846 is that there was no regression on current main after merging over a week ago, so I think we should be good to go.
Not sure what the remaining time-line is, but it would be good to make sure the regression is fixed in the release.

There was a regression: the pointless move of %r14 to %r15 and back, and the load of %rsi from the stack, that I mentioned in #106846 are actually a regression that was recently added to git. With older clang-22 from Debian Sid (version 1:22~++20250731080150+be449d6b6587-1~exp1), these pointless instructions are not generated.

Right, but was there a runtime regression? After all, without the patch, Python on ARM64 macOS regresses by 2-3% for end-to-end workloads.

There was a 9% runtime regression on a loop that just adds numbers (so it stresses this code path significantly). There was no measurable regression on Ajla self-compilation (the difference is just 0.8%, which may be jitter).

@fhahn
Contributor Author

fhahn commented Aug 11, 2025

There was a 9% runtime regression on a loop that just adds numbers (so it stresses this code path significantly). There was no measurable regression on Ajla self-compilation (the difference is just 0.8%, which may be jitter).

Thanks for clarifying! The regression seems like an X86 backend issue, not directly related to tail duplication for computed gotos. Given that there is no (or only a very small) end-to-end regression, I think the change should be included in the release, or an argument should be made for reverting/adjusting #150911 (the version of the PR that landed on main).

@tru
Collaborator

tru commented Aug 11, 2025

@RKSimon @phoebewang Since this is related to the X86 backend, what do you think about this PR, or should we maybe revert #150911?

@phoebewang
Contributor

Compared with immediate failures, performance regressions are hard to find and locate. I think that's the reason why we don't backport performance-related patches. So I suggest not backporting it, especially since there are known regression patterns. It's not persuasive that a single application doesn't regress. OTOH, I think it's OK to keep it in main. We will know if there's a large effect as time passes.

@fhahn
Contributor Author

fhahn commented Aug 14, 2025

Compared with immediate failures, performance regressions are hard to find and locate. I think that's the reason why we don't backport performance-related patches.

We can usually back-port patches to fix performance regressions, which this patch does (a 3% regression for Python workloads vs 20.1.0). For example, the patch that caused the regression was picked for 20.1.1, IIRC.

@fhahn
Contributor Author

fhahn commented Aug 19, 2025

Ping :)

If the X86 side of things is a blocker, we could also make it apply only to AArch64 or Apple platforms?

@phoebewang
Contributor

Ping :)

If the X86 side of things is a blocker, we could also make it apply only to AArch64 or Apple platforms?

Yeah, that sounds good to me.

…lvm#150911)

Back-ports additional tests (eb9febb4a6b0, dc697de12792), refactoring (43c9c14577db) and a functional change (18f1369297f4) in a single PR.

llvm#114990 allowed more aggressive tail duplication for computed-gotos in both pre- and post-regalloc tail duplication.

In some cases, performing tail-duplication too early can lead to worse results, especially if we duplicate blocks with a number of phi nodes.

This is causing a ~3% performance regression in some workloads using Python 3.12.

This patch updates TailDup to delay aggressive tail-duplication for computed gotos to after register allocation.

This means we can keep the non-duplicated version for a bit longer throughout the backend, which should reduce compile time as well as allow a number of optimizations and simplifications to trigger before drastically expanding the CFG.

For the case in llvm#106846, I get the same performance with and without this patch on Skylake.

PR: llvm#150911
@fhahn fhahn force-pushed the tail-dup-computed-goto-21.x branch from 0a2665f to c587c24 on August 20, 2025 at 10:07
@fhahn
Contributor Author

fhahn commented Aug 20, 2025

Ping :)
If the X86 side of things is a blocker, we could also make it apply only to AArch64 or Apple platforms?

Yeah, that sounds good to me.

Sounds good, I updated the PR to restrict this to just Apple platforms for now, for which we did careful perf evaluations.
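For illustration only, such a gate could look roughly like the following (a hypothetical sketch; the actual updated PR may key the restriction differently, for example on a different predicate or in a different place):

    #include "llvm/TargetParser/Triple.h"
    #include <algorithm>

    // Hypothetical sketch: apply the aggressive post-RA limit only on Apple
    // (Darwin) targets. Names follow the TailDuplicator.cpp hunk above; this is
    // not necessarily how the backported change implements the restriction.
    static unsigned applyComputedGotoLimit(unsigned MaxDuplicateCount,
                                           bool HasComputedGoto, bool PreRegAlloc,
                                           const llvm::Triple &TT) {
      if (HasComputedGoto && !PreRegAlloc && TT.isOSDarwin())
        return std::max(MaxDuplicateCount, 10u);
      return MaxDuplicateCount;
    }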

Contributor

@phoebewang phoebewang left a comment


LGTM.

@tru tru merged commit c587c24 into llvm:release/21.x Aug 21, 2025
13 checks passed
@github-project-automation github-project-automation bot moved this from Needs Merge to Done in LLVM Release Status Aug 21, 2025
@github-actions

@fhahn (or anyone else): if you would like to add a note about this fix in the release notes (completely optional), please reply to this comment with a one- or two-sentence description of the fix. When you are done, please add the release:note label to this PR.

@fhahn fhahn deleted the tail-dup-computed-goto-21.x branch August 21, 2025 14:25