@wangpc-pp
Contributor

@wangpc-pp wangpc-pp commented Jun 18, 2024

This can reduce some vtype toggles.

This can be done in pre-ra scheduling as we have moved insertion of
vsetvli after the first RA.

Currently, we override tryCandidate and add a new heuristic based
on comparison of vtypes.

Created using spr 1.3.6-beta.1
@wangpc-pp wangpc-pp requested review from asb and lukel97 and removed request for asb June 18, 2024 13:33
@llvmbot
Member

llvmbot commented Jun 18, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Pengcheng Wang (wangpc-pp)

Changes

This can reduce some vtype toggles.

This can be done in pre-ra scheduling as we have moved insertion of
vsetvli after the first RA.

Currently, this is just a PoC and I'd like to gather some feedback
to see if I should continue to finish this work.


Full diff: https://github.com/llvm/llvm-project/pull/95924.diff

7 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineScheduler.h (+35-8)
  • (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+1-33)
  • (modified) llvm/lib/Target/RISCV/CMakeLists.txt (+1)
  • (added) llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp (+83)
  • (added) llvm/lib/Target/RISCV/RISCVMachineScheduler.h (+42)
  • (modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+4-4)
  • (added) llvm/test/CodeGen/RISCV/rvv/schedule.ll (+49)
diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index b15abf040058e..d1b5b83e5300b 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1349,14 +1349,6 @@ class PostGenericScheduler : public GenericSchedulerBase {
   void pickNodeFromQueue(SchedBoundary &Zone, SchedCandidate &Cand);
 };
 
-/// Create the standard converging machine scheduler. This will be used as the
-/// default scheduler if the target does not set a default.
-/// Adds default DAG mutations.
-ScheduleDAGMILive *createGenericSchedLive(MachineSchedContext *C);
-
-/// Create a generic scheduler with no vreg liveness or DAG mutation passes.
-ScheduleDAGMI *createGenericSchedPostRA(MachineSchedContext *C);
-
 /// If ReorderWhileClustering is set to true, no attempt will be made to
 /// reduce reordering due to store clustering.
 std::unique_ptr<ScheduleDAGMutation>
@@ -1375,6 +1367,41 @@ std::unique_ptr<ScheduleDAGMutation>
 createCopyConstrainDAGMutation(const TargetInstrInfo *TII,
                                const TargetRegisterInfo *TRI);
 
+/// Create the standard converging machine scheduler. This will be used as the
+/// default scheduler if the target does not set a default.
+/// Adds default DAG mutations.
+template <typename Strategy = GenericScheduler>
+ScheduleDAGMILive *createGenericSchedLive(MachineSchedContext *C) {
+  ScheduleDAGMILive *DAG =
+      new ScheduleDAGMILive(C, std::make_unique<Strategy>(C));
+  // Register DAG post-processors.
+  //
+  // FIXME: extend the mutation API to allow earlier mutations to instantiate
+  // data and pass it to later mutations. Have a single mutation that gathers
+  // the interesting nodes in one pass.
+  DAG->addMutation(createCopyConstrainDAGMutation(DAG->TII, DAG->TRI));
+
+  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
+  // Add MacroFusion mutation if fusions are not empty.
+  const auto &MacroFusions = STI.getMacroFusions();
+  if (!MacroFusions.empty())
+    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
+  return DAG;
+}
+
+/// Create a generic scheduler with no vreg liveness or DAG mutation passes.
+template <typename Strategy = PostGenericScheduler>
+ScheduleDAGMI *createGenericSchedPostRA(MachineSchedContext *C) {
+  ScheduleDAGMI *DAG = new ScheduleDAGMI(C, std::make_unique<Strategy>(C),
+                                         /*RemoveKillFlags=*/true);
+  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
+  // Add MacroFusion mutation if fusions are not empty.
+  const auto &MacroFusions = STI.getMacroFusions();
+  if (!MacroFusions.empty())
+    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
+  return DAG;
+}
+
 } // end namespace llvm
 
 #endif // LLVM_CODEGEN_MACHINESCHEDULER_H
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index cf72f74380835..ac792ad4d5484 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -2701,7 +2701,7 @@ void SchedBoundary::bumpNode(SUnit *SU) {
   unsigned NextCycle = CurrCycle;
   switch (SchedModel->getMicroOpBufferSize()) {
   case 0:
-    assert(ReadyCycle <= CurrCycle && "Broken PendingQueue");
+    // assert(ReadyCycle <= CurrCycle && "Broken PendingQueue");
     break;
   case 1:
     if (ReadyCycle > NextCycle) {
@@ -3847,26 +3847,6 @@ void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   }
 }
 
-/// Create the standard converging machine scheduler. This will be used as the
-/// default scheduler if the target does not set a default.
-ScheduleDAGMILive *llvm::createGenericSchedLive(MachineSchedContext *C) {
-  ScheduleDAGMILive *DAG =
-      new ScheduleDAGMILive(C, std::make_unique<GenericScheduler>(C));
-  // Register DAG post-processors.
-  //
-  // FIXME: extend the mutation API to allow earlier mutations to instantiate
-  // data and pass it to later mutations. Have a single mutation that gathers
-  // the interesting nodes in one pass.
-  DAG->addMutation(createCopyConstrainDAGMutation(DAG->TII, DAG->TRI));
-
-  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
-  // Add MacroFusion mutation if fusions are not empty.
-  const auto &MacroFusions = STI.getMacroFusions();
-  if (!MacroFusions.empty())
-    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
-  return DAG;
-}
-
 static ScheduleDAGInstrs *createConvergingSched(MachineSchedContext *C) {
   return createGenericSchedLive(C);
 }
@@ -4139,18 +4119,6 @@ void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   }
 }
 
-ScheduleDAGMI *llvm::createGenericSchedPostRA(MachineSchedContext *C) {
-  ScheduleDAGMI *DAG =
-      new ScheduleDAGMI(C, std::make_unique<PostGenericScheduler>(C),
-                        /*RemoveKillFlags=*/true);
-  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
-  // Add MacroFusion mutation if fusions are not empty.
-  const auto &MacroFusions = STI.getMacroFusions();
-  if (!MacroFusions.empty())
-    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
-  return DAG;
-}
-
 //===----------------------------------------------------------------------===//
 // ILP Scheduler. Currently for experimental analysis of heuristics.
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 8715403f3839a..fe3f213b253f7 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -44,6 +44,7 @@ add_llvm_target(RISCVCodeGen
   RISCVISelDAGToDAG.cpp
   RISCVISelLowering.cpp
   RISCVMachineFunctionInfo.cpp
+  RISCVMachineScheduler.cpp
   RISCVMergeBaseOffset.cpp
   RISCVOptWInstrs.cpp
   RISCVPostRAExpandPseudoInsts.cpp
diff --git a/llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp b/llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp
new file mode 100644
index 0000000000000..d993d840c3d3a
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp
@@ -0,0 +1,83 @@
+//===- RISCVMachineScheduler.cpp - MI Scheduler for RISC-V ----------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "RISCVMachineScheduler.h"
+#include "MCTargetDesc/RISCVBaseInfo.h"
+#include "MCTargetDesc/RISCVMCTargetDesc.h"
+#include "RISCVInstrInfo.h"
+#include "RISCVSubtarget.h"
+#include "llvm/CodeGen/MachineOperand.h"
+#include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/CodeGen/ScheduleDAG.h"
+#include "llvm/MC/MCInstrDesc.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/TargetParser/RISCVTargetParser.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "riscv-prera-sched-strategy"
+
+static cl::opt<bool> EnableScheduleSameVType(
+    "riscv-enable-schedule-same-vtype", cl::init(false), cl::Hidden,
+    cl::desc("Enable scheduling RVV instructions with same vtype first"));
+
+SUnit *RISCVPreRAMachineSchedStrategy::pickNode(bool &IsTopNode) {
+  if (EnableScheduleSameVType) {
+    for (SUnit *SU : Bot.Available) {
+      MachineInstr *MI = SU->getInstr();
+      const MCInstrDesc &Desc = MI->getDesc();
+      if (RISCVII::hasSEWOp(Desc.TSFlags)) {
+        unsigned CurVSEW = MI->getOperand(RISCVII::getSEWOpNum(Desc)).getImm();
+        RISCVII::VLMUL CurVLMUL = RISCVII::getLMul(Desc.TSFlags);
+        if (CurVSEW == PrevVSEW && CurVLMUL == PrevVLMUL) {
+          Bot.removeReady(SU);
+          IsTopNode = true;
+          return SU;
+        }
+      }
+    }
+    for (SUnit *SU : Bot.Pending) {
+      MachineInstr *MI = SU->getInstr();
+      const MCInstrDesc &Desc = MI->getDesc();
+      if (RISCVII::hasSEWOp(Desc.TSFlags)) {
+        unsigned CurVSEW = MI->getOperand(RISCVII::getSEWOpNum(Desc)).getImm();
+        RISCVII::VLMUL CurVLMUL = RISCVII::getLMul(Desc.TSFlags);
+        if (CurVSEW == PrevVSEW && CurVLMUL == PrevVLMUL) {
+          Bot.removeReady(SU);
+          IsTopNode = false;
+          return SU;
+        }
+      }
+    }
+  }
+  return GenericScheduler::pickNode(IsTopNode);
+}
+
+bool RISCVPreRAMachineSchedStrategy::tryCandidate(SchedCandidate &Cand,
+                                                  SchedCandidate &TryCand,
+                                                  SchedBoundary *Zone) const {
+  bool OriginalResult = GenericScheduler::tryCandidate(Cand, TryCand, Zone);
+
+  return OriginalResult;
+}
+
+void RISCVPreRAMachineSchedStrategy::schedNode(SUnit *SU, bool IsTopNode) {
+  GenericScheduler::schedNode(SU, IsTopNode);
+  MachineInstr *MI = SU->getInstr();
+  const MCInstrDesc &Desc = MI->getDesc();
+  if (RISCVII::hasSEWOp(Desc.TSFlags)) {
+    PrevVSEW = MI->getOperand(RISCVII::getSEWOpNum(Desc)).getImm();
+    PrevVLMUL = RISCVII::getLMul(Desc.TSFlags);
+  }
+  LLVM_DEBUG(dbgs() << "Previous scheduled Unit: ";
+             dbgs() << "SU(" << SU->NodeNum << ") - "; SU->getInstr()->dump(););
+  LLVM_DEBUG(dbgs() << "Previous VSEW : " << (1 << PrevVSEW) << "\n";
+             auto LMUL = RISCVVType::decodeVLMUL(PrevVLMUL);
+             dbgs() << "Previous VLMUL: m" << (LMUL.second ? "f" : "")
+                    << LMUL.first << "\n";);
+}
diff --git a/llvm/lib/Target/RISCV/RISCVMachineScheduler.h b/llvm/lib/Target/RISCV/RISCVMachineScheduler.h
new file mode 100644
index 0000000000000..bd806cef57dcb
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVMachineScheduler.h
@@ -0,0 +1,42 @@
+//===--- RISCVMachineScheduler.h - Custom RISC-V MI scheduler ---*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Custom RISC-V MI scheduler.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_RISCV_RISCVMACHINESCHEDULER_H
+#define LLVM_LIB_TARGET_RISCV_RISCVMACHINESCHEDULER_H
+
+#include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/TargetParser/RISCVTargetParser.h"
+
+namespace llvm {
+
+/// A GenericScheduler implementation for RISCV pre RA scheduling.
+class RISCVPreRAMachineSchedStrategy : public GenericScheduler {
+private:
+  RISCVII::VLMUL PrevVLMUL;
+  unsigned PrevVSEW;
+
+public:
+  RISCVPreRAMachineSchedStrategy(const MachineSchedContext *C)
+      : GenericScheduler(C) {}
+
+protected:
+  SUnit *pickNode(bool &IsTopNode) override;
+
+  bool tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
+                    SchedBoundary *Zone) const override;
+
+  void schedNode(SUnit *SU, bool IsTopNode) override;
+};
+
+} // end namespace llvm
+
+#endif
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 35d0b3408d09f..e0dcbbddc3f53 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -14,6 +14,7 @@
 #include "MCTargetDesc/RISCVBaseInfo.h"
 #include "RISCV.h"
 #include "RISCVMachineFunctionInfo.h"
+#include "RISCVMachineScheduler.h"
 #include "RISCVTargetObjectFile.h"
 #include "RISCVTargetTransformInfo.h"
 #include "TargetInfo/RISCVTargetInfo.h"
@@ -340,12 +341,11 @@ class RISCVPassConfig : public TargetPassConfig {
 
   ScheduleDAGInstrs *
   createMachineScheduler(MachineSchedContext *C) const override {
-    ScheduleDAGMILive *DAG = nullptr;
-    if (EnableMISchedLoadClustering) {
-      DAG = createGenericSchedLive(C);
+    ScheduleDAGMILive *DAG =
+        createGenericSchedLive<RISCVPreRAMachineSchedStrategy>(C);
+    if (EnableMISchedLoadClustering)
       DAG->addMutation(createLoadClusterDAGMutation(
           DAG->TII, DAG->TRI, /*ReorderWhileClustering=*/true));
-    }
     return DAG;
   }
 
diff --git a/llvm/test/CodeGen/RISCV/rvv/schedule.ll b/llvm/test/CodeGen/RISCV/rvv/schedule.ll
new file mode 100644
index 0000000000000..baf15ef400df5
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/schedule.ll
@@ -0,0 +1,49 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=riscv64 -mcpu=sifive-x280 -verify-machineinstrs < %s \
+; RUN:   | FileCheck %s --check-prefix=DEFAULT
+; RUN: llc -mtriple=riscv64 -mcpu=sifive-x280 -riscv-enable-schedule-same-vtype -verify-machineinstrs < %s \
+; RUN:   | FileCheck %s --check-prefix=SAME-VTYPE-FIRST
+
+define <vscale x 1 x i64> @test(<vscale x 1 x i64> %v64_0, <vscale x 1 x i64> %v64_1, <vscale x 1 x i32> %v32_0, <vscale x 1 x i32> %v32_1) {
+; DEFAULT-LABEL: test:
+; DEFAULT:       # %bb.0: # %entry
+; DEFAULT-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
+; DEFAULT-NEXT:    vdiv.vv v12, v8, v9
+; DEFAULT-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; DEFAULT-NEXT:    vdiv.vv v13, v10, v11
+; DEFAULT-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
+; DEFAULT-NEXT:    vadd.vv v8, v8, v9
+; DEFAULT-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; DEFAULT-NEXT:    vadd.vv v9, v10, v11
+; DEFAULT-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
+; DEFAULT-NEXT:    vadd.vv v8, v8, v12
+; DEFAULT-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; DEFAULT-NEXT:    vadd.vv v9, v9, v13
+; DEFAULT-NEXT:    vwadd.wv v8, v8, v9
+; DEFAULT-NEXT:    ret
+;
+; SAME-VTYPE-FIRST-LABEL: test:
+; SAME-VTYPE-FIRST:       # %bb.0: # %entry
+; SAME-VTYPE-FIRST-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v12, v8, v9
+; SAME-VTYPE-FIRST-NEXT:    vdiv.vv v8, v8, v9
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v8, v12, v8
+; SAME-VTYPE-FIRST-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v9, v10, v11
+; SAME-VTYPE-FIRST-NEXT:    vdiv.vv v10, v10, v11
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v9, v9, v10
+; SAME-VTYPE-FIRST-NEXT:    vwadd.wv v8, v8, v9
+; SAME-VTYPE-FIRST-NEXT:    ret
+entry:
+  %0 = add <vscale x 1 x i64> %v64_0, %v64_1
+  %1 = add <vscale x 1 x i32> %v32_0, %v32_1
+  %2 = sdiv <vscale x 1 x i64> %v64_0, %v64_1
+  %3 = sdiv <vscale x 1 x i32> %v32_0, %v32_1
+  %4 = add <vscale x 1 x i64> %0, %2
+  %5 = add <vscale x 1 x i32> %1, %3
+
+  %6 = sext <vscale x 1 x i32> %5 to <vscale x 1 x i64>
+  %7 = add <vscale x 1 x i64> %4, %6
+  ret <vscale x 1 x i64> %7
+}
+
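For readers skimming the test expectations, the difference between the two orderings can be checked with a small standalone model. This is only a sketch, not LLVM code: `VType`, `countVTypeToggles`, and the two order helpers are hypothetical names, and the model assumes one `vsetvli` is needed for the first vector instruction and one more whenever the (SEW, LMUL) pair changes, which is simpler than the real compatibility rules used by the vsetvli insertion pass. The final `vwadd.wv` is modeled as running under e32/mf2, as in both check sequences.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a vtype: element width in bits plus an LMUL name.
struct VType {
  unsigned SEW;
  std::string LMUL;
  bool operator==(const VType &O) const {
    return SEW == O.SEW && LMUL == O.LMUL;
  }
};

// Counts the vsetvli instructions a naive insertion would need: one for the
// first vector instruction and one for every change of vtype afterwards.
unsigned countVTypeToggles(const std::vector<VType> &Seq) {
  unsigned Toggles = 0;
  const VType *Prev = nullptr;
  for (const VType &Cur : Seq) {
    assert(Cur.SEW != 0 && "SEW must be set");
    if (!Prev || !(*Prev == Cur))
      ++Toggles;
    Prev = &Cur;
  }
  return Toggles;
}

// Vector instructions of the DEFAULT check lines, in order.
std::vector<VType> defaultOrder() {
  const VType E64{64, "m1"}, E32{32, "mf2"};
  return {E64, E32, E64, E32, E64, E32, E32};
}

// Vector instructions of the SAME-VTYPE-FIRST check lines, in order.
std::vector<VType> sameVTypeFirstOrder() {
  const VType E64{64, "m1"}, E32{32, "mf2"};
  return {E64, E64, E64, E32, E32, E32, E32};
}
```

Under these assumptions the default order needs 6 toggles and the same-vtype-first order needs 2, matching the number of `vsetvli` lines in the two check sequences.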
; SAME-VTYPE-FIRST-NEXT: vadd.vv v12, v8, v9
; SAME-VTYPE-FIRST-NEXT: vdiv.vv v8, v8, v9
; SAME-VTYPE-FIRST-NEXT: vadd.vv v8, v12, v8
; SAME-VTYPE-FIRST-NEXT: vsetvli zero, zero, e32, mf2, ta, ma
Contributor

This is a pretty cool idea. Do you know how this impacts performance on a benchmark like SPEC?

Contributor

@michaelmaitland michaelmaitland left a comment

This is a pretty neat idea.

There are a few things we want to balance:

  1. Reduce the instruction count by reducing the number of vtype toggles
  2. Reduce the number of stalls due to latency (dependent result not ready)
  3. Reduce the number of stalls due to resource consumption (resources not available)

I am curious how we will be able to balance the three of these. In the current state of this patch, we are prioritizing (1) and falling back to GenericScheduler::pickNode(IsTopNode) to handle (2) and (3) only in the cases when we don't have the ability to do (1). It is unclear to me whether (1) should be so important that we ignore (2) and (3).

It would be nice to have some data on how the current proposed approach impacts performance of benchmarks. I'd also be curious to explore balancing heuristic (1) with (2) and (3) to see how that impacts performance.

@preames
Collaborator

preames commented Jun 18, 2024

Interesting prototype!

@michaelmaitland already responded with a good summary of the concerns, so let me just second him.

My default would be to assume that the vtype toggles are pretty cheap, and that we should purely be using (1) to tie break when (2) and (3) don't order scheduling, but I'll freely admit I don't have any strong data on this. I'd encourage you to run a few workloads, and see what you get with different heuristics.
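The "use (1) purely as a tie-break" idea can be illustrated with a lexicographic comparison. The following is a minimal standalone sketch under stated assumptions: `CandSummary`, `isBetter`, and `pickBest` are hypothetical names and do not correspond to LLVM's `SchedCandidate` or `tryCandidate`, and the real scheduler weighs more criteria (register pressure, clustering, and so on) than the three fields modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical per-candidate summary; the fields are illustrative only.
struct CandSummary {
  unsigned StallCycles;   // (2) cycles lost waiting for operand latency
  unsigned ResourceDelay; // (3) cycles lost waiting for a busy resource
  bool MatchesPrevVType;  // (1) same vtype as the last scheduled RVV instr
};

// Returns true if TryCand should be preferred over Cand. Latency and
// resource stalls are compared first; the vtype match only breaks ties.
bool isBetter(const CandSummary &TryCand, const CandSummary &Cand) {
  auto Key = [](const CandSummary &C) {
    // Tuples compare lexicographically and smaller is better, so a vtype
    // match is encoded as 0 and a mismatch as 1.
    return std::make_tuple(C.StallCycles, C.ResourceDelay,
                           C.MatchesPrevVType ? 0u : 1u);
  };
  return Key(TryCand) < Key(Cand);
}

// Returns the index of the best candidate under isBetter.
std::size_t pickBest(const std::vector<CandSummary> &Cands) {
  assert(!Cands.empty() && "need at least one candidate");
  std::size_t Best = 0;
  for (std::size_t I = 1; I < Cands.size(); ++I)
    if (isBetter(Cands[I], Cands[Best]))
      Best = I;
  return Best;
}
```

With this ordering, a candidate that matches the previous vtype but would stall never beats a non-matching candidate that does not stall. Making vtype matching more important would amount to moving its term earlier in the key.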

@kito-cheng kito-cheng requested a review from BeMg June 18, 2024 14:52
if (SUnit *SU = FindPotentialRVVInstruction(Top, true))
return SU;
} else {
if (SUnit *SU =
Contributor

@michaelmaitland michaelmaitland Jun 20, 2024

In the GenericScheduler, we tend not to pick from the Pending queues. It is usually better to move the node to Available and keep pick functions to take from Available. Otherwise, there are two cases we run into:

  1. HazardRecognizers try and keep nodes on the Pending queue and this code here will ignore that. It will be really hard to keep the intended functionality of HazardRecognizers if we pick from Pending.
  2. A node is Pending because it will lead to stalls according to scheduler model. Picking from it ignores the scheduler model.

Do we really need to pick from Pending here?
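The Available/Pending distinction described above can be modeled in isolation. This is a simplified sketch, assuming a node becomes available once its ready cycle has been reached; `Node`, `Boundary`, `releasePending`, and `pickSameVType` are hypothetical names, and the real `SchedBoundary` also consults hazard recognizers and the scheduling model before releasing a node.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical scheduling node: an id, the cycle at which it becomes ready,
// and an opaque key identifying its vtype.
struct Node {
  int Id;
  unsigned ReadyCycle;
  unsigned VTypeKey;
};

struct Boundary {
  unsigned CurrCycle = 0;
  std::vector<Node> Available;
  std::vector<Node> Pending;

  // Moves every pending node whose ready cycle has arrived to Available.
  void releasePending() {
    auto FirstReady = std::stable_partition(
        Pending.begin(), Pending.end(),
        [this](const Node &N) { return N.ReadyCycle > CurrCycle; });
    Available.insert(Available.end(), FirstReady, Pending.end());
    Pending.erase(FirstReady, Pending.end());
  }
};

// Picks an Available node with the given vtype key and returns its id, or
// returns -1 so the caller can fall back to the generic heuristics. Pending
// nodes are never picked directly.
int pickSameVType(Boundary &B, unsigned PrevVTypeKey) {
  B.releasePending();
  auto It = std::find_if(
      B.Available.begin(), B.Available.end(),
      [PrevVTypeKey](const Node &N) { return N.VTypeKey == PrevVTypeKey; });
  if (It == B.Available.end())
    return -1;
  int Id = It->Id;
  assert(It->ReadyCycle <= B.CurrCycle && "picked a node that is not ready");
  B.Available.erase(It);
  return Id;
}
```

In this model a matching node that is still pending is simply not a candidate yet, so the stall information encoded in its ready cycle is respected.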

Contributor Author

No, I think we shouldn't. I did this just because I wanted a quick prototype 😄.
I will make it reasonable later.

Contributor

@michaelmaitland michaelmaitland Jun 20, 2024

I gave some more thought on this. I think we can reuse GenericScheduler::pickNode and we should instead be overriding pickNodeFromQueue. pickNodeFromQueue is where the real picking of the node happens. GenericScheduler::pickNode just gets the correct queue based on direction and passes it to pickNodeFromQueue. Something like this:

SUnit *RISCVPreRAMachineSchedStrategy::pickNodeFromQueue(
    SchedBoundary &Zone, const CandPolicy &ZonePolicy,
    const RegPressureTracker &RPTracker, SchedCandidate &Cand) {
  SchedCandidate RVVCand = FindRVVCandidate(Zone);
  GenericScheduler::pickNodeFromQueue(Zone, ZonePolicy, RPTracker, Cand);
  // Pass SchedBoundary only when comparing nodes from the same boundary.
  SchedBoundary *ZoneArg = Cand.AtTop == RVVCand.AtTop ? &Zone : nullptr;
  // TODO: we need to add our own heuristics here or inside an overridden
  // tryCandidate to make sure that we balance clustering RVV with same vtype
  // with the existing heuristics such as register pressure, latency,
  // resource usage, etc.
  if (tryCandidate(RVVCand, Cand, ZoneArg))
    return RVVCand.SU;
  return Cand.SU;
}
if (SUnit *SU =
FindPotentialRVVInstructionInQueue(Top, Top.Pending, true))
return SU;
}
Contributor

In the bidirectional case, do you need to tryCandidate to compare whether the Top or Bot candidate is better?

Contributor Author

Yeah, good point! Will do it in next revision!

@wangpc-pp
Contributor Author

We need auto-vectorization to generate RVV instructions, so I just tested the runtime of TSVC on a K230 board.
Options (use -mtune=sifive-p670 here just for vector scheduling):

  • before: clang -O2 -march=rv64gcv_zba_zbb_zbs_zbc -mtune=sifive-p670
  • after: clang -O2 -march=rv64gcv_zba_zbb_zbs_zbc -mtune=sifive-p670 -mllvm -riscv-enable-schedule-same-vtype
name before after after/before
s2102 73.541 30.665 0.416978
s252 7.994 5.858 0.7328
s114 57.398 50.594 0.881459
s1111 31.729 28.35 0.893504
s4116 13.206 11.828 0.895653
s1421 22.705 20.353 0.89641
s4115 16.842 15.114 0.897399
s471 6.411 5.818 0.907503
s3251 36.247 33.349 0.920049
s422 49.604 45.715 0.921599
s4112 14.28 13.164 0.921849
s152 14.872 13.712 0.922001
s1115 70.36 65.187 0.926478
s128 16.384 15.225 0.92926
s141 55.395 51.796 0.93503
s2710 10.834 10.207 0.942127
s424 23.916 22.763 0.95179
s1213 27.269 25.964 0.952143
s221 16.418 15.638 0.952491
s122 7.898 7.552 0.956191
s4114 16.081 15.385 0.956719
s1119 9.899 9.48 0.957672
s119 10.922 10.521 0.963285
s276 67.157 64.79 0.964754
s491 15.247 14.725 0.965764
s4113 18.565 17.954 0.967089
s323 27.369 26.769 0.978077
s118 36.679 35.953 0.980207
s279 13.361 13.124 0.982262
s353 19.059 18.731 0.98279
s256 22.725 22.37 0.984378
s222 18.044 17.763 0.984427
s243 14.326 14.125 0.98597
s115 34.492 34.13 0.989505
s1281 55.798 55.236 0.989928
s318 14.222 14.121 0.992898
s241 45.126 44.829 0.993418
s274 14.573 14.482 0.993756
s1251 55.382 55.073 0.994421
s111 11.033 10.981 0.995287
s351 54.015 53.796 0.995946
vif 4.387 4.37 0.996125
s292 9.011 8.978 0.996338
s258 0.293 0.292 0.996587
s3110 17.651 17.594 0.996771
s116 64.726 64.517 0.996771
s321 11.351 11.32 0.997269
s242 8.126 8.104 0.997293
s442 11.146 11.118 0.997488
s453 14.043 14.012 0.997792
s3112 7.48 7.464 0.997861
s291 16.383 16.348 0.997864
s162 7.632 7.616 0.997904
s124 9.485 9.469 0.998313
s251 30.493 30.448 0.998524
s1221 4.267 4.261 0.998594
s341 25.323 25.292 0.998776
s313 36.626 36.582 0.998799
vdotr 73.265 73.178 0.998813
s481 28.464 28.441 0.999192
s123 17.599 17.586 0.999261
s482 27.494 27.481 0.999527
s1113 25.105 25.094 0.999562
vtvtv 29.63 29.619 0.999629
vag 22.7 22.692 0.999648
s2111 52.88 52.865 0.999716
s322 10.58 10.577 0.999716
s332 14.874 14.87 0.999731
s3111 7.481 7.479 0.999733
s352 85.591 85.574 0.999801
vsumr 62.344 62.334 0.99984
s1161 28.72 28.716 0.999861
s132 18.589 18.587 0.999892
s317 22.739 22.737 0.999912
s316 74.607 74.601 0.99992
s000 6.71 6.71 1
s113 15.401 15.401 1
s331 12.723 12.723 1
s314 84.298 84.299 1.000012
s277 22.83 22.835 1.000219
vpvpv 29.128 29.135 1.00024
s312 83.805 83.826 1.000251
s3113 92.538 92.565 1.000292
s443 20.051 20.057 1.000299
s311 62.329 62.351 1.000353
s4121 7.283 7.286 1.000412
s271 29.64 29.657 1.000574
s342 29.462 29.485 1.000781
s1279 11.357 11.366 1.000792
s261 20.939 20.964 1.001194
vpvtv 29.442 29.482 1.001359
s2244 7.904 7.915 1.001392
s2711 29.644 29.686 1.001417
s452 29.449 29.494 1.001528
s112 27.384 27.45 1.00241
s173 25.803 25.872 1.002674
s315 21.168 21.241 1.003449
s1351 40.024 40.163 1.003473
s244 21.168 21.243 1.003543
s232 4.743 4.767 1.00506
s278 16.781 16.872 1.005423
s2712 30.811 31.016 1.006653
s174 25.666 25.859 1.00752
vbor 2.419 2.438 1.007854
s161 22.161 22.336 1.007897
s212 18.311 18.493 1.009939
s121 18.121 18.305 1.010154
s151 30.192 30.499 1.010168
vpvts 5.893 5.953 1.010182
s127 10.654 10.764 1.010325
s175 6.044 6.107 1.010424
s131 30.186 30.502 1.010468
s176 5.869 5.933 1.010905
s171 5.608 5.673 1.011591
s431 56.242 56.915 1.011966
vtv 56.171 56.866 1.012373
vpv 56.18 56.909 1.012976
s2251 33.527 33.963 1.013004
s172 5.61 5.685 1.013369
s273 14.475 14.671 1.013541
s423 28.373 28.785 1.014521
s2101 18.673 18.945 1.014566
s319 120.721 122.631 1.015822
vas 20.567 20.901 1.01624
s293 4.169 4.25 1.019429
s2233 251.383 256.287 1.019508
s451 17.201 17.579 1.021975
s126 22.997 23.542 1.023699
s257 42.741 43.761 1.023865
s235 459.815 470.903 1.024114
s231 232.829 238.524 1.02446
s441 15.573 15.96 1.024851
s281 28.693 29.479 1.027393
s1232 237.11 244.2 1.029902
s233 662.192 684.512 1.033706
s253 11.363 11.769 1.03573
s125 7.407 7.688 1.037937
s211 26.624 27.68 1.039663
s13110 17.562 18.347 1.044699
s275 38.142 39.928 1.046825
va 24.023 25.205 1.049203
s421 25.037 26.426 1.055478
s1244 40.337 42.873 1.06287
s343 23.314 24.934 1.069486
s1112 14.517 15.864 1.092788
s272 11.442 12.529 1.095001
s2275 721.855 799.062 1.106956
s254 21.049 24.103 1.14509
s31111 12.804 16.004 1.249922
s255 7.586 9.541 1.257712
s4117 8.046 10.265 1.275789
Geomean     0.993474

We can see some improvements and some regressions as well. In total, we don't have much gain here (about 0.65%).
The result is highly implementation-specific, and it may not be so convincing.
I will do more benchmarking and improve the heuristics.
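As a side note on how the summary figure relates to the table, the "about 0.65%" follows from the reported geomean of the after/before ratios. The helpers below are a minimal sketch with hypothetical names (`geomean`, `improvementPercent`); they only show the arithmetic and are not part of the benchmark harness.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of after/before runtime ratios, computed in log space to
// avoid overflow or underflow for long lists. Returns 1.0 for an empty list.
double geomean(const std::vector<double> &Ratios) {
  if (Ratios.empty())
    return 1.0;
  double LogSum = 0.0;
  for (double R : Ratios) {
    assert(R > 0.0 && "ratios must be positive");
    LogSum += std::log(R);
  }
  return std::exp(LogSum / static_cast<double>(Ratios.size()));
}

// Converts an after/before ratio into a percentage runtime reduction;
// negative values indicate a slowdown.
double improvementPercent(double Ratio) { return (1.0 - Ratio) * 100.0; }
```

For the reported geomean, `improvementPercent(0.993474)` gives roughly 0.65, which is the overall gain quoted above. A ratio pair such as 0.5 and 2.0 has a geomean of 1.0, which is why a few large wins and a few large regressions can cancel out.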

@BeMg
Contributor

BeMg commented Jun 20, 2024

Besides latency and instruction count, register pressure also needs to be considered. IIRC, GenericScheduler::tryCandidate should be responsible for it.

@michaelmaitland
Contributor

michaelmaitland commented Jun 20, 2024

What do you think about the following idea:
1. RISCVMachineScheduler does RISCVMachineScheduler::pickNodeFromQueue, and the only job is to group RVV instructions according to same vtype
2. Run RISCVVSETVLIInsertion
3. Run MachineScheduler, whose job is to put the instructions in a good order for register allocation, also taking into account latencies and processor resources
4. Run register allocation
5. If the subtarget enables PostMachineScheduler, run it.

This approach would keep the RISCVMachineScheduler simple, since it could ignore register pressure, latencies, and processor resource usage. By running the normal MachineScheduler after VSETVLI insertion, the hope is that we have more scheduling freedom, since fewer vsetvli instructions mean fewer instruction dependencies. At this point, we are accounting for register pressure, latencies, and processor resources.

EDIT: I forgot RISCVVSETVLIInsertion is after RA, so you can ignore this idea. You probably need to balance grouping vtypes, latencies, register pressure, and resource usage at the same time; otherwise, a separate-pass approach will undo changes made in the first pass.

@BeMg
Contributor

BeMg commented Jun 21, 2024

We could use mutation to constrain the same group of vtype instructions instead of the vsetvli insertion to create barriers between instructions. Based on my experience, this approach still disrupts some patterns in step 3. At best, it eliminates some vsetvli instructions; at worst, it introduces additional spills and reloads. This doesn't seem ideal.

@wangpc-pp
Contributor Author

I have thought about the mutation approach before, but I haven't tried it. I think it could be another feasible approach. Do you have a prototype that can be evaluated?

@BeMg
Contributor

BeMg commented Jun 22, 2024

Sure. I have two prototype to share.

  1. BeMg@a5c4dfb Use mutation to create cluster dependence between instruction in same vsetvl configuration
  2. BeMg@3e63dc1 Overload tryCandidate to make vsetvl be aware by machine scheduler.

They both reuse the vsetvli pass VSETVLInfo to check two instruction exist the same configuration.

The second one is more like approach in this patch but modifying tryCandidate to change the heuristic. It is a way to implement the tie break between vtypes, latencies, register pressures, and resource usage.

I think these PoCs still have room for improvement, but they are enough to collect data for comparison. (For example, the mutation could use a weak dependence instead of a cluster edge, and the custom scheduling strategy could change the priority of the vsetvli-aware heuristic.)

If you find anything useful in these prototypes, feel free to integrate it into this patch.

Mutation SPEC2k17 data

| Benchmark | Before Spills | After Spills | Spills Diff | Before Reloads | After Reloads | Reloads Diff | Before vsetvl | After vsetvl | vsetvl Diff |
|---|---|---|---|---|---|---|---|---|---|
| 500.perlbench_r | 4307 | 4307 | 0 | 10658 | 10658 | 0 | 529 | 530 | -1 |
| 502.gcc_r | 13239 | 13242 | -3 | 29697 | 29701 | -4 | 2857 | 2793 | 64 |
| 505.mcf_r | 119 | 119 | 0 | 330 | 330 | 0 | 38 | 38 | 0 |
| 520.omnetpp_r | 796 | 904 | -108 | 1552 | 1855 | -303 | 1005 | 1016 | -11 |
| 523.xalancbmk_r | 1835 | 1835 | 0 | 2563 | 2563 | 0 | 2863 | 2854 | 9 |
| 525.x264_r | 4131 | 4135 | -4 | 8033 | 8051 | -18 | 3035 | 2908 | 127 |
| 531.deepsjeng_r | 328 | 343 | -15 | 677 | 695 | -18 | 313 | 328 | -15 |
| 541.leela_r | 365 | 365 | 0 | 591 | 591 | 0 | 142 | 133 | 9 |
| 548.exchange2_r | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 557.xz_r | 347 | 362 | -15 | 697 | 711 | -14 | 252 | 259 | -7 |

Custom Sched SPEC2k17 data

| Benchmark | Before Spills | After Spills | Spills Diff | Before Reloads | After Reloads | Reloads Diff | Before vsetvl | After vsetvl | vsetvl Diff |
|---|---|---|---|---|---|---|---|---|---|
| 500.perlbench_r | 4307 | 4307 | 0 | 10658 | 10658 | 0 | 529 | 529 | 0 |
| 502.gcc_r | 13239 | 13236 | 3 | 29697 | 29694 | 3 | 2857 | 2843 | 14 |
| 505.mcf_r | 119 | 119 | 0 | 330 | 330 | 0 | 38 | 38 | 0 |
| 520.omnetpp_r | 796 | 796 | 0 | 1552 | 1552 | 0 | 1005 | 1003 | 2 |
| 523.xalancbmk_r | 1835 | 1838 | -3 | 2563 | 2557 | 6 | 2863 | 2851 | 12 |
| 525.x264_r | 4131 | 4133 | -2 | 8033 | 8037 | -4 | 3035 | 2950 | 85 |
| 531.deepsjeng_r | 328 | 328 | 0 | 677 | 677 | 0 | 313 | 312 | 1 |
| 541.leela_r | 365 | 365 | 0 | 591 | 591 | 0 | 142 | 142 | 0 |
| 548.exchange2_r | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 557.xz_r | 347 | 347 | 0 | 697 | 697 | 0 | 252 | 253 | -1 |
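
To compare the two prototypes at a glance, the standalone sketch below sums the per-benchmark differences from the two tables. The Diff columns are Before minus After, so a positive value is an improvement. The struct and function names are made up for this example; the numbers are copied from the tables above.

```cpp
#include <cassert>
#include <vector>

// One benchmark row: before/after counts for spills, reloads, and vsetvl.
struct BenchRow {
  const char *Name;
  int SpillsBefore, SpillsAfter;
  int ReloadsBefore, ReloadsAfter;
  int VsetvlBefore, VsetvlAfter;
};

struct DiffTotals {
  int Spills = 0, Reloads = 0, Vsetvl = 0;
};

// Sums Before - After over all rows (positive means fewer instructions).
DiffTotals sumDiffs(const std::vector<BenchRow> &Rows) {
  assert(!Rows.empty() && "expected at least one benchmark row");
  DiffTotals T;
  for (const BenchRow &R : Rows) {
    T.Spills += R.SpillsBefore - R.SpillsAfter;
    T.Reloads += R.ReloadsBefore - R.ReloadsAfter;
    T.Vsetvl += R.VsetvlBefore - R.VsetvlAfter;
  }
  return T;
}

const std::vector<BenchRow> MutationRows = {
    {"500.perlbench_r", 4307, 4307, 10658, 10658, 529, 530},
    {"502.gcc_r", 13239, 13242, 29697, 29701, 2857, 2793},
    {"505.mcf_r", 119, 119, 330, 330, 38, 38},
    {"520.omnetpp_r", 796, 904, 1552, 1855, 1005, 1016},
    {"523.xalancbmk_r", 1835, 1835, 2563, 2563, 2863, 2854},
    {"525.x264_r", 4131, 4135, 8033, 8051, 3035, 2908},
    {"531.deepsjeng_r", 328, 343, 677, 695, 313, 328},
    {"541.leela_r", 365, 365, 591, 591, 142, 133},
    {"548.exchange2_r", 0, 0, 0, 0, 0, 0},
    {"557.xz_r", 347, 362, 697, 711, 252, 259},
};

const std::vector<BenchRow> CustomSchedRows = {
    {"500.perlbench_r", 4307, 4307, 10658, 10658, 529, 529},
    {"502.gcc_r", 13239, 13236, 29697, 29694, 2857, 2843},
    {"505.mcf_r", 119, 119, 330, 330, 38, 38},
    {"520.omnetpp_r", 796, 796, 1552, 1552, 1005, 1003},
    {"523.xalancbmk_r", 1835, 1838, 2563, 2557, 2863, 2851},
    {"525.x264_r", 4131, 4133, 8033, 8037, 3035, 2950},
    {"531.deepsjeng_r", 328, 328, 677, 677, 313, 312},
    {"541.leela_r", 365, 365, 591, 591, 142, 142},
    {"548.exchange2_r", 0, 0, 0, 0, 0, 0},
    {"557.xz_r", 347, 347, 697, 697, 252, 253},
};
```

Summed this way, the mutation prototype removes 175 vsetvl instructions but adds 145 spills and 357 reloads, while the custom scheduler removes 113 vsetvl instructions, adds 2 spills, and removes 5 reloads. These are static instruction counts only, so they say nothing about runtime.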
@wangpc-pp
Contributor Author

@BeMg Hi Piyou, I don't have much time to take this PR further. If you are interested and have time, feel free to continue the work you did before. Thanks! :-)

Created using spr 1.3.6-beta.1
@wangpc-pp changed the title from "[RISCV][PoC] Schedule RVV instructions with same type first" to "[RISCV] Schedule RVV instructions with compatible type first" on Dec 2, 2025
@wangpc-pp
Contributor Author

I'd like to push this forward. @BeMg, have you made any progress here?

I want to divide this into several parts:

  1. Separate the computation of vtype info in RISCVInsertVSETVLI.cpp into a reusable standalone file.
  2. Add a RISC-V-specific scheduler (which may initially do nothing).
  3. Add the new vtype-based scheduling heuristic.
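
One way to read "compatible" in the new title, as opposed to "same", is sketched below. This is a standalone simplification and not the vtype computation in RISCVInsertVSETVLI.cpp: `ToyVType`, `ToyDemanded`, `sameVType`, and `isCompatibleVType` are hypothetical names, and the set of fields is reduced to SEW, an integer LMUL, and the two policy bits. The assumption is that an instruction only cares about the vtype fields it actually demands.

```cpp
#include <cassert>

// Hypothetical, reduced vtype state.
struct ToyVType {
  unsigned SEW;      // element width in bits
  unsigned LMUL;     // integer register-group multiplier only
  bool TailAgnostic;
  bool MaskAgnostic;
};

// Which vtype fields an instruction actually depends on (assumed model).
struct ToyDemanded {
  bool SEW;
  bool LMUL;
  bool TailPolicy;
  bool MaskPolicy;
};

// Strict equality: every field must match.
bool sameVType(const ToyVType &A, const ToyVType &B) {
  return A.SEW == B.SEW && A.LMUL == B.LMUL &&
         A.TailAgnostic == B.TailAgnostic && A.MaskAgnostic == B.MaskAgnostic;
}

// Relaxed check: Cur is compatible with Req if they agree on every field the
// instruction demands; fields it does not demand are free to differ.
bool isCompatibleVType(const ToyDemanded &Used, const ToyVType &Cur,
                       const ToyVType &Req) {
  assert(Cur.SEW != 0 && Req.SEW != 0 && "vtype must be known");
  if (Used.SEW && Cur.SEW != Req.SEW)
    return false;
  if (Used.LMUL && Cur.LMUL != Req.LMUL)
    return false;
  if (Used.TailPolicy && Cur.TailAgnostic != Req.TailAgnostic)
    return false;
  if (Used.MaskPolicy && Cur.MaskAgnostic != Req.MaskAgnostic)
    return false;
  return true;
}
```

Under this model, an instruction that demands only SEW can follow another e32 instruction without a new vsetvli even if LMUL or the policy bits differ, so a heuristic based on compatibility can group more instructions than one based on strict equality.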

cc @BeMg @lukel97 @topperc @preames @mshockwave

@github-actions
Copy link

github-actions bot commented Dec 2, 2025

🐧 Linux x64 Test Results

  • 166592 tests passed
  • 2891 tests skipped
  • 1 test failed

Failed Tests


LLVM

LLVM.CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll (Likely Already Failing) This test is already failing at the base commit.
Exit Code: 1
Command Output (stdout):
--
# RUN: at line 2
/home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll | /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/FileCheck -check-prefix=GISEL-GFX12 /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
# executed command: /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200
# note: command had no output on stdout or stderr
# executed command: /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/FileCheck -check-prefix=GISEL-GFX12 /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
# .---command stderr------------
# | /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll:109:21: error: GISEL-GFX12-NEXT: expected string not found in input
# | ; GISEL-GFX12-NEXT: s_wait_alu 0xfffe
# | ^
# | <stdin>:103:17: note: scanning from here
# | s_alloc_vgpr 64
# | ^
# | <stdin>:104:2: note: possible intended match here
# | s_wait_alu depctr_sa_sdst(0)
# | ^
# |
# | Input file: <stdin>
# | Check file: /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
# |
# | -dump-input=help explains the following input dump.
# |
# | Input was:
# | <<<<<<
# | .
# | .
# | .
# | 98: s_wait_kmcnt 0x0
# | 99: s_mov_b32 s30, callee_high_sgpr@abs32@lo
# | 100: s_mov_b32 s31, callee_high_sgpr@abs32@hi
# | 101: s_mov_b32 s34, retry_vgpr_alloc@abs32@lo
# | 102: s_mov_b32 s35, retry_vgpr_alloc@abs32@hi
# | 103: s_alloc_vgpr 64
# | next:109'0 X error: no match found
# | 104: s_wait_alu depctr_sa_sdst(0)
# | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | next:109'1 ? possible intended match
# | 105: s_cselect_b64 s[30:31], s[30:31], s[34:35]
# | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | 106: s_cselect_b32 exec_lo, 7, -1
# | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | 107: s_wait_alu depctr_sa_sdst(0)
# | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | 108: s_setpc_b64 s[30:31]
# | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~
# | 109: .Lfunc_end2:
# | next:109'0 ~~~~~~~~~~~~~
# | .
# | .
# | .
# | >>>>>>
# `-----------------------------
# error: command failed with exit status: 1
--

If these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues and add the infrastructure label.
