[RISCV] Schedule RVV instructions with compatible type first #95924

wangpc-pp · 2024-06-18T13:33:34Z

This can reduce some vtype toggles.

This can be done in pre-ra scheduling as we have moved insertion of
vsetvli after the first RA.

Currently, we override tryCandidate and add a new heuristic based
on comparison of vtypes.

Created using spr 1.3.6-beta.1

llvmbot · 2024-06-18T13:34:10Z

@llvm/pr-subscribers-backend-risc-v

Author: Pengcheng Wang (wangpc-pp)

Changes

This can reduce some vtype toggles.

This can be done in pre-ra scheduling as we have moved insertion of
vsetvli after the first RA.

Currently, this is just a PoC and I'd like to gather some feedback
to see if I should continue to finish this work.

Full diff: https://github.com/llvm/llvm-project/pull/95924.diff

7 Files Affected:

(modified) llvm/include/llvm/CodeGen/MachineScheduler.h (+35-8)
(modified) llvm/lib/CodeGen/MachineScheduler.cpp (+1-33)
(modified) llvm/lib/Target/RISCV/CMakeLists.txt (+1)
(added) llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp (+83)
(added) llvm/lib/Target/RISCV/RISCVMachineScheduler.h (+42)
(modified) llvm/lib/Target/RISCV/RISCVTargetMachine.cpp (+4-4)
(added) llvm/test/CodeGen/RISCV/rvv/schedule.ll (+49)

diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index b15abf040058e..d1b5b83e5300b 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1349,14 +1349,6 @@ class PostGenericScheduler : public GenericSchedulerBase {
   void pickNodeFromQueue(SchedBoundary &Zone, SchedCandidate &Cand);
 };
 
-/// Create the standard converging machine scheduler. This will be used as the
-/// default scheduler if the target does not set a default.
-/// Adds default DAG mutations.
-ScheduleDAGMILive *createGenericSchedLive(MachineSchedContext *C);
-
-/// Create a generic scheduler with no vreg liveness or DAG mutation passes.
-ScheduleDAGMI *createGenericSchedPostRA(MachineSchedContext *C);
-
 /// If ReorderWhileClustering is set to true, no attempt will be made to
 /// reduce reordering due to store clustering.
 std::unique_ptr<ScheduleDAGMutation>
@@ -1375,6 +1367,41 @@ std::unique_ptr<ScheduleDAGMutation>
 createCopyConstrainDAGMutation(const TargetInstrInfo *TII,
                                const TargetRegisterInfo *TRI);
 
+/// Create the standard converging machine scheduler. This will be used as the
+/// default scheduler if the target does not set a default.
+/// Adds default DAG mutations.
+template <typename Strategy = GenericScheduler>
+ScheduleDAGMILive *createGenericSchedLive(MachineSchedContext *C) {
+  ScheduleDAGMILive *DAG =
+      new ScheduleDAGMILive(C, std::make_unique<Strategy>(C));
+  // Register DAG post-processors.
+  //
+  // FIXME: extend the mutation API to allow earlier mutations to instantiate
+  // data and pass it to later mutations. Have a single mutation that gathers
+  // the interesting nodes in one pass.
+  DAG->addMutation(createCopyConstrainDAGMutation(DAG->TII, DAG->TRI));
+
+  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
+  // Add MacroFusion mutation if fusions are not empty.
+  const auto &MacroFusions = STI.getMacroFusions();
+  if (!MacroFusions.empty())
+    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
+  return DAG;
+}
+
+/// Create a generic scheduler with no vreg liveness or DAG mutation passes.
+template <typename Strategy = PostGenericScheduler>
+ScheduleDAGMI *createGenericSchedPostRA(MachineSchedContext *C) {
+  ScheduleDAGMI *DAG = new ScheduleDAGMI(C, std::make_unique<Strategy>(C),
+                                         /*RemoveKillFlags=*/true);
+  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
+  // Add MacroFusion mutation if fusions are not empty.
+  const auto &MacroFusions = STI.getMacroFusions();
+  if (!MacroFusions.empty())
+    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
+  return DAG;
+}
+
 } // end namespace llvm
 
 #endif // LLVM_CODEGEN_MACHINESCHEDULER_H
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index cf72f74380835..ac792ad4d5484 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -2701,7 +2701,7 @@ void SchedBoundary::bumpNode(SUnit *SU) {
   unsigned NextCycle = CurrCycle;
   switch (SchedModel->getMicroOpBufferSize()) {
   case 0:
-    assert(ReadyCycle <= CurrCycle && "Broken PendingQueue");
+    // assert(ReadyCycle <= CurrCycle && "Broken PendingQueue");
     break;
   case 1:
     if (ReadyCycle > NextCycle) {
@@ -3847,26 +3847,6 @@ void GenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   }
 }
 
-/// Create the standard converging machine scheduler. This will be used as the
-/// default scheduler if the target does not set a default.
-ScheduleDAGMILive *llvm::createGenericSchedLive(MachineSchedContext *C) {
-  ScheduleDAGMILive *DAG =
-      new ScheduleDAGMILive(C, std::make_unique<GenericScheduler>(C));
-  // Register DAG post-processors.
-  //
-  // FIXME: extend the mutation API to allow earlier mutations to instantiate
-  // data and pass it to later mutations. Have a single mutation that gathers
-  // the interesting nodes in one pass.
-  DAG->addMutation(createCopyConstrainDAGMutation(DAG->TII, DAG->TRI));
-
-  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
-  // Add MacroFusion mutation if fusions are not empty.
-  const auto &MacroFusions = STI.getMacroFusions();
-  if (!MacroFusions.empty())
-    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
-  return DAG;
-}
-
 static ScheduleDAGInstrs *createConvergingSched(MachineSchedContext *C) {
   return createGenericSchedLive(C);
 }
@@ -4139,18 +4119,6 @@ void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
   }
 }
 
-ScheduleDAGMI *llvm::createGenericSchedPostRA(MachineSchedContext *C) {
-  ScheduleDAGMI *DAG =
-      new ScheduleDAGMI(C, std::make_unique<PostGenericScheduler>(C),
-                        /*RemoveKillFlags=*/true);
-  const TargetSubtargetInfo &STI = C->MF->getSubtarget();
-  // Add MacroFusion mutation if fusions are not empty.
-  const auto &MacroFusions = STI.getMacroFusions();
-  if (!MacroFusions.empty())
-    DAG->addMutation(createMacroFusionDAGMutation(MacroFusions));
-  return DAG;
-}
-
 //===----------------------------------------------------------------------===//
 // ILP Scheduler. Currently for experimental analysis of heuristics.
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/RISCV/CMakeLists.txt b/llvm/lib/Target/RISCV/CMakeLists.txt
index 8715403f3839a..fe3f213b253f7 100644
--- a/llvm/lib/Target/RISCV/CMakeLists.txt
+++ b/llvm/lib/Target/RISCV/CMakeLists.txt
@@ -44,6 +44,7 @@ add_llvm_target(RISCVCodeGen
   RISCVISelDAGToDAG.cpp
   RISCVISelLowering.cpp
   RISCVMachineFunctionInfo.cpp
+  RISCVMachineScheduler.cpp
   RISCVMergeBaseOffset.cpp
   RISCVOptWInstrs.cpp
   RISCVPostRAExpandPseudoInsts.cpp
diff --git a/llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp b/llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp
new file mode 100644
index 0000000000000..d993d840c3d3a
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp
@@ -0,0 +1,83 @@
+//===- RISCVMachineScheduler.cpp - MI Scheduler for RISC-V ----------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "RISCVMachineScheduler.h"
+#include "MCTargetDesc/RISCVBaseInfo.h"
+#include "MCTargetDesc/RISCVMCTargetDesc.h"
+#include "RISCVInstrInfo.h"
+#include "RISCVSubtarget.h"
+#include "llvm/CodeGen/MachineOperand.h"
+#include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/CodeGen/ScheduleDAG.h"
+#include "llvm/MC/MCInstrDesc.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/TargetParser/RISCVTargetParser.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "riscv-prera-sched-strategy"
+
+static cl::opt<bool> EnableScheduleSameVType(
+    "riscv-enable-schedule-same-vtype", cl::init(false), cl::Hidden,
+    cl::desc("Enable scheduling RVV instructions with same vtype first"));
+
+SUnit *RISCVPreRAMachineSchedStrategy::pickNode(bool &IsTopNode) {
+  if (EnableScheduleSameVType) {
+    for (SUnit *SU : Bot.Available) {
+      MachineInstr *MI = SU->getInstr();
+      const MCInstrDesc &Desc = MI->getDesc();
+      if (RISCVII::hasSEWOp(Desc.TSFlags)) {
+        unsigned CurVSEW = MI->getOperand(RISCVII::getSEWOpNum(Desc)).getImm();
+        RISCVII::VLMUL CurVLMUL = RISCVII::getLMul(Desc.TSFlags);
+        if (CurVSEW == PrevVSEW && CurVLMUL == PrevVLMUL) {
+          Bot.removeReady(SU);
+          IsTopNode = true;
+          return SU;
+        }
+      }
+    }
+    for (SUnit *SU : Bot.Pending) {
+      MachineInstr *MI = SU->getInstr();
+      const MCInstrDesc &Desc = MI->getDesc();
+      if (RISCVII::hasSEWOp(Desc.TSFlags)) {
+        unsigned CurVSEW = MI->getOperand(RISCVII::getSEWOpNum(Desc)).getImm();
+        RISCVII::VLMUL CurVLMUL = RISCVII::getLMul(Desc.TSFlags);
+        if (CurVSEW == PrevVSEW && CurVLMUL == PrevVLMUL) {
+          Bot.removeReady(SU);
+          IsTopNode = false;
+          return SU;
+        }
+      }
+    }
+  }
+  return GenericScheduler::pickNode(IsTopNode);
+}
+
+bool RISCVPreRAMachineSchedStrategy::tryCandidate(SchedCandidate &Cand,
+                                                  SchedCandidate &TryCand,
+                                                  SchedBoundary *Zone) const {
+  bool OriginalResult = GenericScheduler::tryCandidate(Cand, TryCand, Zone);
+
+  return OriginalResult;
+}
+
+void RISCVPreRAMachineSchedStrategy::schedNode(SUnit *SU, bool IsTopNode) {
+  GenericScheduler::schedNode(SU, IsTopNode);
+  MachineInstr *MI = SU->getInstr();
+  const MCInstrDesc &Desc = MI->getDesc();
+  if (RISCVII::hasSEWOp(Desc.TSFlags)) {
+    PrevVSEW = MI->getOperand(RISCVII::getSEWOpNum(Desc)).getImm();
+    PrevVLMUL = RISCVII::getLMul(Desc.TSFlags);
+  }
+  LLVM_DEBUG(dbgs() << "Previous scheduled Unit: ";
+             dbgs() << "SU(" << SU->NodeNum << ") - "; SU->getInstr()->dump(););
+  LLVM_DEBUG(dbgs() << "Previous VSEW : " << (1 << PrevVSEW) << "\n";
+             auto LMUL = RISCVVType::decodeVLMUL(PrevVLMUL);
+             dbgs() << "Previous VLMUL: m" << (LMUL.second ? "f" : "")
+                    << LMUL.first << "\n";);
+}
diff --git a/llvm/lib/Target/RISCV/RISCVMachineScheduler.h b/llvm/lib/Target/RISCV/RISCVMachineScheduler.h
new file mode 100644
index 0000000000000..bd806cef57dcb
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVMachineScheduler.h
@@ -0,0 +1,42 @@
+//===--- RISCVMachineScheduler.h - Custom RISC-V MI scheduler ---*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Custom RISC-V MI scheduler.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_RISCV_RISCVMACHINESCHEDULER_H
+#define LLVM_LIB_TARGET_RISCV_RISCVMACHINESCHEDULER_H
+
+#include "llvm/CodeGen/MachineScheduler.h"
+#include "llvm/TargetParser/RISCVTargetParser.h"
+
+namespace llvm {
+
+/// A GenericScheduler implementation for RISCV pre RA scheduling.
+class RISCVPreRAMachineSchedStrategy : public GenericScheduler {
+private:
+  RISCVII::VLMUL PrevVLMUL;
+  unsigned PrevVSEW;
+
+public:
+  RISCVPreRAMachineSchedStrategy(const MachineSchedContext *C)
+      : GenericScheduler(C) {}
+
+protected:
+  SUnit *pickNode(bool &IsTopNode) override;
+
+  bool tryCandidate(SchedCandidate &Cand, SchedCandidate &TryCand,
+                    SchedBoundary *Zone) const override;
+
+  void schedNode(SUnit *SU, bool IsTopNode) override;
+};
+
+} // end namespace llvm
+
+#endif
diff --git a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
index 35d0b3408d09f..e0dcbbddc3f53 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
@@ -14,6 +14,7 @@
 #include "MCTargetDesc/RISCVBaseInfo.h"
 #include "RISCV.h"
 #include "RISCVMachineFunctionInfo.h"
+#include "RISCVMachineScheduler.h"
 #include "RISCVTargetObjectFile.h"
 #include "RISCVTargetTransformInfo.h"
 #include "TargetInfo/RISCVTargetInfo.h"
@@ -340,12 +341,11 @@ class RISCVPassConfig : public TargetPassConfig {
 
   ScheduleDAGInstrs *
   createMachineScheduler(MachineSchedContext *C) const override {
-    ScheduleDAGMILive *DAG = nullptr;
-    if (EnableMISchedLoadClustering) {
-      DAG = createGenericSchedLive(C);
+    ScheduleDAGMILive *DAG =
+        createGenericSchedLive<RISCVPreRAMachineSchedStrategy>(C);
+    if (EnableMISchedLoadClustering)
       DAG->addMutation(createLoadClusterDAGMutation(
           DAG->TII, DAG->TRI, /*ReorderWhileClustering=*/true));
-    }
     return DAG;
   }
 
diff --git a/llvm/test/CodeGen/RISCV/rvv/schedule.ll b/llvm/test/CodeGen/RISCV/rvv/schedule.ll
new file mode 100644
index 0000000000000..baf15ef400df5
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/schedule.ll
@@ -0,0 +1,49 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=riscv64 -mcpu=sifive-x280 -verify-machineinstrs < %s \
+; RUN:   | FileCheck %s --check-prefix=DEFAULT
+; RUN: llc -mtriple=riscv64 -mcpu=sifive-x280 -riscv-enable-schedule-same-vtype -verify-machineinstrs < %s \
+; RUN:   | FileCheck %s --check-prefix=SAME-VTYPE-FIRST
+
+define <vscale x 1 x i64> @test(<vscale x 1 x i64> %v64_0, <vscale x 1 x i64> %v64_1, <vscale x 1 x i32> %v32_0, <vscale x 1 x i32> %v32_1) {
+; DEFAULT-LABEL: test:
+; DEFAULT:       # %bb.0: # %entry
+; DEFAULT-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
+; DEFAULT-NEXT:    vdiv.vv v12, v8, v9
+; DEFAULT-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; DEFAULT-NEXT:    vdiv.vv v13, v10, v11
+; DEFAULT-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
+; DEFAULT-NEXT:    vadd.vv v8, v8, v9
+; DEFAULT-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; DEFAULT-NEXT:    vadd.vv v9, v10, v11
+; DEFAULT-NEXT:    vsetvli zero, zero, e64, m1, ta, ma
+; DEFAULT-NEXT:    vadd.vv v8, v8, v12
+; DEFAULT-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; DEFAULT-NEXT:    vadd.vv v9, v9, v13
+; DEFAULT-NEXT:    vwadd.wv v8, v8, v9
+; DEFAULT-NEXT:    ret
+;
+; SAME-VTYPE-FIRST-LABEL: test:
+; SAME-VTYPE-FIRST:       # %bb.0: # %entry
+; SAME-VTYPE-FIRST-NEXT:    vsetvli a0, zero, e64, m1, ta, ma
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v12, v8, v9
+; SAME-VTYPE-FIRST-NEXT:    vdiv.vv v8, v8, v9
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v8, v12, v8
+; SAME-VTYPE-FIRST-NEXT:    vsetvli zero, zero, e32, mf2, ta, ma
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v9, v10, v11
+; SAME-VTYPE-FIRST-NEXT:    vdiv.vv v10, v10, v11
+; SAME-VTYPE-FIRST-NEXT:    vadd.vv v9, v9, v10
+; SAME-VTYPE-FIRST-NEXT:    vwadd.wv v8, v8, v9
+; SAME-VTYPE-FIRST-NEXT:    ret
+entry:
+  %0 = add <vscale x 1 x i64> %v64_0, %v64_1
+  %1 = add <vscale x 1 x i32> %v32_0, %v32_1
+  %2 = sdiv <vscale x 1 x i64> %v64_0, %v64_1
+  %3 = sdiv <vscale x 1 x i32> %v32_0, %v32_1
+  %4 = add <vscale x 1 x i64> %0, %2
+  %5 = add <vscale x 1 x i32> %1, %3
+
+  %6 = sext <vscale x 1 x i32> %5 to <vscale x 1 x i64>
+  %7 = add <vscale x 1 x i64> %4, %6
+  ret <vscale x 1 x i64> %7
+}
+

michaelmaitland · 2024-06-18T13:37:43Z

llvm/test/CodeGen/RISCV/rvv/schedule.ll

+; SAME-VTYPE-FIRST-NEXT: vadd.vv v12, v8, v9
+; SAME-VTYPE-FIRST-NEXT: vdiv.vv v8, v8, v9
+; SAME-VTYPE-FIRST-NEXT: vadd.vv v8, v12, v8
+; SAME-VTYPE-FIRST-NEXT: vsetvli zero, zero, e32, mf2, ta, ma


This is a pretty cool idea. Do you know how this impacts performance on a benchmark like SPEC?

michaelmaitland

This is a pretty neat idea.

There are a few things we want to balance:

1. Reduce instruction count by reducing the number of vtype toggles
2. Reduce the number of stalls due to latency (dependent result not ready)
3. Reduce the number of stalls due to resource consumption (resources not available)

I am curious how we will be able to balance the three of these. In the current state of this patch, we are prioritizing (1) and falling back to GenericScheduler::pickNode(IsTopNode) to handle (2) and (3) only in the cases when we don't have the ability to do (1). It is unclear to me whether (1) should be so important that we ignore (2) and (3).

It would be nice to have some data on how the current proposed approach impacts performance of benchmarks. I'd also be curious to explore balancing heuristic (1) with (2) and (3) to see how that impacts performance.
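To make concern (1) concrete, the number of vtype toggles implied by an instruction order can be counted directly. The following is a minimal, LLVM-free sketch: `VType` and `countVTypeToggles` are hypothetical names used only for illustration, and the model assumes one vsetvli is needed whenever the (SEW, LMUL) pair differs from that of the previous vector instruction. It ignores VL and the tail/mask policy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the (SEW, LMUL) part of vtype.
struct VType {
  unsigned SEW;     // element width in bits, e.g. 64
  std::string LMUL; // register group multiplier, e.g. "m1" or "mf2"
  bool operator==(const VType &O) const {
    return SEW == O.SEW && LMUL == O.LMUL;
  }
};

// Counts the vsetvli instructions an order needs under the simplifying
// assumption that every change of vtype (and the first vector instruction)
// costs exactly one toggle.
std::size_t countVTypeToggles(const std::vector<VType> &Order) {
  std::size_t Toggles = 0;
  const VType *Prev = nullptr;
  for (const VType &VT : Order) {
    if (!Prev || !(*Prev == VT))
      ++Toggles;
    Prev = &VT;
  }
  return Toggles;
}
```

Applied to the test in schedule.ll, the DEFAULT order alternates between e64/m1 and e32/mf2 and yields six toggles, while the SAME-VTYPE-FIRST order groups the three e64/m1 instructions before the e32/mf2 ones and yields two, matching the vsetvli counts in the two check blocks.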

llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp

preames · 2024-06-18T14:46:39Z

Interesting prototype!

@michaelmaitland already responded with a good summary of the concerns, so let me just second him.

My default would be to assume that the vtype toggles are pretty cheap, and that we should purely be using (1) to tie break when (2) and (3) don't order scheduling, but I'll freely admit I don't have any strong data on this. I'd encourage you to run a few workloads, and see what you get with different heuristics.
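The tie-break ordering suggested here can be sketched as a lexicographic comparison. This is only an illustration under stated assumptions: `Candidate` and `isBetterCandidate` are hypothetical, the two stall fields stand in for concerns (2) and (3), and the real `GenericScheduler::tryCandidate` applies more heuristics, in a different order, than this model shows.

```cpp
#include <tuple>

// Hypothetical summary of a scheduling candidate.
struct Candidate {
  unsigned LatencyStall;  // (2) cycles stalled waiting for operands
  unsigned ResourceStall; // (3) cycles stalled waiting for resources
  bool SameVTypeAsPrev;   // (1) true if no vtype toggle would be needed
};

// Returns true if Try should be preferred over Cand. Stalls are compared
// first; the vtype flag only decides when both stall counts are equal.
bool isBetterCandidate(const Candidate &Try, const Candidate &Cand) {
  // Lower is better for each tuple element, so the vtype flag is negated.
  return std::make_tuple(Try.LatencyStall, Try.ResourceStall,
                         !Try.SameVTypeAsPrev) <
         std::make_tuple(Cand.LatencyStall, Cand.ResourceStall,
                         !Cand.SameVTypeAsPrev);
}
```

With this ordering a candidate that avoids a toggle never wins if it stalls longer than the alternative, which is the opposite of the priority used in the current prototype.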

michaelmaitland · 2024-06-20T03:01:29Z

llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp

+ if (SUnit *SU = FindPotentialRVVInstruction(Top, true))
+ return SU;
+ } else {
+ if (SUnit *SU =


In the GenericScheduler, we tend not to pick from the Pending queues. It is usually better to move the node to Available and keep pick functions to take from Available. Otherwise, there are two cases we run into:

HazardRecognizers try and keep nodes on the Pending queue and this code here will ignore that. It will be really hard to keep the intended functionality of HazardRecognizers if we pick from Pending.

A node is Pending because it will lead to stalls according to the scheduler model. Picking from it ignores the scheduler model.

Do we really need to pick from Pending here?

No, I think we shouldn't. I did this just because I wanted a quick prototype 😄.
I will make it reasonable later.

I gave some more thought on this. I think we can reuse GenericScheduler::pickNode and we should instead be overriding pickNodeFromQueue. pickNodeFromQueue is where the real picking of the node happens. GenericScheduler::pickNode just gets the correct queue based on direction and passes it to pickNodeFromQueue. Something like this:

SUnit *RISCVPreRAMachineSchedStrategy::pickNodeFromQueue(
    SchedBoundary &Zone, const CandPolicy &ZonePolicy,
    const RegPressureTracker &RPTracker, SchedCandidate &Cand) {
  SchedCandidate RVVCand = FindRVVCandidate(Zone);
  GenericScheduler::pickNodeFromQueue(Zone, ZonePolicy, RPTracker, Cand);
  // Pass SchedBoundary only when comparing nodes from the same boundary.
  SchedBoundary *ZoneArg = Cand.AtTop == RVVCand.AtTop ? &Zone : nullptr;
  // TODO: we need to add our own heuristics here or inside an overridden
  // tryCandidate to make sure that we balance clustering RVV with same vtype
  // with the existing heuristics such as register pressure, latency, resource
  // usage, etc.
  if (tryCandidate(RVVCand, Cand, ZoneArg))
    return RVVCand.SU;
  return Cand.SU;
}

michaelmaitland · 2024-06-20T03:05:30Z

llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp

+ if (SUnit *SU =
+ FindPotentialRVVInstructionInQueue(Top, Top.Pending, true))
+ return SU;
+ }


In the bidirectional case, do you need to call tryCandidate to compare whether the Top or Bot candidate is better?

Yeah, good point! Will do it in next revision!

wangpc-pp · 2024-06-20T03:06:21Z

We need auto-vectorization to generate RVV instructions, so I just tested the runtime of TSVC on a K230 board.
Options (use -mtune=sifive-p670 here just for vector scheduling):

before: clang -O2 -march=rv64gcv_zba_zbb_zbs_zbc -mtune=sifive-p670
after: clang -O2 -march=rv64gcv_zba_zbb_zbs_zbc -mtune=sifive-p670 -mllvm -riscv-enable-schedule-same-vtype

name	before	after	after/before
s2102	73.541	30.665	0.416978
s252	7.994	5.858	0.7328
s114	57.398	50.594	0.881459
s1111	31.729	28.35	0.893504
s4116	13.206	11.828	0.895653
s1421	22.705	20.353	0.89641
s4115	16.842	15.114	0.897399
s471	6.411	5.818	0.907503
s3251	36.247	33.349	0.920049
s422	49.604	45.715	0.921599
s4112	14.28	13.164	0.921849
s152	14.872	13.712	0.922001
s1115	70.36	65.187	0.926478
s128	16.384	15.225	0.92926
s141	55.395	51.796	0.93503
s2710	10.834	10.207	0.942127
s424	23.916	22.763	0.95179
s1213	27.269	25.964	0.952143
s221	16.418	15.638	0.952491
s122	7.898	7.552	0.956191
s4114	16.081	15.385	0.956719
s1119	9.899	9.48	0.957672
s119	10.922	10.521	0.963285
s276	67.157	64.79	0.964754
s491	15.247	14.725	0.965764
s4113	18.565	17.954	0.967089
s323	27.369	26.769	0.978077
s118	36.679	35.953	0.980207
s279	13.361	13.124	0.982262
s353	19.059	18.731	0.98279
s256	22.725	22.37	0.984378
s222	18.044	17.763	0.984427
s243	14.326	14.125	0.98597
s115	34.492	34.13	0.989505
s1281	55.798	55.236	0.989928
s318	14.222	14.121	0.992898
s241	45.126	44.829	0.993418
s274	14.573	14.482	0.993756
s1251	55.382	55.073	0.994421
s111	11.033	10.981	0.995287
s351	54.015	53.796	0.995946
vif	4.387	4.37	0.996125
s292	9.011	8.978	0.996338
s258	0.293	0.292	0.996587
s3110	17.651	17.594	0.996771
s116	64.726	64.517	0.996771
s321	11.351	11.32	0.997269
s242	8.126	8.104	0.997293
s442	11.146	11.118	0.997488
s453	14.043	14.012	0.997792
s3112	7.48	7.464	0.997861
s291	16.383	16.348	0.997864
s162	7.632	7.616	0.997904
s124	9.485	9.469	0.998313
s251	30.493	30.448	0.998524
s1221	4.267	4.261	0.998594
s341	25.323	25.292	0.998776
s313	36.626	36.582	0.998799
vdotr	73.265	73.178	0.998813
s481	28.464	28.441	0.999192
s123	17.599	17.586	0.999261
s482	27.494	27.481	0.999527
s1113	25.105	25.094	0.999562
vtvtv	29.63	29.619	0.999629
vag	22.7	22.692	0.999648
s2111	52.88	52.865	0.999716
s322	10.58	10.577	0.999716
s332	14.874	14.87	0.999731
s3111	7.481	7.479	0.999733
s352	85.591	85.574	0.999801
vsumr	62.344	62.334	0.99984
s1161	28.72	28.716	0.999861
s132	18.589	18.587	0.999892
s317	22.739	22.737	0.999912
s316	74.607	74.601	0.99992
s000	6.71	6.71	1
s113	15.401	15.401	1
s331	12.723	12.723	1
s314	84.298	84.299	1.000012
s277	22.83	22.835	1.000219
vpvpv	29.128	29.135	1.00024
s312	83.805	83.826	1.000251
s3113	92.538	92.565	1.000292
s443	20.051	20.057	1.000299
s311	62.329	62.351	1.000353
s4121	7.283	7.286	1.000412
s271	29.64	29.657	1.000574
s342	29.462	29.485	1.000781
s1279	11.357	11.366	1.000792
s261	20.939	20.964	1.001194
vpvtv	29.442	29.482	1.001359
s2244	7.904	7.915	1.001392
s2711	29.644	29.686	1.001417
s452	29.449	29.494	1.001528
s112	27.384	27.45	1.00241
s173	25.803	25.872	1.002674
s315	21.168	21.241	1.003449
s1351	40.024	40.163	1.003473
s244	21.168	21.243	1.003543
s232	4.743	4.767	1.00506
s278	16.781	16.872	1.005423
s2712	30.811	31.016	1.006653
s174	25.666	25.859	1.00752
vbor	2.419	2.438	1.007854
s161	22.161	22.336	1.007897
s212	18.311	18.493	1.009939
s121	18.121	18.305	1.010154
s151	30.192	30.499	1.010168
vpvts	5.893	5.953	1.010182
s127	10.654	10.764	1.010325
s175	6.044	6.107	1.010424
s131	30.186	30.502	1.010468
s176	5.869	5.933	1.010905
s171	5.608	5.673	1.011591
s431	56.242	56.915	1.011966
vtv	56.171	56.866	1.012373
vpv	56.18	56.909	1.012976
s2251	33.527	33.963	1.013004
s172	5.61	5.685	1.013369
s273	14.475	14.671	1.013541
s423	28.373	28.785	1.014521
s2101	18.673	18.945	1.014566
s319	120.721	122.631	1.015822
vas	20.567	20.901	1.01624
s293	4.169	4.25	1.019429
s2233	251.383	256.287	1.019508
s451	17.201	17.579	1.021975
s126	22.997	23.542	1.023699
s257	42.741	43.761	1.023865
s235	459.815	470.903	1.024114
s231	232.829	238.524	1.02446
s441	15.573	15.96	1.024851
s281	28.693	29.479	1.027393
s1232	237.11	244.2	1.029902
s233	662.192	684.512	1.033706
s253	11.363	11.769	1.03573
s125	7.407	7.688	1.037937
s211	26.624	27.68	1.039663
s13110	17.562	18.347	1.044699
s275	38.142	39.928	1.046825
va	24.023	25.205	1.049203
s421	25.037	26.426	1.055478
s1244	40.337	42.873	1.06287
s343	23.314	24.934	1.069486
s1112	14.517	15.864	1.092788
s272	11.442	12.529	1.095001
s2275	721.855	799.062	1.106956
s254	21.049	24.103	1.14509
s31111	12.804	16.004	1.249922
s255	7.586	9.541	1.257712
s4117	8.046	10.265	1.275789
Geomean			0.993474

We can see some improvements and some regressions as well. In total, we don't have much gain here (about 0.65%).
The result is highly implementation-specific, and it may not be so convincing.
I will do more benchmarking and improve the heuristics.
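For reference, the Geomean row is the geometric mean of the per-kernel after/before ratios. A minimal sketch of that computation follows; `geometricMean` is a hypothetical helper, not part of the benchmark harness, and it assumes all ratios are positive.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of positive ratios, computed in log space so that long
// products neither overflow nor underflow.
double geometricMean(const std::vector<double> &Ratios) {
  if (Ratios.empty())
    return 1.0; // convention chosen here: no data means "no change"
  double LogSum = 0.0;
  for (double R : Ratios)
    LogSum += std::log(R);
  return std::exp(LogSum / static_cast<double>(Ratios.size()));
}
```

A geomean of 0.993474 corresponds to 1 - 0.993474, or about 0.65% less runtime overall, which is where the figure quoted above comes from. Because the mean is multiplicative, one large win such as s2102 can offset many small regressions.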

BeMg · 2024-06-20T06:29:36Z

Besides latency and instruction count, register pressure also needs to be considered. IIRC GenericScheduler::tryCandidate should be responsible for it.

michaelmaitland · 2024-06-20T15:42:44Z

~~What do you think about the following idea:~~
~~1. RISCVMachineScheduler does RISCVMachineScheduler::pickNodeFromQueue, and the only job is to group RVV instructions according to same vtype~~
~~2. Run RISCVVSETVLIInsertion~~
~~3. Run MachineScheduler, whose job is to put the instructions in a good order for register allocation, also taking into account latencies and processor resources~~
~~4. Run register allocation~~
~~5. If the subtarget enables PostMachineScheduler, run it.~~

This approach would keep the RISCVMachineScheduler simple, since it could ignore register pressure, latencies, and processor resource usage. By running the normal MachineScheduler after VSETVLI insertion, the hope is that we have more freedom in scheduling, since fewer vsetvli instructions means fewer instruction dependencies and therefore more scheduler freedom. At this point, we are accounting for register pressure, latencies, and processor resources.

EDIT: I forgot RISCVVSETVLIInsertion is after RA, so you can ignore this idea. You probably need to balance grouping vtypes, latencies, register pressure, and resource usage at the same time; otherwise a separate-pass approach will undo changes made in the first pass.

BeMg · 2024-06-21T03:11:20Z

We could use mutation to constrain the same group of vtype instructions instead of the vsetvli insertion to create barriers between instructions. Based on my experience, this approach still disrupts some patterns in step 3. At best, it eliminates some vsetvli instructions; at worst, it introduces additional spills and reloads. This doesn't seem ideal.

wangpc-pp · 2024-06-21T08:12:51Z

I have thought about the mutation approach before, but I haven't tried it. I think it can be another feasible approach. Do you have a prototype that can be evaluated?

BeMg · 2024-06-22T05:51:09Z

Sure. I have two prototypes to share.

BeMg@a5c4dfb Use a mutation to create cluster dependencies between instructions with the same vsetvl configuration
BeMg@3e63dc1 Override tryCandidate to make the machine scheduler aware of vsetvl.

Both reuse VSETVLIInfo from the vsetvli insertion pass to check whether two instructions have the same configuration.

The second one is closer to the approach in this patch, but it modifies tryCandidate to change the heuristic. It is a way to implement tie-breaking among vtypes, latencies, register pressure, and resource usage.

I think these PoCs still have room for improvement, but they are good enough to be evaluated and compared. (For example, the mutation could use weak dependencies instead of clusters, and the custom scheduling strategy could change the priority of the vsetvli-aware heuristic.)

If you find anything useful in these prototypes, feel free to integrate it into this patch.
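The mutation idea can be illustrated without LLVM as a pass over the nodes that links each instruction to the previous one with the same vtype. This is a sketch under stated assumptions, not the code in the linked prototypes: `VTypeKey` and `buildSameVTypeClusterEdges` are hypothetical, nodes are identified by their index in program order, and a real DAG mutation would also have to check that each added edge does not create a cycle.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical vtype key: (SEW in bits, LMUL encoding).
using VTypeKey = std::pair<unsigned, int>;
using Edge = std::pair<std::size_t, std::size_t>; // (pred index, succ index)

// For every node, emits an edge from the most recent earlier node that has
// the same vtype, forming one chain per distinct vtype.
std::vector<Edge>
buildSameVTypeClusterEdges(const std::vector<VTypeKey> &Nodes) {
  std::map<VTypeKey, std::size_t> LastSeen;
  std::vector<Edge> Edges;
  for (std::size_t I = 0; I < Nodes.size(); ++I) {
    auto It = LastSeen.find(Nodes[I]);
    if (It != LastSeen.end()) {
      Edges.emplace_back(It->second, I);
      It->second = I; // extend the chain from this node next time
    } else {
      LastSeen.emplace(Nodes[I], I);
    }
  }
  return Edges;
}
```

Whether such edges are treated as hard cluster constraints or as weak hints is the design choice mentioned above; the sketch only shows how the groups are formed.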

Mutation SPEC2k17 data

Benchmark	Before Spills	After Spills	Spills Diff	Before Reloads	After Reloads	Reloads Diff	Before vsetvl	After vsetvl	vsetvl diff
500.perlbench_r	4307	4307	0	10658	10658	0	529	530	-1
502.gcc_r	13239	13242	-3	29697	29701	-4	2857	2793	64
505.mcf_r	119	119	0	330	330	0	38	38	0
520.omnetpp_r	796	904	-108	1552	1855	-303	1005	1016	-11
523.xalancbmk_r	1835	1835	0	2563	2563	0	2863	2854	9
525.x264_r	4131	4135	-4	8033	8051	-18	3035	2908	127
531.deepsjeng_r	328	343	-15	677	695	-18	313	328	-15
541.leela_r	365	365	0	591	591	0	142	133	9
548.exchange2_r	0	0	0	0	0	0	0	0	0
557.xz_r	347	362	-15	697	711	-14	252	259	-7

Custom Sched SPEC2k17 data

Benchmark	Before Spills	After Spills	Spills Diff	Before Reloads	After Reloads	Reloads Diff	Before vsetvl	After vsetvl	vsetvl diff
500.perlbench_r	4307	4307	0	10658	10658	0	529	529	0
502.gcc_r	13239	13236	3	29697	29694	3	2857	2843	14
505.mcf_r	119	119	0	330	330	0	38	38	0
520.omnetpp_r	796	796	0	1552	1552	0	1005	1003	2
523.xalancbmk_r	1835	1838	-3	2563	2557	6	2863	2851	12
525.x264_r	4131	4133	-2	8033	8037	-4	3035	2950	85
531.deepsjeng_r	328	328	0	677	677	0	313	312	1
541.leela_r	365	365	0	591	591	0	142	142	0
548.exchange2_r	0	0	0	0	0	0	0	0	0
557.xz_r	347	347	0	697	697	0	252	253	-1

wangpc-pp · 2024-08-26T08:17:11Z

@BeMg Hi Piyou, I don't have much time to push this PR further. If you are interested and have time, feel free to continue the work you have done before, thanks! :-)

wangpc-pp · 2025-12-02T08:58:52Z

I'd like to push this forward. I don't know whether @BeMg has made any progress here.

I want to divide this into several parts:

Separate the computation of vtype info in RISCVInsertVSETVLI.cpp into a reusable standalone file.
Add a RISC-V-specific scheduler (may do nothing).
Add the new vtype-based scheduling heuristic.

cc @BeMg @lukel97 @topperc @preames @mshockwave

github-actions · 2025-12-02T09:26:49Z

🐧 Linux x64 Test Results

166592 tests passed
2891 tests skipped
1 test failed

Failed Tests

LLVM

LLVM.CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll (Likely Already Failing)

This test is already failing at the base commit.

Exit Code: 1

Command Output (stdout):

    --
    # RUN: at line 2
    /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200 < /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll | /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/FileCheck -check-prefix=GISEL-GFX12 /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
    # executed command: /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/llc -global-isel=1 -new-reg-bank-select -mtriple=amdgcn-amd-amdpal -mcpu=gfx1200
    # note: command had no output on stdout or stderr
    # executed command: /home/gha/actions-runner/_work/llvm-project/llvm-project/build/bin/FileCheck -check-prefix=GISEL-GFX12 /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
    # .---command stderr------------
    # | /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll:109:21: error: GISEL-GFX12-NEXT: expected string not found in input
    # | ; GISEL-GFX12-NEXT: s_wait_alu 0xfffe
    # | ^
    # | <stdin>:103:17: note: scanning from here
    # | s_alloc_vgpr 64
    # | ^
    # | <stdin>:104:2: note: possible intended match here
    # | s_wait_alu depctr_sa_sdst(0)
    # | ^
    # |
    # | Input file: <stdin>
    # | Check file: /home/gha/actions-runner/_work/llvm-project/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn-cs-chain-intrinsic-dyn-vgpr-w32.ll
    # |
    # | -dump-input=help explains the following input dump.
    # |
    # | Input was:
    # | <<<<<<
    # | .
    # | .
    # | .
    # | 98: s_wait_kmcnt 0x0
    # | 99: s_mov_b32 s30, callee_high_sgpr@abs32@lo
    # | 100: s_mov_b32 s31, callee_high_sgpr@abs32@hi
    # | 101: s_mov_b32 s34, retry_vgpr_alloc@abs32@lo
    # | 102: s_mov_b32 s35, retry_vgpr_alloc@abs32@hi
    # | 103: s_alloc_vgpr 64
    # | next:109'0 X error: no match found
    # | 104: s_wait_alu depctr_sa_sdst(0)
    # | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # | next:109'1 ? possible intended match
    # | 105: s_cselect_b64 s[30:31], s[30:31], s[34:35]
    # | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # | 106: s_cselect_b32 exec_lo, 7, -1
    # | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # | 107: s_wait_alu depctr_sa_sdst(0)
    # | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    # | 108: s_setpc_b64 s[30:31]
    # | next:109'0 ~~~~~~~~~~~~~~~~~~~~~~
    # | 109: .Lfunc_end2:
    # | next:109'0 ~~~~~~~~~~~~~
    # | .
    # | .
    # | .
    # | >>>>>>
    # `-----------------------------
    # error: command failed with exit status: 1
    --

If these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues and add the infrastructure label.

[𝘀𝗽𝗿] initial version

5ac4ff3

llvmbot added the backend:RISC-V label Jun 18, 2024

wangpc-pp requested review from asb and lukel97 and removed request for asb June 18, 2024 13:33

wangpc-pp requested review from asb, michaelmaitland, preames and topperc June 18, 2024 13:34

michaelmaitland reviewed Jun 18, 2024

llvm/lib/Target/RISCV/RISCVMachineScheduler.cpp (outdated)

kito-cheng requested a review from BeMg June 18, 2024 14:52

Support buttomup/topdown/bidirectional and fix some failures

185e0f8

michaelmaitland reviewed Jun 20, 2024

wangpc-pp mentioned this pull request Jun 5, 2025

[RISC-V V] Idea: Interleave independent op-chains by vsetvli category #142814

Open

Rebase and rework

efeaf9e

wangpc-pp changed the title ~~[RISCV][PoC] Schedule RVV instructions with same type first~~ [RISCV] Schedule RVV instructions with compatible type first Dec 2, 2025

Revert llvm/lib/CodeGen/MachineScheduler.cpp change

b86faa8


6 participants
