[X86] Lower minimum/maximum/minimumnum/maximumnum using bitwise operations #170069
base: main
Conversation
@llvm/pr-subscribers-backend-x86

Author: None (valadaptive)

Changes: see the PR description below.

Patch is 293.52 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/170069.diff

6 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp index 1b0bf6823e390..b6db3706c5d3a 100644 --- a/llvm/lib/Target/X86/X86ISelLowering.cpp +++ b/llvm/lib/Target/X86/X86ISelLowering.cpp @@ -29480,7 +29480,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, uint64_t SizeInBits = VT.getScalarSizeInBits(); APInt PreferredZero = APInt::getZero(SizeInBits); APInt OppositeZero = PreferredZero; - EVT IVT = VT.changeTypeToInteger(); X86ISD::NodeType MinMaxOp; if (IsMaxOp) { MinMaxOp = X86ISD::FMAX; @@ -29492,8 +29491,8 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, EVT SetCCType = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT); - // The tables below show the expected result of Max in cases of NaN and - // signed zeros. + // The tables below show the expected result of Max in cases of NaN and signed + // zeros. // // Y Y // Num xNaN +0 -0 @@ -29503,12 +29502,9 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, // xNaN | X | X/Y | -0 | +0 | -0 | // --------------- --------------- // - // It is achieved by means of FMAX/FMIN with preliminary checks and operand - // reordering. - // - // We check if any of operands is NaN and return NaN. Then we check if any of - // operands is zero or negative zero (for fmaximum and fminimum respectively) - // to ensure the correct zero is returned. + // It is achieved by means of FMAX/FMIN with preliminary checks, operand + // reordering if one operand is a constant, and bitwise operations and selects + // to handle signed zero and NaN operands otherwise. auto MatchesZero = [](SDValue Op, APInt Zero) { Op = peekThroughBitcasts(Op); if (auto *CstOp = dyn_cast<ConstantFPSDNode>(Op)) @@ -29539,15 +29535,17 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, Op->getFlags().hasNoSignedZeros() || DAG.isKnownNeverZeroFloat(X) || DAG.isKnownNeverZeroFloat(Y); - SDValue NewX, NewY; + bool ShouldHandleZeros = true; + SDValue NewX = X; + SDValue NewY = Y; if (IgnoreSignedZero || MatchesZero(Y, PreferredZero) || MatchesZero(X, OppositeZero)) { // Operands are already in right order or order does not matter. 
- NewX = X; - NewY = Y; + ShouldHandleZeros = false; } else if (MatchesZero(X, PreferredZero) || MatchesZero(Y, OppositeZero)) { NewX = Y; NewY = X; + ShouldHandleZeros = false; } else if (!VT.isVector() && (VT == MVT::f16 || Subtarget.hasDQI()) && (Op->getFlags().hasNoNaNs() || IsXNeverNaN || IsYNeverNaN)) { if (IsXNeverNaN) @@ -29569,33 +29567,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, NewX = DAG.getSelect(DL, VT, NeedSwap, Y, X); NewY = DAG.getSelect(DL, VT, NeedSwap, X, Y); return DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags()); - } else { - SDValue IsXSigned; - if (Subtarget.is64Bit() || VT != MVT::f64) { - SDValue XInt = DAG.getNode(ISD::BITCAST, DL, IVT, X); - SDValue ZeroCst = DAG.getConstant(0, DL, IVT); - IsXSigned = DAG.getSetCC(DL, SetCCType, XInt, ZeroCst, ISD::SETLT); - } else { - assert(VT == MVT::f64); - SDValue Ins = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, MVT::v2f64, - DAG.getConstantFP(0, DL, MVT::v2f64), X, - DAG.getVectorIdxConstant(0, DL)); - SDValue VX = DAG.getNode(ISD::BITCAST, DL, MVT::v4f32, Ins); - SDValue Hi = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, VX, - DAG.getVectorIdxConstant(1, DL)); - Hi = DAG.getBitcast(MVT::i32, Hi); - SDValue ZeroCst = DAG.getConstant(0, DL, MVT::i32); - EVT SetCCType = TLI.getSetCCResultType(DAG.getDataLayout(), - *DAG.getContext(), MVT::i32); - IsXSigned = DAG.getSetCC(DL, SetCCType, Hi, ZeroCst, ISD::SETLT); - } - if (MinMaxOp == X86ISD::FMAX) { - NewX = DAG.getSelect(DL, VT, IsXSigned, X, Y); - NewY = DAG.getSelect(DL, VT, IsXSigned, Y, X); - } else { - NewX = DAG.getSelect(DL, VT, IsXSigned, Y, X); - NewY = DAG.getSelect(DL, VT, IsXSigned, X, Y); - } } bool IgnoreNaN = DAG.getTarget().Options.NoNaNsFPMath || @@ -29612,10 +29583,80 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, SDValue MinMax = DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags()); + // We handle signed-zero ordering by taking the larger (or smaller) sign bit. + if (ShouldHandleZeros) { + const fltSemantics &Sem = VT.getFltSemantics(); + unsigned EltBits = VT.getScalarSizeInBits(); + bool IsFakeVector = !VT.isVector(); + MVT LogicVT = VT.getSimpleVT(); + if (IsFakeVector) + LogicVT = (VT == MVT::f64) ? MVT::v2f64 + : (VT == MVT::f32) ? MVT::v4f32 + : MVT::v8f16; + + // We take the sign bit from the first operand and combine it with the + // output sign bit (see below). Right now, if ShouldHandleZeros is true, the + // operands will never have been swapped. If you add another optimization + // that swaps the input operands if one is a known value, make sure this + // logic stays correct! + SDValue LogicX = NewX; + SDValue LogicMinMax = MinMax; + if (IsFakeVector) { + // Promote scalars to vectors for bitwise operations. + LogicX = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, NewX); + LogicMinMax = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, MinMax); + } + + // x86's min/max operations return the second operand if both inputs are + // signed zero. For the maximum operation, we want to "and" the sign bit of + // the output with the sign bit of the first operand--that means that if the + // first operand is +0.0, the output will be too. For the minimum, it's the + // opposite: we "or" the output sign bit with the sign bit of the first + // operand, ensuring that if the first operand is -0.0, the output will be + // too. + SDValue Result; + if (IsMaxOp) { + // getSignedMaxValue returns a bit pattern of all ones but the highest + // bit. 
We "or" that with the first operand, then "and" that with the max + // operation's result. That clears only the sign bit, and only if the + // first operand is positive. + SDValue OrMask = DAG.getConstantFP( + APFloat(Sem, APInt::getSignedMaxValue(EltBits)), DL, LogicVT); + SDValue MaskedSignBit = + DAG.getNode(X86ISD::FOR, DL, LogicVT, LogicX, OrMask); + Result = + DAG.getNode(X86ISD::FAND, DL, LogicVT, MaskedSignBit, LogicMinMax); + } else { + // Likewise, getSignedMinValue returns a bit pattern with only the highest + // bit set. This one *sets* only the sign bit, and only if the first + // operand is *negative*. + SDValue AndMask = DAG.getConstantFP( + APFloat(Sem, APInt::getSignMask(EltBits)), DL, LogicVT); + SDValue MaskedSignBit = + DAG.getNode(X86ISD::FAND, DL, LogicVT, LogicX, AndMask); + Result = + DAG.getNode(X86ISD::FOR, DL, LogicVT, MaskedSignBit, LogicMinMax); + } + + // Extract scalar back from vector. + if (IsFakeVector) + MinMax = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Result, + DAG.getVectorIdxConstant(0, DL)); + else + MinMax = Result; + } + if (IgnoreNaN || DAG.isKnownNeverNaN(IsNum ? NewY : NewX)) return MinMax; - SDValue NaNSrc = IsNum ? MinMax : NewX; + // The x86 min/max return the second operand if either is NaN, which doesn't + // match the numeric or non-numeric semantics. For the non-numeric versions, + // we want to return NaN if either operand is NaN. To do that, we check if + // NewX (the first operand) is NaN, and select it if so. For the numeric + // versions, we want to return the non-NaN operand if there is one. So we + // check if NewY (the second operand) is NaN, and again select the first + // operand if so. + SDValue NaNSrc = IsNum ? NewY : NewX; SDValue IsNaN = DAG.getSetCC(DL, SetCCType, NaNSrc, NaNSrc, ISD::SETUO); return DAG.getSelect(DL, VT, IsNaN, NewX, MinMax); diff --git a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll index 53ac283170a5f..59cf38f82b7c0 100644 --- a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll +++ b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll @@ -13,14 +13,9 @@ declare <32 x half> @llvm.maximum.v32f16(<32 x half>, <32 x half>) define half @test_fminimum(half %x, half %y) { ; CHECK-LABEL: test_fminimum: ; CHECK: # %bb.0: -; CHECK-NEXT: vmovw %xmm0, %eax -; CHECK-NEXT: testw %ax, %ax -; CHECK-NEXT: sets %al -; CHECK-NEXT: kmovd %eax, %k1 -; CHECK-NEXT: vmovaps %xmm1, %xmm2 -; CHECK-NEXT: vmovsh %xmm0, %xmm0, %xmm2 {%k1} -; CHECK-NEXT: vmovsh %xmm1, %xmm0, %xmm0 {%k1} -; CHECK-NEXT: vminsh %xmm2, %xmm0, %xmm1 +; CHECK-NEXT: vminsh %xmm1, %xmm0, %xmm2 +; CHECK-NEXT: vpbroadcastw {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; CHECK-NEXT: vpternlogq {{.*#+}} xmm1 = (xmm1 & xmm0) | xmm2 ; CHECK-NEXT: vcmpunordsh %xmm0, %xmm0, %k1 ; CHECK-NEXT: vmovsh %xmm0, %xmm0, %xmm1 {%k1} ; CHECK-NEXT: vmovaps %xmm1, %xmm0 @@ -92,16 +87,12 @@ define half @test_fminimum_combine_cmps(half %x, half %y) { define half @test_fmaximum(half %x, half %y) { ; CHECK-LABEL: test_fmaximum: ; CHECK: # %bb.0: -; CHECK-NEXT: vmovw %xmm0, %eax -; CHECK-NEXT: testw %ax, %ax -; CHECK-NEXT: sets %al -; CHECK-NEXT: kmovd %eax, %k1 -; CHECK-NEXT: vmovaps %xmm0, %xmm2 -; CHECK-NEXT: vmovsh %xmm1, %xmm0, %xmm2 {%k1} +; CHECK-NEXT: vmaxsh %xmm1, %xmm0, %xmm2 +; CHECK-NEXT: vpbroadcastw {{.*#+}} xmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN] +; CHECK-NEXT: vpternlogq {{.*#+}} xmm1 = xmm2 & (xmm1 | xmm0) +; CHECK-NEXT: vcmpunordsh %xmm0, %xmm0, %k1 ; 
CHECK-NEXT: vmovsh %xmm0, %xmm0, %xmm1 {%k1} -; CHECK-NEXT: vmaxsh %xmm2, %xmm1, %xmm0 -; CHECK-NEXT: vcmpunordsh %xmm1, %xmm1, %k1 -; CHECK-NEXT: vmovsh %xmm1, %xmm0, %xmm0 {%k1} +; CHECK-NEXT: vmovaps %xmm1, %xmm0 ; CHECK-NEXT: retq %r = call half @llvm.maximum.f16(half %x, half %y) ret half %r @@ -196,10 +187,9 @@ define <16 x half> @test_fmaximum_v16f16_nans(<16 x half> %x, <16 x half> %y) "n define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) "no-nans-fp-math"="true" { ; CHECK-LABEL: test_fminimum_v32f16_szero: ; CHECK: # %bb.0: -; CHECK-NEXT: vpmovw2m %zmm0, %k1 -; CHECK-NEXT: vpblendmw %zmm0, %zmm1, %zmm2 {%k1} -; CHECK-NEXT: vmovdqu16 %zmm1, %zmm0 {%k1} -; CHECK-NEXT: vminph %zmm2, %zmm0, %zmm0 +; CHECK-NEXT: vminph %zmm1, %zmm0, %zmm1 +; CHECK-NEXT: vpbroadcastw {{.*#+}} zmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; CHECK-NEXT: vpternlogq {{.*#+}} zmm0 = (zmm0 & zmm2) | zmm1 ; CHECK-NEXT: retq %r = call <32 x half> @llvm.minimum.v32f16(<32 x half> %x, <32 x half> %y) ret <32 x half> %r @@ -208,12 +198,12 @@ define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) " define <32 x half> @test_fmaximum_v32f16_nans_szero(<32 x half> %x, <32 x half> %y) { ; CHECK-LABEL: test_fmaximum_v32f16_nans_szero: ; CHECK: # %bb.0: -; CHECK-NEXT: vpmovw2m %zmm0, %k1 -; CHECK-NEXT: vpblendmw %zmm1, %zmm0, %zmm2 {%k1} +; CHECK-NEXT: vmaxph %zmm1, %zmm0, %zmm2 +; CHECK-NEXT: vpbroadcastw {{.*#+}} zmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN] +; CHECK-NEXT: vpternlogq {{.*#+}} zmm1 = zmm2 & (zmm1 | zmm0) +; CHECK-NEXT: vcmpunordph %zmm0, %zmm0, %k1 ; CHECK-NEXT: vmovdqu16 %zmm0, %zmm1 {%k1} -; CHECK-NEXT: vmaxph %zmm2, %zmm1, %zmm0 -; CHECK-NEXT: vcmpunordph %zmm1, %zmm1, %k1 -; CHECK-NEXT: vmovdqu16 %zmm1, %zmm0 {%k1} +; CHECK-NEXT: vmovdqa64 %zmm1, %zmm0 ; CHECK-NEXT: retq %r = call <32 x half> @llvm.maximum.v32f16(<32 x half> %x, <32 x half> %y) ret <32 x half> %r diff --git a/llvm/test/CodeGen/X86/extractelement-fp.ll b/llvm/test/CodeGen/X86/extractelement-fp.ll index 1706f17eac165..9ce1fd6f20976 100644 --- a/llvm/test/CodeGen/X86/extractelement-fp.ll +++ b/llvm/test/CodeGen/X86/extractelement-fp.ll @@ -675,37 +675,23 @@ define double @fminnum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { ; X64-LABEL: fmaximum_v4f32: ; X64: # %bb.0: -; X64-NEXT: vmovd %xmm0, %eax -; X64-NEXT: testl %eax, %eax -; X64-NEXT: js .LBB30_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: jmp .LBB30_3 -; X64-NEXT: .LBB30_1: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: vmovdqa %xmm0, %xmm1 -; X64-NEXT: .LBB30_3: -; X64-NEXT: vmaxss %xmm2, %xmm1, %xmm0 -; X64-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2 -; X64-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 +; X64-NEXT: vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN] +; X64-NEXT: vorps %xmm2, %xmm0, %xmm2 +; X64-NEXT: vmaxss %xmm1, %xmm0, %xmm1 +; X64-NEXT: vandps %xmm1, %xmm2, %xmm1 +; X64-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 +; X64-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: retq ; ; X86-LABEL: fmaximum_v4f32: ; X86: # %bb.0: -; X86-NEXT: vmovd %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB30_1 -; 
X86-NEXT: # %bb.2: -; X86-NEXT: vmovdqa %xmm0, %xmm2 -; X86-NEXT: jmp .LBB30_3 -; X86-NEXT: .LBB30_1: -; X86-NEXT: vmovdqa %xmm1, %xmm2 -; X86-NEXT: vmovdqa %xmm0, %xmm1 -; X86-NEXT: .LBB30_3: ; X86-NEXT: pushl %eax -; X86-NEXT: vmaxss %xmm2, %xmm1, %xmm0 -; X86-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2 -; X86-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 +; X86-NEXT: vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN] +; X86-NEXT: vorps %xmm2, %xmm0, %xmm2 +; X86-NEXT: vmaxss %xmm1, %xmm0, %xmm1 +; X86-NEXT: vandps %xmm1, %xmm2, %xmm1 +; X86-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 +; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovss %xmm0, (%esp) ; X86-NEXT: flds (%esp) ; X86-NEXT: popl %eax @@ -718,41 +704,25 @@ define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { ; X64-LABEL: fmaximum_v4f64: ; X64: # %bb.0: -; X64-NEXT: vmovq %xmm0, %rax -; X64-NEXT: testq %rax, %rax -; X64-NEXT: js .LBB31_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: jmp .LBB31_3 -; X64-NEXT: .LBB31_1: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: vmovdqa %xmm0, %xmm1 -; X64-NEXT: .LBB31_3: -; X64-NEXT: vmaxsd %xmm2, %xmm1, %xmm0 -; X64-NEXT: vcmpunordsd %xmm1, %xmm1, %xmm2 -; X64-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 +; X64-NEXT: vmaxsd %xmm1, %xmm0, %xmm1 +; X64-NEXT: vorpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2 +; X64-NEXT: vandpd %xmm1, %xmm2, %xmm1 +; X64-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 +; X64-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: vzeroupper ; X64-NEXT: retq ; ; X86-LABEL: fmaximum_v4f64: ; X86: # %bb.0: -; X86-NEXT: vextractps $1, %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB31_1 -; X86-NEXT: # %bb.2: -; X86-NEXT: vmovapd %xmm0, %xmm2 -; X86-NEXT: jmp .LBB31_3 -; X86-NEXT: .LBB31_1: -; X86-NEXT: vmovapd %xmm1, %xmm2 -; X86-NEXT: vmovapd %xmm0, %xmm1 -; X86-NEXT: .LBB31_3: ; X86-NEXT: pushl %ebp ; X86-NEXT: movl %esp, %ebp ; X86-NEXT: andl $-8, %esp ; X86-NEXT: subl $8, %esp -; X86-NEXT: vmaxsd %xmm2, %xmm1, %xmm0 -; X86-NEXT: vcmpunordsd %xmm1, %xmm1, %xmm2 -; X86-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 +; X86-NEXT: vmaxsd %xmm1, %xmm0, %xmm1 +; X86-NEXT: vorpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2 +; X86-NEXT: vandpd %xmm1, %xmm2, %xmm1 +; X86-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 +; X86-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovlpd %xmm0, (%esp) ; X86-NEXT: fldl (%esp) ; X86-NEXT: movl %ebp, %esp @@ -767,35 +737,21 @@ define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { ; X64-LABEL: fminimum_v4f32: ; X64: # %bb.0: -; X64-NEXT: vmovd %xmm0, %eax -; X64-NEXT: testl %eax, %eax -; X64-NEXT: js .LBB32_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: jmp .LBB32_3 -; X64-NEXT: .LBB32_1: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: vmovdqa %xmm1, %xmm0 -; X64-NEXT: .LBB32_3: -; X64-NEXT: vminss %xmm2, %xmm0, %xmm1 +; X64-NEXT: vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; X64-NEXT: vandps %xmm2, %xmm0, %xmm2 +; X64-NEXT: vminss %xmm1, %xmm0, %xmm1 +; X64-NEXT: vorps %xmm1, %xmm2, %xmm1 ; X64-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 ; X64-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: retq ; ; X86-LABEL: fminimum_v4f32: ; X86: # %bb.0: -; X86-NEXT: vmovd %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB32_1 -; X86-NEXT: # %bb.2: -; X86-NEXT: vmovdqa %xmm1, %xmm2 -; X86-NEXT: 
jmp .LBB32_3 -; X86-NEXT: .LBB32_1: -; X86-NEXT: vmovdqa %xmm0, %xmm2 -; X86-NEXT: vmovdqa %xmm1, %xmm0 -; X86-NEXT: .LBB32_3: ; X86-NEXT: pushl %eax -; X86-NEXT: vminss %xmm2, %xmm0, %xmm1 +; X86-NEXT: vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; X86-NEXT: vandps %xmm2, %xmm0, %xmm2 +; X86-NEXT: vminss %xmm1, %xmm0, %xmm1 +; X86-NEXT: vorps %xmm1, %xmm2, %xmm1 ; X86-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 ; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovss %xmm0, (%esp) @@ -810,17 +766,9 @@ define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { ; X64-LABEL: fminimum_v4f64: ; X64: # %bb.0: -; X64-NEXT: vmovq %xmm0, %rax -; X64-NEXT: testq %rax, %rax -; X64-NEXT: js .LBB33_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: jmp .LBB33_3 -; X64-NEXT: .LBB33_1: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: vmovdqa %xmm1, %xmm0 -; X64-NEXT: .LBB33_3: -; X64-NEXT: vminsd %xmm2, %xmm0, %xmm1 +; X64-NEXT: vminsd %xmm1, %xmm0, %xmm1 +; X64-NEXT: vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2 +; X64-NEXT: vorpd %xmm1, %xmm2, %xmm1 ; X64-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 ; X64-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: vzeroupper @@ -828,21 +776,13 @@ define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { ; ; X86-LABEL: fminimum_v4f64: ; X86: # %bb.0: -; X86-NEXT: vextractps $1, %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB33_1 -; X86-NEXT: # %bb.2: -; X86-NEXT: vmovapd %xmm1, %xmm2 -; X86-NEXT: jmp .LBB33_3 -; X86-NEXT: .LBB33_1: -; X86-NEXT: vmovapd %xmm0, %xmm2 -; X86-NEXT: vmovapd %xmm1, %xmm0 -; X86-NEXT: .LBB33_3: ; X86-NEXT: pushl %ebp ; X86-NEXT: movl %esp, %ebp ; X86-NEXT: andl $-8, %esp ; X86-NEXT: subl $8, %esp -; X86-NEXT: vminsd %xmm2, %xmm0, %xmm1 +; X86-NEXT: vminsd %xmm1, %xmm0, %xmm1 +; X86-NEXT: vandpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2 +; X86-NEXT: vorpd %xmm1, %xmm2, %xmm1 ; X86-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 ; X86-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovlpd %xmm0, (%esp) diff --git a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/fminimum-fmaximum.ll index 06515e4f82687..910acd6d82ae2 100644 --- a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll +++ b/llvm/test/C... [truncated] |
3061feb to 5dfeb5c (force-pushed)

; SSE2-NEXT: maxss %xmm1, %xmm0
; SSE2-NEXT: andnps %xmm0, %xmm2
; SSE2-NEXT: orps %xmm3, %xmm2
; SSE2-NEXT: movaps %xmm2, %xmm0
One more instruction here.
I think this is a result of changing the NaN handling to check the source operand for NaN in minimumnum and maximumnum instead of the result operand. There's an extra movaps because we can't reuse a register; I'm not a microarchitecture expert, but I suspect it's "free" due to register renaming.
; X86-NEXT: vmaxss {{[0-9]+}}(%esp), %xmm0, %xmm1
; X86-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2
; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0
; X86-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
ditto.
I think the same answer as above applies here.
@@ -1782,4 +1228,5 @@ declare double @llvm.vector.reduce.fmaximum.v4f64(<4 x double>)
declare double @llvm.vector.reduce.fmaximum.v8f64(<8 x double>)
declare double @llvm.vector.reduce.fmaximum.v16f64(<16 x double>)
;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
; AVX512: {{.*}}
Remove unused prefix.
As the comment above says, this list of prefixes is autogenerated. Is there some incantation I can use to have it not generate this list?
Yes, remove it from the RUN lines.
There are some other issues with the RUN lines; the SSE prefix is also unused and there's a warning about conflicting output for the two AVX lines (I think the +avx2 one was meant to use the AVX2 prefix.) Should I fix those in a separate PR?
Yes, that would be great.
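(For context: the "unused prefix" note is emitted by update_llc_test_checks.py whenever a prefix passed to FileCheck never gets its own check lines. Dropping the prefix from the RUN line's `--check-prefixes` list — e.g., hypothetically, changing `--check-prefixes=CHECK,AVX512` to `--check-prefixes=CHECK` — and re-running the script makes the note and the stray `; AVX512: {{.*}}` line go away.)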
; CHECK-NEXT: vmaxph %zmm1, %zmm0, %zmm2
; CHECK-NEXT: vpbroadcastw {{.*#+}} zmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
; CHECK-NEXT: vpternlogq {{.*#+}} zmm1 = zmm2 & (zmm1 | zmm0)
The old code seems better since there's no memory load there.
llvm-mca seems to think the new code is faster despite the memory load, at least in a tight loop. You can try it yourself with:
# LLVM-MCA-BEGIN old
vpmovw2m %zmm0, %k1
vpblendmw %zmm0, %zmm1, %zmm2 {%k1}
vmovdqu16 %zmm1, %zmm0 {%k1}
vminph %zmm2, %zmm0, %zmm0
# LLVM-MCA-END

# LLVM-MCA-BEGIN new
vminph %zmm1, %zmm0, %zmm1
vpbroadcastw (%rdi), %zmm2
vpternlogq $248, %zmm2, %zmm0, %zmm1
# LLVM-MCA-END

Here's what I get with -mcpu=sapphirerapids:
[0] Code Region - old

Iterations:        100
Instructions:      400
Total Cycles:      1103
Total uOps:        400

Dispatch Width:    6
uOps Per Cycle:    0.36
IPC:               0.36
Block RThroughput: 2.0

[snip]

[1] Code Region - new

Iterations:        100
Instructions:      300
Total Cycles:      607
Total uOps:        400

Dispatch Width:    6
uOps Per Cycle:    0.66
IPC:               0.49
Block RThroughput: 1.0
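(The numbers above come from feeding the two regions to llvm-mca, along the lines of `llvm-mca -mcpu=sapphirerapids minmax.s`; the file name is just a placeholder, and 100 iterations is llvm-mca's default.)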
phoebewang left a comment
LGTM.
I got somewhat nerd-sniped when looking at a Rust issue and seeing this comment about how various min/max operations are compiled on various architectures. The current `minimum`/`maximum`/`minimumnum`/`maximumnum` code is very branchy because of the signed-zero handling. Even though we emit select operations, LLVM really prefers to lower them to branches, to the point of scalarizing vector code to do so, even if `blendv` is supported. (Should I open a separate issue for that? It seems concerning that LLVM would rather scalarize a vector operation than emit a couple of `blendv` operations in a row.)

It turns out that handling signed-zero operands properly can be done using a couple of bitwise operations, which is branchless and easily vectorizable, by taking advantage of the following properties:

- When both operands are zeros, the sign bit of the maximum is the AND of the two sign bits: the result is +0.0 unless both inputs are -0.0.
- Likewise, the sign bit of the minimum is the OR of the two sign bits: the result is -0.0 if either input is -0.0.
We can further optimize this by taking advantage of the fact that x86's min/max instructions operate like a floating-point compare+select, returning the second operand if both are (positive or negative) zero. Altogether, the operations go as follows:
For taking the minimum:

1. Perform `minps`/`minpd`/etc. on the input operands. This will return the minimum, unless both are zero, in which case it will return the second operand.
2. AND the first operand with the sign-bit mask (the bit pattern of -0.0), keeping only its sign bit.
3. OR that into the min result. If the first operand is -0.0, this forces the result to -0.0 as well; otherwise it changes nothing.

Analogously, for taking the maximum:

1. Perform `maxps`/`maxpd`/etc. on the input operands. This will return the maximum, unless both are zero, in which case it will return the second operand.
2. OR the first operand with a mask of all bits except the sign bit.
3. AND that with the max result. This clears only the sign bit, and only if the first operand is positive, so a +0.0 first operand yields +0.0.

In the case of NaNs, this approach might change the output NaN's sign bit. We don't have to worry about this for a couple of reasons: firstly, LLVM's language reference allows NaNs to have a nondeterministic sign bit; secondly, there's already a step after this that selects one of the input NaNs anyway.
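As a sanity check of the idea (not code from this patch — the patch builds the corresponding X86ISD::FMAX/FAND/FOR nodes in LowerFMINIMUM_FMAXIMUM), here is a minimal scalar C++ sketch, assuming C++20 for `std::bit_cast`; `x86_max` models the `maxss` tie/NaN behavior described above:

```cpp
#include <bit>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Models x86 maxss: if x > y is false (including NaN inputs and the
// both-zero case), the second operand is returned.
float x86_max(float x, float y) { return x > y ? x : y; }

// fmaximum with correct signed-zero handling via the bitwise fix-up.
float bitwise_fmaximum(float x, float y) {
  // All bits of x except the sign bit are forced to one, so the AND below
  // can only clear the result's sign bit, and only when x is positive.
  uint32_t mask = std::bit_cast<uint32_t>(x) | 0x7FFFFFFFu;
  float m = x86_max(x, y); // returns y when both inputs are +/-0.0
  float r = std::bit_cast<float>(std::bit_cast<uint32_t>(m) & mask);
  // Non-"num" NaN semantics: if the first operand is NaN, return it.
  return std::isnan(x) ? x : r;
}

int main() {
  std::printf("%g\n", bitwise_fmaximum(+0.0f, -0.0f)); // 0  (+0.0 wins)
  std::printf("%g\n", bitwise_fmaximum(-0.0f, +0.0f)); // 0  (+0.0 wins)
  std::printf("%g\n", bitwise_fmaximum(-0.0f, -0.0f)); // -0
  std::printf("%g\n", bitwise_fmaximum(-1.0f, 2.0f));  // 2
}
```

The minimum variant is symmetric: AND the first operand with the sign-bit mask (0x80000000) and OR it into the `minss` result.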
Here's an Alive2 proof. It obviously can't verify that the implementation is sound, but shows that at least the theory is.
I believe this approach is faster than even properly-vectorized `blendv` operations because it eliminates a data dependency chain. Furthermore, on AVX-512, the load, AND, and OR can become a single `vpternlogd`. My highly-unrepresentative microbenchmarks (compiled for x86-64-v2, so SSE4.1) say ~7.5%-10% faster than `blendv`, which makes me confident this is at least not a regression.

As another minor optimization (I could split it into a separate PR if you prefer, but I'm significantly reworking the code here anyway), I've changed the NaN selection for the "prefer numeric" operations from `SDValue NaNSrc = IsNum ? MinMax : NewX;` to `SDValue NaNSrc = IsNum ? NewY : NewX;`. This has identical behavior, as detailed in the comment I added, but eliminates the `cmpunord` instruction's data dependency on the min/max operation immediately preceding it (see the small sketch below). This optimization is included in the Alive2 proof.

One thing I didn't change here: if we're comparing scalar f16s, the target supports AVX-512, and we know neither is NaN, there's a different signed-zero-correction path that uses `VFPCLASSS`. I haven't analyzed how it works, but it might be worth just removing that and using the bitwise version in all cases.
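To illustrate the NaN-selection change mentioned above (a sketch only, with the signed-zero fix-up omitted; the helper names are not from the patch): since the x86 max returns its second operand whenever either input is NaN, the max result is NaN exactly when the second operand is, so the unordered compare can test the second operand directly instead of waiting for the max.

```cpp
#include <cmath>

// Models x86 maxss: returns y whenever either input is NaN (or both are zero).
float x86_max(float x, float y) { return x > y ? x : y; }

// Old lowering of maximumnum's NaN handling: the unordered compare has to
// wait for the max result.
float maximumnum_old(float x, float y) {
  float m = x86_max(x, y);
  return std::isnan(m) ? x : m;
}

// New lowering: test the second operand instead. x86_max(x, y) is NaN
// exactly when y is NaN (if only x is NaN, y is returned), so the behavior
// is identical but the compare no longer depends on the max.
float maximumnum_new(float x, float y) {
  float m = x86_max(x, y);
  return std::isnan(y) ? x : m;
}
```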