[X86] Lower minimum/maximum/minimumnum/maximumnum using bitwise operations #170069
base: main
Conversation
@llvm/pr-subscribers-backend-x86

Author: None (valadaptive)

Changes: see the PR description below.

Patch is 293.52 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/170069.diff

6 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp index 1b0bf6823e390..b6db3706c5d3a 100644 --- a/llvm/lib/Target/X86/X86ISelLowering.cpp +++ b/llvm/lib/Target/X86/X86ISelLowering.cpp @@ -29480,7 +29480,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, uint64_t SizeInBits = VT.getScalarSizeInBits(); APInt PreferredZero = APInt::getZero(SizeInBits); APInt OppositeZero = PreferredZero; - EVT IVT = VT.changeTypeToInteger(); X86ISD::NodeType MinMaxOp; if (IsMaxOp) { MinMaxOp = X86ISD::FMAX; @@ -29492,8 +29491,8 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, EVT SetCCType = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT); - // The tables below show the expected result of Max in cases of NaN and - // signed zeros. + // The tables below show the expected result of Max in cases of NaN and signed + // zeros. // // Y Y // Num xNaN +0 -0 @@ -29503,12 +29502,9 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, // xNaN | X | X/Y | -0 | +0 | -0 | // --------------- --------------- // - // It is achieved by means of FMAX/FMIN with preliminary checks and operand - // reordering. - // - // We check if any of operands is NaN and return NaN. Then we check if any of - // operands is zero or negative zero (for fmaximum and fminimum respectively) - // to ensure the correct zero is returned. + // It is achieved by means of FMAX/FMIN with preliminary checks, operand + // reordering if one operand is a constant, and bitwise operations and selects + // to handle signed zero and NaN operands otherwise. auto MatchesZero = [](SDValue Op, APInt Zero) { Op = peekThroughBitcasts(Op); if (auto *CstOp = dyn_cast<ConstantFPSDNode>(Op)) @@ -29539,15 +29535,17 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, Op->getFlags().hasNoSignedZeros() || DAG.isKnownNeverZeroFloat(X) || DAG.isKnownNeverZeroFloat(Y); - SDValue NewX, NewY; + bool ShouldHandleZeros = true; + SDValue NewX = X; + SDValue NewY = Y; if (IgnoreSignedZero || MatchesZero(Y, PreferredZero) || MatchesZero(X, OppositeZero)) { // Operands are already in right order or order does not matter. 
- NewX = X; - NewY = Y; + ShouldHandleZeros = false; } else if (MatchesZero(X, PreferredZero) || MatchesZero(Y, OppositeZero)) { NewX = Y; NewY = X; + ShouldHandleZeros = false; } else if (!VT.isVector() && (VT == MVT::f16 || Subtarget.hasDQI()) && (Op->getFlags().hasNoNaNs() || IsXNeverNaN || IsYNeverNaN)) { if (IsXNeverNaN) @@ -29569,33 +29567,6 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, NewX = DAG.getSelect(DL, VT, NeedSwap, Y, X); NewY = DAG.getSelect(DL, VT, NeedSwap, X, Y); return DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags()); - } else { - SDValue IsXSigned; - if (Subtarget.is64Bit() || VT != MVT::f64) { - SDValue XInt = DAG.getNode(ISD::BITCAST, DL, IVT, X); - SDValue ZeroCst = DAG.getConstant(0, DL, IVT); - IsXSigned = DAG.getSetCC(DL, SetCCType, XInt, ZeroCst, ISD::SETLT); - } else { - assert(VT == MVT::f64); - SDValue Ins = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, MVT::v2f64, - DAG.getConstantFP(0, DL, MVT::v2f64), X, - DAG.getVectorIdxConstant(0, DL)); - SDValue VX = DAG.getNode(ISD::BITCAST, DL, MVT::v4f32, Ins); - SDValue Hi = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, VX, - DAG.getVectorIdxConstant(1, DL)); - Hi = DAG.getBitcast(MVT::i32, Hi); - SDValue ZeroCst = DAG.getConstant(0, DL, MVT::i32); - EVT SetCCType = TLI.getSetCCResultType(DAG.getDataLayout(), - *DAG.getContext(), MVT::i32); - IsXSigned = DAG.getSetCC(DL, SetCCType, Hi, ZeroCst, ISD::SETLT); - } - if (MinMaxOp == X86ISD::FMAX) { - NewX = DAG.getSelect(DL, VT, IsXSigned, X, Y); - NewY = DAG.getSelect(DL, VT, IsXSigned, Y, X); - } else { - NewX = DAG.getSelect(DL, VT, IsXSigned, Y, X); - NewY = DAG.getSelect(DL, VT, IsXSigned, X, Y); - } } bool IgnoreNaN = DAG.getTarget().Options.NoNaNsFPMath || @@ -29612,10 +29583,80 @@ static SDValue LowerFMINIMUM_FMAXIMUM(SDValue Op, const X86Subtarget &Subtarget, SDValue MinMax = DAG.getNode(MinMaxOp, DL, VT, NewX, NewY, Op->getFlags()); + // We handle signed-zero ordering by taking the larger (or smaller) sign bit. + if (ShouldHandleZeros) { + const fltSemantics &Sem = VT.getFltSemantics(); + unsigned EltBits = VT.getScalarSizeInBits(); + bool IsFakeVector = !VT.isVector(); + MVT LogicVT = VT.getSimpleVT(); + if (IsFakeVector) + LogicVT = (VT == MVT::f64) ? MVT::v2f64 + : (VT == MVT::f32) ? MVT::v4f32 + : MVT::v8f16; + + // We take the sign bit from the first operand and combine it with the + // output sign bit (see below). Right now, if ShouldHandleZeros is true, the + // operands will never have been swapped. If you add another optimization + // that swaps the input operands if one is a known value, make sure this + // logic stays correct! + SDValue LogicX = NewX; + SDValue LogicMinMax = MinMax; + if (IsFakeVector) { + // Promote scalars to vectors for bitwise operations. + LogicX = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, NewX); + LogicMinMax = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, LogicVT, MinMax); + } + + // x86's min/max operations return the second operand if both inputs are + // signed zero. For the maximum operation, we want to "and" the sign bit of + // the output with the sign bit of the first operand--that means that if the + // first operand is +0.0, the output will be too. For the minimum, it's the + // opposite: we "or" the output sign bit with the sign bit of the first + // operand, ensuring that if the first operand is -0.0, the output will be + // too. + SDValue Result; + if (IsMaxOp) { + // getSignedMaxValue returns a bit pattern of all ones but the highest + // bit. 
We "or" that with the first operand, then "and" that with the max + // operation's result. That clears only the sign bit, and only if the + // first operand is positive. + SDValue OrMask = DAG.getConstantFP( + APFloat(Sem, APInt::getSignedMaxValue(EltBits)), DL, LogicVT); + SDValue MaskedSignBit = + DAG.getNode(X86ISD::FOR, DL, LogicVT, LogicX, OrMask); + Result = + DAG.getNode(X86ISD::FAND, DL, LogicVT, MaskedSignBit, LogicMinMax); + } else { + // Likewise, getSignedMinValue returns a bit pattern with only the highest + // bit set. This one *sets* only the sign bit, and only if the first + // operand is *negative*. + SDValue AndMask = DAG.getConstantFP( + APFloat(Sem, APInt::getSignMask(EltBits)), DL, LogicVT); + SDValue MaskedSignBit = + DAG.getNode(X86ISD::FAND, DL, LogicVT, LogicX, AndMask); + Result = + DAG.getNode(X86ISD::FOR, DL, LogicVT, MaskedSignBit, LogicMinMax); + } + + // Extract scalar back from vector. + if (IsFakeVector) + MinMax = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT, Result, + DAG.getVectorIdxConstant(0, DL)); + else + MinMax = Result; + } + if (IgnoreNaN || DAG.isKnownNeverNaN(IsNum ? NewY : NewX)) return MinMax; - SDValue NaNSrc = IsNum ? MinMax : NewX; + // The x86 min/max return the second operand if either is NaN, which doesn't + // match the numeric or non-numeric semantics. For the non-numeric versions, + // we want to return NaN if either operand is NaN. To do that, we check if + // NewX (the first operand) is NaN, and select it if so. For the numeric + // versions, we want to return the non-NaN operand if there is one. So we + // check if NewY (the second operand) is NaN, and again select the first + // operand if so. + SDValue NaNSrc = IsNum ? NewY : NewX; SDValue IsNaN = DAG.getSetCC(DL, SetCCType, NaNSrc, NaNSrc, ISD::SETUO); return DAG.getSelect(DL, VT, IsNaN, NewX, MinMax); diff --git a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll index 53ac283170a5f..59cf38f82b7c0 100644 --- a/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll +++ b/llvm/test/CodeGen/X86/avx512fp16-fminimum-fmaximum.ll @@ -13,14 +13,9 @@ declare <32 x half> @llvm.maximum.v32f16(<32 x half>, <32 x half>) define half @test_fminimum(half %x, half %y) { ; CHECK-LABEL: test_fminimum: ; CHECK: # %bb.0: -; CHECK-NEXT: vmovw %xmm0, %eax -; CHECK-NEXT: testw %ax, %ax -; CHECK-NEXT: sets %al -; CHECK-NEXT: kmovd %eax, %k1 -; CHECK-NEXT: vmovaps %xmm1, %xmm2 -; CHECK-NEXT: vmovsh %xmm0, %xmm0, %xmm2 {%k1} -; CHECK-NEXT: vmovsh %xmm1, %xmm0, %xmm0 {%k1} -; CHECK-NEXT: vminsh %xmm2, %xmm0, %xmm1 +; CHECK-NEXT: vminsh %xmm1, %xmm0, %xmm2 +; CHECK-NEXT: vpbroadcastw {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; CHECK-NEXT: vpternlogq {{.*#+}} xmm1 = (xmm1 & xmm0) | xmm2 ; CHECK-NEXT: vcmpunordsh %xmm0, %xmm0, %k1 ; CHECK-NEXT: vmovsh %xmm0, %xmm0, %xmm1 {%k1} ; CHECK-NEXT: vmovaps %xmm1, %xmm0 @@ -92,16 +87,12 @@ define half @test_fminimum_combine_cmps(half %x, half %y) { define half @test_fmaximum(half %x, half %y) { ; CHECK-LABEL: test_fmaximum: ; CHECK: # %bb.0: -; CHECK-NEXT: vmovw %xmm0, %eax -; CHECK-NEXT: testw %ax, %ax -; CHECK-NEXT: sets %al -; CHECK-NEXT: kmovd %eax, %k1 -; CHECK-NEXT: vmovaps %xmm0, %xmm2 -; CHECK-NEXT: vmovsh %xmm1, %xmm0, %xmm2 {%k1} +; CHECK-NEXT: vmaxsh %xmm1, %xmm0, %xmm2 +; CHECK-NEXT: vpbroadcastw {{.*#+}} xmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN] +; CHECK-NEXT: vpternlogq {{.*#+}} xmm1 = xmm2 & (xmm1 | xmm0) +; CHECK-NEXT: vcmpunordsh %xmm0, %xmm0, %k1 ; 
CHECK-NEXT: vmovsh %xmm0, %xmm0, %xmm1 {%k1} -; CHECK-NEXT: vmaxsh %xmm2, %xmm1, %xmm0 -; CHECK-NEXT: vcmpunordsh %xmm1, %xmm1, %k1 -; CHECK-NEXT: vmovsh %xmm1, %xmm0, %xmm0 {%k1} +; CHECK-NEXT: vmovaps %xmm1, %xmm0 ; CHECK-NEXT: retq %r = call half @llvm.maximum.f16(half %x, half %y) ret half %r @@ -196,10 +187,9 @@ define <16 x half> @test_fmaximum_v16f16_nans(<16 x half> %x, <16 x half> %y) "n define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) "no-nans-fp-math"="true" { ; CHECK-LABEL: test_fminimum_v32f16_szero: ; CHECK: # %bb.0: -; CHECK-NEXT: vpmovw2m %zmm0, %k1 -; CHECK-NEXT: vpblendmw %zmm0, %zmm1, %zmm2 {%k1} -; CHECK-NEXT: vmovdqu16 %zmm1, %zmm0 {%k1} -; CHECK-NEXT: vminph %zmm2, %zmm0, %zmm0 +; CHECK-NEXT: vminph %zmm1, %zmm0, %zmm1 +; CHECK-NEXT: vpbroadcastw {{.*#+}} zmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; CHECK-NEXT: vpternlogq {{.*#+}} zmm0 = (zmm0 & zmm2) | zmm1 ; CHECK-NEXT: retq %r = call <32 x half> @llvm.minimum.v32f16(<32 x half> %x, <32 x half> %y) ret <32 x half> %r @@ -208,12 +198,12 @@ define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) " define <32 x half> @test_fmaximum_v32f16_nans_szero(<32 x half> %x, <32 x half> %y) { ; CHECK-LABEL: test_fmaximum_v32f16_nans_szero: ; CHECK: # %bb.0: -; CHECK-NEXT: vpmovw2m %zmm0, %k1 -; CHECK-NEXT: vpblendmw %zmm1, %zmm0, %zmm2 {%k1} +; CHECK-NEXT: vmaxph %zmm1, %zmm0, %zmm2 +; CHECK-NEXT: vpbroadcastw {{.*#+}} zmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN] +; CHECK-NEXT: vpternlogq {{.*#+}} zmm1 = zmm2 & (zmm1 | zmm0) +; CHECK-NEXT: vcmpunordph %zmm0, %zmm0, %k1 ; CHECK-NEXT: vmovdqu16 %zmm0, %zmm1 {%k1} -; CHECK-NEXT: vmaxph %zmm2, %zmm1, %zmm0 -; CHECK-NEXT: vcmpunordph %zmm1, %zmm1, %k1 -; CHECK-NEXT: vmovdqu16 %zmm1, %zmm0 {%k1} +; CHECK-NEXT: vmovdqa64 %zmm1, %zmm0 ; CHECK-NEXT: retq %r = call <32 x half> @llvm.maximum.v32f16(<32 x half> %x, <32 x half> %y) ret <32 x half> %r diff --git a/llvm/test/CodeGen/X86/extractelement-fp.ll b/llvm/test/CodeGen/X86/extractelement-fp.ll index 1706f17eac165..9ce1fd6f20976 100644 --- a/llvm/test/CodeGen/X86/extractelement-fp.ll +++ b/llvm/test/CodeGen/X86/extractelement-fp.ll @@ -675,37 +675,23 @@ define double @fminnum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { ; X64-LABEL: fmaximum_v4f32: ; X64: # %bb.0: -; X64-NEXT: vmovd %xmm0, %eax -; X64-NEXT: testl %eax, %eax -; X64-NEXT: js .LBB30_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: jmp .LBB30_3 -; X64-NEXT: .LBB30_1: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: vmovdqa %xmm0, %xmm1 -; X64-NEXT: .LBB30_3: -; X64-NEXT: vmaxss %xmm2, %xmm1, %xmm0 -; X64-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2 -; X64-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 +; X64-NEXT: vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN] +; X64-NEXT: vorps %xmm2, %xmm0, %xmm2 +; X64-NEXT: vmaxss %xmm1, %xmm0, %xmm1 +; X64-NEXT: vandps %xmm1, %xmm2, %xmm1 +; X64-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 +; X64-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: retq ; ; X86-LABEL: fmaximum_v4f32: ; X86: # %bb.0: -; X86-NEXT: vmovd %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB30_1 -; 
X86-NEXT: # %bb.2: -; X86-NEXT: vmovdqa %xmm0, %xmm2 -; X86-NEXT: jmp .LBB30_3 -; X86-NEXT: .LBB30_1: -; X86-NEXT: vmovdqa %xmm1, %xmm2 -; X86-NEXT: vmovdqa %xmm0, %xmm1 -; X86-NEXT: .LBB30_3: ; X86-NEXT: pushl %eax -; X86-NEXT: vmaxss %xmm2, %xmm1, %xmm0 -; X86-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2 -; X86-NEXT: vblendvps %xmm2, %xmm1, %xmm0, %xmm0 +; X86-NEXT: vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN] +; X86-NEXT: vorps %xmm2, %xmm0, %xmm2 +; X86-NEXT: vmaxss %xmm1, %xmm0, %xmm1 +; X86-NEXT: vandps %xmm1, %xmm2, %xmm1 +; X86-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 +; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovss %xmm0, (%esp) ; X86-NEXT: flds (%esp) ; X86-NEXT: popl %eax @@ -718,41 +704,25 @@ define float @fmaximum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { ; X64-LABEL: fmaximum_v4f64: ; X64: # %bb.0: -; X64-NEXT: vmovq %xmm0, %rax -; X64-NEXT: testq %rax, %rax -; X64-NEXT: js .LBB31_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: jmp .LBB31_3 -; X64-NEXT: .LBB31_1: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: vmovdqa %xmm0, %xmm1 -; X64-NEXT: .LBB31_3: -; X64-NEXT: vmaxsd %xmm2, %xmm1, %xmm0 -; X64-NEXT: vcmpunordsd %xmm1, %xmm1, %xmm2 -; X64-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 +; X64-NEXT: vmaxsd %xmm1, %xmm0, %xmm1 +; X64-NEXT: vorpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2 +; X64-NEXT: vandpd %xmm1, %xmm2, %xmm1 +; X64-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 +; X64-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: vzeroupper ; X64-NEXT: retq ; ; X86-LABEL: fmaximum_v4f64: ; X86: # %bb.0: -; X86-NEXT: vextractps $1, %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB31_1 -; X86-NEXT: # %bb.2: -; X86-NEXT: vmovapd %xmm0, %xmm2 -; X86-NEXT: jmp .LBB31_3 -; X86-NEXT: .LBB31_1: -; X86-NEXT: vmovapd %xmm1, %xmm2 -; X86-NEXT: vmovapd %xmm0, %xmm1 -; X86-NEXT: .LBB31_3: ; X86-NEXT: pushl %ebp ; X86-NEXT: movl %esp, %ebp ; X86-NEXT: andl $-8, %esp ; X86-NEXT: subl $8, %esp -; X86-NEXT: vmaxsd %xmm2, %xmm1, %xmm0 -; X86-NEXT: vcmpunordsd %xmm1, %xmm1, %xmm2 -; X86-NEXT: vblendvpd %xmm2, %xmm1, %xmm0, %xmm0 +; X86-NEXT: vmaxsd %xmm1, %xmm0, %xmm1 +; X86-NEXT: vorpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2 +; X86-NEXT: vandpd %xmm1, %xmm2, %xmm1 +; X86-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 +; X86-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovlpd %xmm0, (%esp) ; X86-NEXT: fldl (%esp) ; X86-NEXT: movl %ebp, %esp @@ -767,35 +737,21 @@ define double @fmaximum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { ; X64-LABEL: fminimum_v4f32: ; X64: # %bb.0: -; X64-NEXT: vmovd %xmm0, %eax -; X64-NEXT: testl %eax, %eax -; X64-NEXT: js .LBB32_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: jmp .LBB32_3 -; X64-NEXT: .LBB32_1: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: vmovdqa %xmm1, %xmm0 -; X64-NEXT: .LBB32_3: -; X64-NEXT: vminss %xmm2, %xmm0, %xmm1 +; X64-NEXT: vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; X64-NEXT: vandps %xmm2, %xmm0, %xmm2 +; X64-NEXT: vminss %xmm1, %xmm0, %xmm1 +; X64-NEXT: vorps %xmm1, %xmm2, %xmm1 ; X64-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 ; X64-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: retq ; ; X86-LABEL: fminimum_v4f32: ; X86: # %bb.0: -; X86-NEXT: vmovd %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB32_1 -; X86-NEXT: # %bb.2: -; X86-NEXT: vmovdqa %xmm1, %xmm2 -; X86-NEXT: 
jmp .LBB32_3 -; X86-NEXT: .LBB32_1: -; X86-NEXT: vmovdqa %xmm0, %xmm2 -; X86-NEXT: vmovdqa %xmm1, %xmm0 -; X86-NEXT: .LBB32_3: ; X86-NEXT: pushl %eax -; X86-NEXT: vminss %xmm2, %xmm0, %xmm1 +; X86-NEXT: vbroadcastss {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0,-0.0E+0,-0.0E+0] +; X86-NEXT: vandps %xmm2, %xmm0, %xmm2 +; X86-NEXT: vminss %xmm1, %xmm0, %xmm1 +; X86-NEXT: vorps %xmm1, %xmm2, %xmm1 ; X86-NEXT: vcmpunordss %xmm0, %xmm0, %xmm2 ; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovss %xmm0, (%esp) @@ -810,17 +766,9 @@ define float @fminimum_v4f32(<4 x float> %x, <4 x float> %y) nounwind { define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { ; X64-LABEL: fminimum_v4f64: ; X64: # %bb.0: -; X64-NEXT: vmovq %xmm0, %rax -; X64-NEXT: testq %rax, %rax -; X64-NEXT: js .LBB33_1 -; X64-NEXT: # %bb.2: -; X64-NEXT: vmovdqa %xmm1, %xmm2 -; X64-NEXT: jmp .LBB33_3 -; X64-NEXT: .LBB33_1: -; X64-NEXT: vmovdqa %xmm0, %xmm2 -; X64-NEXT: vmovdqa %xmm1, %xmm0 -; X64-NEXT: .LBB33_3: -; X64-NEXT: vminsd %xmm2, %xmm0, %xmm1 +; X64-NEXT: vminsd %xmm1, %xmm0, %xmm1 +; X64-NEXT: vandpd {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2 +; X64-NEXT: vorpd %xmm1, %xmm2, %xmm1 ; X64-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 ; X64-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X64-NEXT: vzeroupper @@ -828,21 +776,13 @@ define double @fminimum_v4f64(<4 x double> %x, <4 x double> %y) nounwind { ; ; X86-LABEL: fminimum_v4f64: ; X86: # %bb.0: -; X86-NEXT: vextractps $1, %xmm0, %eax -; X86-NEXT: testl %eax, %eax -; X86-NEXT: js .LBB33_1 -; X86-NEXT: # %bb.2: -; X86-NEXT: vmovapd %xmm1, %xmm2 -; X86-NEXT: jmp .LBB33_3 -; X86-NEXT: .LBB33_1: -; X86-NEXT: vmovapd %xmm0, %xmm2 -; X86-NEXT: vmovapd %xmm1, %xmm0 -; X86-NEXT: .LBB33_3: ; X86-NEXT: pushl %ebp ; X86-NEXT: movl %esp, %ebp ; X86-NEXT: andl $-8, %esp ; X86-NEXT: subl $8, %esp -; X86-NEXT: vminsd %xmm2, %xmm0, %xmm1 +; X86-NEXT: vminsd %xmm1, %xmm0, %xmm1 +; X86-NEXT: vandpd {{\.?LCPI[0-9]+_[0-9]+}}, %xmm0, %xmm2 +; X86-NEXT: vorpd %xmm1, %xmm2, %xmm1 ; X86-NEXT: vcmpunordsd %xmm0, %xmm0, %xmm2 ; X86-NEXT: vblendvpd %xmm2, %xmm0, %xmm1, %xmm0 ; X86-NEXT: vmovlpd %xmm0, (%esp) diff --git a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll b/llvm/test/CodeGen/X86/fminimum-fmaximum.ll index 06515e4f82687..910acd6d82ae2 100644 --- a/llvm/test/CodeGen/X86/fminimum-fmaximum.ll +++ b/llvm/test/C... [truncated] |
3061feb to 5dfeb5c (force-pushed)

; SSE2-NEXT: maxss %xmm1, %xmm0
; SSE2-NEXT: andnps %xmm0, %xmm2
; SSE2-NEXT: orps %xmm3, %xmm2
; SSE2-NEXT: movaps %xmm2, %xmm0
One more instruction here.
I think this is a result of changing the NaN handling to check the source operand for NaN in minimumnum and maximumnum instead of the result operand. There's an extra movaps because we can't reuse a register; I'm not a microarchitecture expert, but I suspect it's "free" due to register renaming.
; X86-NEXT: vmaxss {{[0-9]+}}(%esp), %xmm0, %xmm1
; X86-NEXT: vcmpunordss %xmm1, %xmm1, %xmm2
; X86-NEXT: vblendvps %xmm2, %xmm0, %xmm1, %xmm0
; X86-NEXT: vmovss {{.*#+}} xmm1 = mem[0],zero,zero,zero
ditto.
I think the same answer as above applies here.
@@ -1782,4 +1228,5 @@ declare double @llvm.vector.reduce.fmaximum.v4f64(<4 x double>)
declare double @llvm.vector.reduce.fmaximum.v8f64(<8 x double>)
declare double @llvm.vector.reduce.fmaximum.v16f64(<16 x double>)
;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
; AVX512: {{.*}}
Remove unused prefix.
As the comment above says, this list of prefixes is autogenerated. Is there some incantation I can use to have it not generate this list?
Yes, remove it from the RUN lines.
There are some other issues with the RUN lines; the SSE prefix is also unused and there's a warning about conflicting output for the two AVX lines (I think the +avx2 one was meant to use the AVX2 prefix.) Should I fix those in a separate PR?
Yes, that would be great.
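(For context: the "unused prefix" note is emitted by update_llc_test_checks.py whenever a prefix passed to FileCheck never gets its own check lines. Dropping the prefix from the RUN line's `--check-prefixes` list — e.g., hypothetically, changing `--check-prefixes=CHECK,AVX512` to `--check-prefixes=CHECK` — and re-running the script makes the note and the stray `; AVX512: {{.*}}` line go away.)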
; CHECK-NEXT: vmaxph %zmm1, %zmm0, %zmm2
; CHECK-NEXT: vpbroadcastw {{.*#+}} zmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
; CHECK-NEXT: vpternlogq {{.*#+}} zmm1 = zmm2 & (zmm1 | zmm0)
The old code seems better since there's no memory load there.
llvm-mca seems to think the new code is faster despite the memory load, at least in a tight loop. You can try it yourself with:
# LLVM-MCA-BEGIN old
vpmovw2m %zmm0, %k1
vpblendmw %zmm0, %zmm1, %zmm2 {%k1}
vmovdqu16 %zmm1, %zmm0 {%k1}
vminph %zmm2, %zmm0, %zmm0
# LLVM-MCA-END

# LLVM-MCA-BEGIN new
vminph %zmm1, %zmm0, %zmm1
vpbroadcastw (%rdi), %zmm2
vpternlogq $248, %zmm2, %zmm0, %zmm1
# LLVM-MCA-END

Here's what I get with -mcpu=sapphirerapids:
[0] Code Region - old

Iterations:        100
Instructions:      400
Total Cycles:      1103
Total uOps:        400

Dispatch Width:    6
uOps Per Cycle:    0.36
IPC:               0.36
Block RThroughput: 2.0

[snip]

[1] Code Region - new

Iterations:        100
Instructions:      300
Total Cycles:      607
Total uOps:        400

Dispatch Width:    6
uOps Per Cycle:    0.66
IPC:               0.49
Block RThroughput: 1.0
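(The numbers above come from feeding the two regions to llvm-mca, along the lines of `llvm-mca -mcpu=sapphirerapids minmax.s`; the file name is just a placeholder, and 100 iterations is llvm-mca's default.)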
phoebewang left a comment
LGTM.
I got somewhat nerd-sniped when looking at a Rust issue and seeing this comment about how various min/max operations are compiled on various architectures. The current `minimum`/`maximum`/`minimumnum`/`maximumnum` code is very branchy because of the signed-zero handling. Even though we emit select operations, LLVM really prefers to lower them to branches, to the point of scalarizing vector code to do so, even if `blendv` is supported. (Should I open a separate issue for that? It seems concerning that LLVM would rather scalarize a vector operation than emit a couple of `blendv` operations in a row.)

It turns out that handling signed-zero operands properly can be done using a couple of bitwise operations, which is branchless and easily vectorizable, by taking advantage of the following properties:

- When both operands are zeros, the sign bit of the maximum is the AND of the two sign bits: the result is +0.0 unless both inputs are -0.0.
- Likewise, the sign bit of the minimum is the OR of the two sign bits: the result is -0.0 if either input is -0.0.
We can further optimize this by taking advantage of the fact that x86's min/max instructions operate like a floating-point compare+select, returning the second operand if both are (positive or negative) zero. Altogether, the operations go as follows:
For taking the minimum:

1. Perform `minps`/`minpd`/etc. on the input operands. This will return the minimum, unless both are zero, in which case it will return the second operand.
2. AND the first operand with the sign-bit mask (the bit pattern of -0.0), keeping only its sign bit.
3. OR that into the min result. If the first operand is -0.0, this forces the result to -0.0 as well; otherwise it changes nothing.

Analogously, for taking the maximum:

1. Perform `maxps`/`maxpd`/etc. on the input operands. This will return the maximum, unless both are zero, in which case it will return the second operand.
2. OR the first operand with a mask of all bits except the sign bit.
3. AND that with the max result. This clears only the sign bit, and only if the first operand is positive, so a +0.0 first operand yields +0.0.

In the case of NaNs, this approach might change the output NaN's sign bit. We don't have to worry about this for a couple of reasons: firstly, LLVM's language reference allows NaNs to have a nondeterministic sign bit; secondly, there's already a step after this that selects one of the input NaNs anyway.
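As a sanity check of the idea (not code from this patch — the patch builds the corresponding X86ISD::FMAX/FAND/FOR nodes in LowerFMINIMUM_FMAXIMUM), here is a minimal scalar C++ sketch, assuming C++20 for `std::bit_cast`; `x86_max` models the `maxss` tie/NaN behavior described above:

```cpp
#include <bit>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Models x86 maxss: if x > y is false (including NaN inputs and the
// both-zero case), the second operand is returned.
float x86_max(float x, float y) { return x > y ? x : y; }

// fmaximum with correct signed-zero handling via the bitwise fix-up.
float bitwise_fmaximum(float x, float y) {
  // All bits of x except the sign bit are forced to one, so the AND below
  // can only clear the result's sign bit, and only when x is positive.
  uint32_t mask = std::bit_cast<uint32_t>(x) | 0x7FFFFFFFu;
  float m = x86_max(x, y); // returns y when both inputs are +/-0.0
  float r = std::bit_cast<float>(std::bit_cast<uint32_t>(m) & mask);
  // Non-"num" NaN semantics: if the first operand is NaN, return it.
  return std::isnan(x) ? x : r;
}

int main() {
  std::printf("%g\n", bitwise_fmaximum(+0.0f, -0.0f)); // 0  (+0.0 wins)
  std::printf("%g\n", bitwise_fmaximum(-0.0f, +0.0f)); // 0  (+0.0 wins)
  std::printf("%g\n", bitwise_fmaximum(-0.0f, -0.0f)); // -0
  std::printf("%g\n", bitwise_fmaximum(-1.0f, 2.0f));  // 2
}
```

The minimum variant is symmetric: AND the first operand with the sign-bit mask (0x80000000) and OR it into the `minss` result.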
Here's an Alive2 proof. It obviously can't verify that the implementation is sound, but shows that at least the theory is.
I believe this approach is faster than even properly-vectorized `blendv` operations because it eliminates a data dependency chain. Furthermore, on AVX-512, the load, AND, and OR can become a single `vpternlogd`. My highly-unrepresentative microbenchmarks (compiled for x86-64-v2, so SSE4.1) say ~7.5%-10% faster than `blendv`, which makes me confident this is at least not a regression.

As another minor optimization (I could split it into a separate PR if you prefer, but I'm significantly reworking the code here anyway), I've changed the NaN selection for the "prefer numeric" operations from `SDValue NaNSrc = IsNum ? MinMax : NewX;` to `SDValue NaNSrc = IsNum ? NewY : NewX;`. This has identical behavior, as detailed in the comment I added, but eliminates the `cmpunord` instruction's data dependency on the min/max operation immediately preceding it (see the small sketch below). This optimization is included in the Alive2 proof.

One thing I didn't change here: if we're comparing scalar f16s, the target supports AVX-512, and we know neither is NaN, there's a different signed-zero-correction path that uses `VFPCLASSS`. I haven't analyzed how it works, but it might be worth just removing that and using the bitwise version in all cases.
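To illustrate the NaN-selection change mentioned above (a sketch only, with the signed-zero fix-up omitted; the helper names are not from the patch): since the x86 max returns its second operand whenever either input is NaN, the max result is NaN exactly when the second operand is, so the unordered compare can test the second operand directly instead of waiting for the max.

```cpp
#include <cmath>

// Models x86 maxss: returns y whenever either input is NaN (or both are zero).
float x86_max(float x, float y) { return x > y ? x : y; }

// Old lowering of maximumnum's NaN handling: the unordered compare has to
// wait for the max result.
float maximumnum_old(float x, float y) {
  float m = x86_max(x, y);
  return std::isnan(m) ? x : m;
}

// New lowering: test the second operand instead. x86_max(x, y) is NaN
// exactly when y is NaN (if only x is NaN, y is returned), so the behavior
// is identical but the compare no longer depends on the max.
float maximumnum_new(float x, float y) {
  float m = x86_max(x, y);
  return std::isnan(y) ? x : m;
}
```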