[SelectionDAG] Add TRUNCATE_SSAT_S/U and TRUNCATE_USAT_U to canCreateUndefOrPoison and computeKnownBits (#152143) #168809
Conversation
…ison (llvm#152143)

Saturating truncation operations are well-defined for all inputs and cannot create poison or undef values. This allows the optimizer to eliminate unnecessary freeze instructions after these operations.

Fixes llvm#152143
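A rough scalar model (illustration only, not part of the patch) of the signed-saturating truncate from i64 to i32 shows why the operation is total: the wide value is clamped before it is truncated, so every input maps to a defined result.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of ISD::TRUNCATE_SSAT_S for i64 -> i32 (illustration only):
// clamp into the signed 32-bit range, then truncate. No input leaves the
// result unspecified, so the node by itself cannot introduce undef or poison.
int32_t truncateSsatS64To32(int64_t X) {
  const int64_t Lo = INT32_MIN, Hi = INT32_MAX;
  return static_cast<int32_t>(std::clamp(X, Lo, Hi));
}
```

TRUNCATE_SSAT_U and TRUNCATE_USAT_U follow the same shape, differing only in the clamp bounds (a signed or unsigned source clamped to the unsigned range of the narrow type).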
@llvm/pr-subscribers-backend-aarch64 @llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-llvm-selectiondag

Author: Jerry Dang (kuroyukiasuna)

Changes

Saturating truncation operations are well-defined for all inputs and cannot create poison or undef values. This allows the optimizer to eliminate unnecessary freeze instructions after these operations.

Fixes #152143

Full diff: https://github.com/llvm/llvm-project/pull/168809.diff

2 Files Affected:

```diff
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 1b15a207a2d37..0f0174c8aea35 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -5664,6 +5664,9 @@ bool SelectionDAG::canCreateUndefOrPoison(SDValue Op, const APInt &DemandedElts,
   case ISD::FP_EXTEND:
   case ISD::FP_TO_SINT_SAT:
   case ISD::FP_TO_UINT_SAT:
+  case ISD::TRUNCATE_SSAT_U:
+  case ISD::TRUNCATE_SSAT_S:
+  case ISD::TRUNCATE_USAT_U:
     // No poison except from flags (which is handled above)
     return false;

diff --git a/llvm/test/CodeGen/X86/truncate-sat-freeze.ll b/llvm/test/CodeGen/X86/truncate-sat-freeze.ll
new file mode 100644
index 0000000000000..78aebe05ec1de
--- /dev/null
+++ b/llvm/test/CodeGen/X86/truncate-sat-freeze.ll
@@ -0,0 +1,64 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s
+
+; Test that freeze is eliminated for saturation truncate patterns.
+; The freeze elimination happens at the IR level due to the IntrNoCreateUndefOrPoison
+; attribute on the llvm.smax/smin/umin intrinsics. At the SelectionDAG level,
+; TRUNCATE_SSAT_S/U and TRUNCATE_USAT_U operations are also marked in
+; canCreateUndefOrPoison() to ensure consistency and enable potential future
+; optimizations. This test validates the end-to-end behavior that no freeze
+; instruction appears in the output.
+
+define <2 x i32> @trunc_ssat_s_freeze(<2 x i64> %a0) {
+; CHECK-LABEL: trunc_ssat_s_freeze:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vpbroadcastq {{.*#+}} xmm1 = [18446744071562067968,18446744071562067968]
+; CHECK-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm2
+; CHECK-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
+; CHECK-NEXT:    vpbroadcastq {{.*#+}} xmm1 = [2147483647,2147483647]
+; CHECK-NEXT:    vpcmpgtq %xmm0, %xmm1, %xmm2
+; CHECK-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
+; CHECK-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; CHECK-NEXT:    retq
+  %1 = call <2 x i64> @llvm.smax.v2i64(<2 x i64> %a0, <2 x i64> <i64 -2147483648, i64 -2147483648>)
+  %2 = call <2 x i64> @llvm.smin.v2i64(<2 x i64> %1, <2 x i64> <i64 2147483647, i64 2147483647>)
+  %3 = trunc <2 x i64> %2 to <2 x i32>
+  %4 = freeze <2 x i32> %3
+  ret <2 x i32> %4
+}
+
+define <2 x i32> @trunc_ssat_u_freeze(<2 x i64> %a0) {
+; CHECK-LABEL: trunc_ssat_u_freeze:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vpxor %xmm1, %xmm1, %xmm1
+; CHECK-NEXT:    vpcmpgtq %xmm1, %xmm0, %xmm1
+; CHECK-NEXT:    vpand %xmm0, %xmm1, %xmm0
+; CHECK-NEXT:    vpmovsxbd {{.*#+}} xmm1 = [4294967295,0,4294967295,0]
+; CHECK-NEXT:    vpcmpgtq %xmm0, %xmm1, %xmm2
+; CHECK-NEXT:    vblendvpd %xmm2, %xmm0, %xmm1, %xmm0
+; CHECK-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; CHECK-NEXT:    retq
+  %1 = call <2 x i64> @llvm.smax.v2i64(<2 x i64> %a0, <2 x i64> zeroinitializer)
+  %2 = call <2 x i64> @llvm.smin.v2i64(<2 x i64> %1, <2 x i64> <i64 4294967295, i64 4294967295>)
+  %3 = trunc <2 x i64> %2 to <2 x i32>
+  %4 = freeze <2 x i32> %3
+  ret <2 x i32> %4
+}
+
+define <2 x i32> @trunc_usat_u_freeze(<2 x i64> %a0) {
+; CHECK-LABEL: trunc_usat_u_freeze:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vpxor {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm1
+; CHECK-NEXT:    vpcmpgtq {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; CHECK-NEXT:    vblendvpd %xmm1, {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; CHECK-NEXT:    vshufps {{.*#+}} xmm0 = xmm0[0,2,2,3]
+; CHECK-NEXT:    retq
+  %1 = call <2 x i64> @llvm.umin.v2i64(<2 x i64> %a0, <2 x i64> <i64 4294967295, i64 4294967295>)
+  %2 = trunc <2 x i64> %1 to <2 x i32>
+  %3 = freeze <2 x i32> %2
+  ret <2 x i32> %3
+}
+
+declare <2 x i64> @llvm.smax.v2i64(<2 x i64>, <2 x i64>)
+declare <2 x i64> @llvm.smin.v2i64(<2 x i64>, <2 x i64>)
+declare <2 x i64> @llvm.umin.v2i64(<2 x i64>, <2 x i64>)
```
```llvm
; Test that freeze is eliminated for saturation truncate patterns.
; The freeze elimination happens at the IR level due to the IntrNoCreateUndefOrPoison
; attribute on the llvm.smax/smin/umin intrinsics. At the SelectionDAG level,
; TRUNCATE_SSAT_S/U and TRUNCATE_USAT_U operations are also marked in
; canCreateUndefOrPoison() to ensure consistency and enable potential future
; optimizations. This test validates the end-to-end behavior that no freeze
; instruction appears in the output.
```
As commented, this test doesn't really exercise the code path I added in SelectionDAG::canCreateUndefOrPoison; the IR-level optimization is too aggressive. I'm fairly new to SelectionDAG and cannot think of a better way to write a test, apologies. Any suggestions or ideas on better testing are appreciated.
Forget about x86 - you need to move this to AArch64, which uses these nodes, and preferably use its intrinsics that expand directly to ISD::TRUNCATE_SAT nodes (neon.sqxtn etc.).
Also to actually test this - you've got to demonstrate that the freeze has been removed / moved up through the truncate node. I typically use KnownBits/SignBits value tracking folds that can only occur once the freeze is moved out of the way.
@RKSimon Something I tried:

```llvm
define <4 x i32> @sqxtun_freeze_zext_and(<4 x i32> %a) {
; CHECK-LABEL: sqxtun_freeze_zext_and:
; CHECK:       // %bb.0:
; CHECK-NEXT:    sqxtun v0.4h, v0.4s
; CHECK-NEXT:    ushll v0.4s, v0.4h, #0
; CHECK-NEXT:    ret
  %trunc = tail call <4 x i16> @llvm.aarch64.neon.sqxtun.v4i16(<4 x i32> %a)
  %freeze = freeze <4 x i16> %trunc
  %zext = zext <4 x i16> %freeze to <4 x i32>
  %and = and <4 x i32> %zext, <i32 65535, i32 65535, i32 65535, i32 65535>
  ret <4 x i32> %and
}

declare <4 x i16> @llvm.aarch64.neon.sqxtun.v4i16(<4 x i32>)
```

I verified that the freeze elimination is working:

```
Legalized selection DAG (9 nodes):
  t13: v4i16 = truncate_ssat_u t2
  t5: v4i16 = freeze t13
  t6: v4i32 = zero_extend t5

Optimized legalized selection DAG (8 nodes):
  t13: v4i16 = truncate_ssat_u t2
  t6: v4i32 = zero_extend t13
```

However, the final assembly is the same with or without my change, so the test would always pass. It seems the freeze is eliminated during instruction selection? I need some suggestions on a test pattern where this freeze elimination produces visibly different assembly - or is that the correct path to pursue? Thanks!
I tried several test patterns to show visible codegen differences:
- Comparison with out-of-range values (should fold to constant)
- Redundant min/max operations (should be eliminated)
- KnownBits patterns with AND/zext
None produced different assembly with/without the change, though the DAG optimization does occur.
Does DAG have value tracking handling for truncate sat nodes? I can't remember if I ever raised an issue for that or not.
```
llvm-project % grep -r "TRUNCATE_SSAT\|TRUNCATE_USAT" llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp | grep -i "known\|sign"
llvm-project % grep -A 10 "computeKnownBits" llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp | grep -i truncate
    // Implicitly truncate the bits to match the official semantics of
    // BUILD_VECTOR can implicitly truncate sources, we must handle this.
  case ISD::TRUNCATE: {
```

I searched for value tracking; it seems there are no computeKnownBits or ComputeNumSignBits cases for TRUNCATE_SSAT_S/U or TRUNCATE_USAT_U - only regular TRUNCATE has value tracking implemented.
If we file another issue I can also tackle that in a follow-up PR, or I can have both in this PR. Will need some time (a few days) to ramp up & read the code.
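For reference, a minimal sketch of the kind of transfer function such a case could use, assuming known bits are only forwarded when the clamp provably cannot fire (an assumption for illustration; not necessarily what the eventual patch does):

```cpp
#include "llvm/ADT/APInt.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// Conservative known-bits transfer for an unsigned-saturating truncate
// (illustration only). If the wide source is already known to fit in the
// narrow unsigned range, the saturation is a no-op and the node behaves like
// a plain TRUNCATE; otherwise claim nothing.
static KnownBits knownBitsForTruncUSatU(const KnownBits &SrcKnown,
                                        unsigned ResultBits) {
  KnownBits Known(ResultBits); // nothing known by default
  APInt NarrowMax = APInt::getLowBitsSet(SrcKnown.getBitWidth(), ResultBits);
  if (SrcKnown.getMaxValue().ule(NarrowMax))
    Known = SrcKnown.trunc(ResultBits);
  return Known;
}
```

Wired into the switch in SelectionDAG::computeKnownBits, the same idea extends to TRUNCATE_SSAT_S and TRUNCATE_SSAT_U using the signed min/max of the source's known bits.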
```llvm
;; ============================================================================
;; Tests for canCreateUndefOrPoison = false
;; These verify that freeze operations are correctly eliminated
;; ============================================================================
```
canCreateUndefOrPoison

TRUNCATE_SSAT_S is verified to behave correctly together with computeKnownBits.

But for TRUNCATE_SSAT_U and TRUNCATE_USAT_U, freeze elimination still happens at the IR level via ValueTracking.cpp - should we just move the elimination to the IR level then?
We probably don't want freeze elimination at the IR level - it would be semantically incorrect and inconsistent with other saturation operations, where poison is propagated.
```llvm
  ret i16 %and
}

;; ============================================================================
```
computeKnownBits

TRUNCATE_SSAT_S is verified to behave correctly together with canCreateUndefOrPoison.

TRUNCATE_SSAT_U and TRUNCATE_USAT_U do not demonstrate the same - optimizations like constant folding don't trigger.

Root cause: a vector constant propagation issue. When analyzing operations like:

```llvm
%masked = and <4 x i32> %x, <i32 255, i32 255, i32 255, i32 255>
%trunc = call <4 x i16> @llvm.aarch64.neon.sqxtun.v4i16(<4 x i32> %masked)
```

the constant vector gets lowered to AArch64-specific nodes (NVCAST → MOVIedit), which lose the constant information. During computeKnownBits analysis:

```
TRUNCATE_SSAT_U: InputKnown = ????????????????????????????????
Input range (signed): [-2147483648, 2147483647]
```

The breaking constants appear to be byte-aligned all-ones values that the AArch64 backend optimizes into MOVI immediate instructions, losing the constant value during the transformation.
I was able to create working tests for TRUNCATE_SSAT_U/TRUNCATE_USAT_U with values like 0x7F00 that do not trigger the MOVI immediate encoding optimization.

We probably want to add computeKnownBits support for AArch64ISD::MOVIedit and related nodes, enabling constant propagation to unblock optimizations like constant folding, which would then make use of this optimization in SelectionDAG.

I'll leave that TODO out of scope for this PR.
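To make that TODO concrete, here is a rough sketch (an assumption of how it could look; not code from this PR or from the existing AArch64 backend) of expanding the 64-bit byte-mask MOVI immediate into a KnownBits constant, which would presumably be reported from AArch64TargetLowering::computeKnownBitsForTargetNode for AArch64ISD::MOVIedit:

```cpp
#include "llvm/ADT/APInt.h"
#include "llvm/Support/KnownBits.h"
#include <cstdint>
using namespace llvm;

// Illustration only: the 64-bit MOVI "byte mask" form encodes an 8-bit
// immediate where bit i selects an all-ones (0xFF) or all-zeros byte i of
// the 64-bit element. Expanding it yields a fully-known constant that value
// tracking could report instead of treating the target node as opaque.
static KnownBits knownBitsForByteMaskMovi(uint8_t Imm) {
  uint64_t Element = 0;
  for (unsigned Byte = 0; Byte != 8; ++Byte)
    if (Imm & (1u << Byte))
      Element |= UINT64_C(0xFF) << (8 * Byte);
  return KnownBits::makeConstant(APInt(64, Element));
}
```

With something along these lines, the byte-aligned all-ones masks above would survive as known constants through the AArch64-specific lowering, and the saturating-truncate folds could trigger.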
✅ With the latest revision this PR passed the C/C++ code formatter.
@RKSimon This PR is becoming the implementation of SelectionDAG::computeKnownBits for TRUNCATE_SSAT_S/U and TRUNCATE_USAT_U.

Fixes #152143