Description
I have a high-performance decompression function: https://godbolt.org/z/5cYberxnT (using array references and compiling for aarch64). When I remove the `[...; 256]` array sizes and just use slices instead, the loop vectorizes and runs faster. I would expect the array-reference version to compile to roughly the same code, just with fewer bounds checks; if anything, knowing that the inputs have sufficient length should make vectorization more likely.
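For context, the function has roughly this shape (a paraphrase reconstructed from the IR, not the exact godbolt code; all names and the `base` parameter are my own):

```rust
// Sketch of the two variants (hypothetical names; the real code is in the
// godbolt link). Variant A takes `&[_; 256]` array references, variant B takes
// plain slices; the loop body is the same bit-extraction either way.
fn decompress_arrays(
    input: &[u8],
    bit_offsets: &[u32; 256],
    widths: &[u32; 256],
    base: u32,
    out: &mut [u64; 256],
) {
    for i in 0..256 {
        let bit = bit_offsets[i] + base;
        // Unaligned 8-byte load at the containing byte, then shift and mask.
        let word = u64::from_le_bytes(input[(bit >> 3) as usize..][..8].try_into().unwrap());
        let mask = !(u64::MAX << (widths[i] & 63));
        out[i] += (word >> (bit & 7)) & mask;
    }
}

// Same body with slices: the bounds checks can panic, yet this is the
// version that vectorizes.
fn decompress_slices(input: &[u8], bit_offsets: &[u32], widths: &[u32], base: u32, out: &mut [u64]) {
    for i in 0..out.len() {
        let bit = bit_offsets[i] + base;
        let word = u64::from_le_bytes(input[(bit >> 3) as usize..][..8].try_into().unwrap());
        let mask = !(u64::MAX << (widths[i] & 63));
        out[i] += (word >> (bit & 7)) & mask;
    }
}
```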
To debug this, I lined up the LLVM IR for the two approaches right before the LoopVectorize pass (A = array references, B = slices):
```
A                                             B
--------------------------------------------  --------------------------------------------
%10 = phi i64 [0,%6], [%11,%9]                (loop index handled externally in B)
%11 = add %10, 1
%12 = gep i32, %4, %10                        %23 = gep i32, %3, %13
%13 = load i32, %12                           %24 = load i32, %23
%14 = gep i32, %3, %10                        %33 = gep i32, %7, %13
%15 = load i32, %14                           %34 = load i32, %33
%16 = add i32 %15, %7                         %25 = add i32 %24, %10
%17 = lshr i32 %16, 3                         %26 = lshr i32 %25, 3
%18 = and i32 %16, 7                          %30 = and i32 %25, 7
%19 = zext i32 %17                            %27 = zext i32 %26
%20 = gep i8, %1, %19                         %28 = gep i8, %1, %27
%21 = load i64, %20                           %29 = load i64, %28
%22 = zext i32 %18                            %31 = zext i32 %30
%23 = lshr i64 %21, %22                       %32 = lshr i64 %29, %31
%24 = and i32 %13, 63                         %35 = and i32 %34, 63
%25 = zext i32 %24                            %36 = zext i32 %35
%26 = shl i64 -1, %25                         %37 = shl i64 -1, %36
%27 = xor i64 %26, -1                         %38 = xor i64 %37, -1
%28 = and i64 %23, %27                        %39 = and i64 %32, %38
%29 = gep i64, %5, %10                        %40 = gep i64, %7, %13
%30 = load i64, %29                           %41 = load i64, %40
%31 = add i64 %28, %30                        %42 = add i64 %39, %41
store i64 %31, %29                            store i64 %42, %40
%32 = icmp eq %11, 256                        %43 = icmp eq %14, 256
br                                            br
```

The two loops are identical, except that A advances the induction variable with a phi inside the loop, whereas B handles the iteration externally:
```
19:                                               ; preds = %16
  %20 = icmp eq i64 %13, %8
  br i1 %20, label %44, label %22
```

Additionally, I collected LLVM remarks from the LoopVectorize pass for both:
```
=====A=====
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (missed): the cost-model indicates that vectorization is not beneficial
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (missed): the cost-model indicates that interleaving is not beneficial
=====B=====
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (analysis): the cost-model indicates that interleaving is not beneficial
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (success): vectorized loop (vectorization width: 4, interleaved count: 1)
```

LLVM also has these annotations on the relevant input arrays:
```
====A====
ptr noalias noundef readonly align 4 captures(none) dereferenceable(1024) %3
ptr noalias noundef readonly align 4 captures(none) dereferenceable(1024) %4
ptr noalias noundef align 8 captures(none) dereferenceable(2048) %5
====B====
ptr noalias noundef nonnull readonly align 4 captures(none) %3
ptr noalias noundef nonnull readonly align 4 captures(none) %5
ptr noalias noundef nonnull align 8 captures(none) %7
```

The main difference seems to be that B has `nonnull` on all of them.
I also checked the pre-opt-pipeline LLVM IR, and the only difference between the two is that B uses `icmp ult` inside the loop, whereas A updates the loop variable externally. So as I understand it, there are three possible paths forward, and any subset of them could be taken:
- To my knowledge, it should be safe for Rust to annotate these array references as nonnull. That may fix vectorization in A.
- We could make Rust compile bounds-safe, array-reference loops the same way it compiles panicky slice loops (using `icmp ult` within the loop). This might also help, but I don't know what the other consequences would be.
- (Outside of Rust) It may make sense to change LLVM's cost model.
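On the first point: Rust references are guaranteed non-null at the language level, so a `nonnull` annotation on `&[T; N]` parameters should be sound. A minimal sketch of that guarantee (identifiers mine):

```rust
use std::ptr::NonNull;

// A shared reference in Rust can never be null, so converting its
// underlying pointer to NonNull always succeeds. This is the language
// guarantee that would justify emitting `nonnull` for `&[u32; 256]` args.
fn array_ref_is_nonnull(a: &[u32; 256]) -> NonNull<u32> {
    NonNull::new(a.as_ptr().cast_mut()).expect("references are never null")
}
```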
Rust version: 1.91.0 (also reproduces on 1.89 and 1.90)