
Using array references instead of slices can hurt performance #149293

@mwlon

Description


I have a high-performance decompression function: https://godbolt.org/z/5cYberxnT (using array references and compiling for aarch64). I find that when I remove the `[...; 256]` array sizes and use slices instead, I get vectorization and faster performance. I would instead expect the array-reference version to compile to roughly the same code, just with fewer bounds checks; if anything, knowing that the inputs have sufficient length should make vectorization more likely, not less.
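To make the two variants concrete, here is a hypothetical reconstruction of this kind of bit-unpacking loop (the function names and exact bit manipulation are invented for illustration; the real function is in the godbolt link above). Variant A takes array references with a compile-time-known length; variant B is identical except the lengths are erased from the types:

```rust
// Variant A: array references, lengths known at compile time.
fn decode_arrays(src: &[u8], bit_idxs: &[u32; 256], bit_lens: &[u32; 256], dst: &mut [u64; 256]) {
    for i in 0..256 {
        let bit_idx = bit_idxs[i];
        let byte_idx = (bit_idx >> 3) as usize;
        let bits_past_byte = bit_idx & 7;
        // Read 8 bytes starting at the byte containing `bit_idx`.
        let word = u64::from_le_bytes(src[byte_idx..byte_idx + 8].try_into().unwrap());
        // Keep only the lowest `bit_lens[i] & 63` bits (mask = ~(-1 << n)).
        let mask = !(u64::MAX << (bit_lens[i] & 63));
        dst[i] += (word >> bits_past_byte) & mask;
    }
}

// Variant B: same body, but slices instead of array references.
fn decode_slices(src: &[u8], bit_idxs: &[u32], bit_lens: &[u32], dst: &mut [u64]) {
    for i in 0..256 {
        let bit_idx = bit_idxs[i];
        let byte_idx = (bit_idx >> 3) as usize;
        let bits_past_byte = bit_idx & 7;
        let word = u64::from_le_bytes(src[byte_idx..byte_idx + 8].try_into().unwrap());
        let mask = !(u64::MAX << (bit_lens[i] & 63));
        dst[i] += (word >> bits_past_byte) & mask;
    }
}

fn main() {
    let src = vec![0xFFu8; 264];
    let bit_idxs: [u32; 256] = core::array::from_fn(|i| (i as u32) * 8);
    let bit_lens = [4u32; 256];

    let mut a = [0u64; 256];
    decode_arrays(&src, &bit_idxs, &bit_lens, &mut a);
    let mut b = [0u64; 256];
    decode_slices(&src, &bit_idxs, &bit_lens, &mut b);

    assert_eq!(a, b);
    println!("first output word: {}", a[0]);
}
```

The two functions compute the same thing; the only difference is whether the input lengths are carried in the types, which is exactly the change that flips vectorization on and off.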

To debug this, I've lined up the LLVM IR for the two approaches right before the LoopVectorize pass (A = array references, B = slices):

```
A                                             B
--------------------------------------------  --------------------------------------------
%10 = phi i64 [0,%6], [%11,%9]                (loop index handled externally in B)
%11 = add %10, 1
%12 = gep i32, %4, %10                        %23 = gep i32, %3, %13
%13 = load i32, %12                           %24 = load i32, %23
%14 = gep i32, %3, %10                        %33 = gep i32, %7, %13
%15 = load i32, %14                           %34 = load i32, %33
%16 = add i32 %15, %7                         %25 = add i32 %24, %10
%17 = lshr i32 %16, 3                         %26 = lshr i32 %25, 3
%18 = and i32 %16, 7                          %30 = and i32 %25, 7
%19 = zext i32 %17                            %27 = zext i32 %26
%20 = gep i8, %1, %19                         %28 = gep i8, %1, %27
%21 = load i64, %20                           %29 = load i64, %28
%22 = zext i32 %18                            %31 = zext i32 %30
%23 = lshr i64 %21, %22                       %32 = lshr i64 %29, %31
%24 = and i32 %13, 63                         %35 = and i32 %34, 63
%25 = zext i32 %24                            %36 = zext i32 %35
%26 = shl i64 -1, %25                         %37 = shl i64 -1, %36
%27 = xor i64 %26, -1                         %38 = xor i64 %37, -1
%28 = and i64 %23, %27                        %39 = and i64 %32, %38
%29 = gep i64, %5, %10                        %40 = gep i64, %7, %13
%30 = load i64, %29                           %41 = load i64, %40
%31 = add i64 %28, %30                        %42 = add i64 %39, %41
store i64 %31, %29                            store i64 %42, %40
%32 = icmp eq %11, 256                        %43 = icmp eq %14, 256
br                                            br
```

The two loops are identical, except that A iterates via a phi node, whereas B handles the iteration in an external block:

```
19:                                 ; preds = %16
  %20 = icmp eq i64 %13, %8
  br i1 %20, label %44, label %22
```

Additionally, I collected LLVM remarks for the LoopVectorize pass on both:

```
=====A=====
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (missed): the cost-model indicates that vectorization is not beneficial
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (missed): the cost-model indicates that interleaving is not beneficial
=====B=====
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (analysis): the cost-model indicates that interleaving is not beneficial
note: /rustc/29483883eed69d5fb4db01964cdf2af4d86e9cb2/library/core/src/iter/range.rs:765:12 loop-vectorize (success): vectorized loop (vectorization width: 4, interleaved count: 1)
```

LLVM also has these annotations for the relevant input arrays:

```
====A====
ptr noalias noundef readonly align 4 captures(none) dereferenceable(1024) %3
ptr noalias noundef readonly align 4 captures(none) dereferenceable(1024) %4
ptr noalias noundef align 8 captures(none) dereferenceable(2048) %5
====B====
ptr noalias noundef nonnull readonly align 4 captures(none) %3
ptr noalias noundef nonnull readonly align 4 captures(none) %5
ptr noalias noundef nonnull align 8 captures(none) %7
```

It seems the main difference is that B has `nonnull` on all of them, while A does not.
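For context on the `nonnull` point: Rust references are guaranteed non-null as a matter of language layout rules; the compiler even relies on this to store `Option<&T>`'s `None` as the null pointer, so the two types have the same size. A quick check:

```rust
use std::mem::size_of;

fn main() {
    // The null value is the niche used for `Option<&T>`'s `None`,
    // so wrapping a reference in `Option` adds no size.
    assert_eq!(size_of::<Option<&[u32; 256]>>(), size_of::<&[u32; 256]>());
    println!(
        "Option<&[u32; 256]> occupies {} bytes",
        size_of::<Option<&[u32; 256]>>()
    );
}
```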

I also checked the pre-opt-pipeline LLVM IR, and the only difference between the two is that B uses `icmp ult` inside the loop, whereas A updates the loop variable externally. So as I understand it, there are three possible paths forward, and any subset of them could be taken:

  • To my knowledge, it should be safe for Rust to annotate these array references as nonnull. That may fix vectorization in A.
  • We could make Rust compile bounds-safe array-reference loops the same way it compiles panicky slice loops (using `icmp ult` within the loop). This might also help, but I don't know what the other consequences would be.
  • (outside of Rust) It may make sense to change LLVM's cost model.
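For concreteness, the loop-shape difference behind the second bullet can be sketched like this (illustrative Rust mirroring the IR shapes, not the actual codegen):

```rust
// Variant A shape: the trip count is exact, so the exit test is an
// equality check after the increment (`icmp eq %11, 256` in the IR).
fn sum_a(xs: &[u64; 8]) -> u64 {
    let mut total = 0;
    let mut i = 0;
    loop {
        total += xs[i];
        i += 1;
        if i == 8 {
            break;
        }
    }
    total
}

// Variant B shape: the bound is compared with `<` at the top of the loop,
// which lowers to an `icmp ult` inside the loop.
fn sum_b(xs: &[u64]) -> u64 {
    let mut total = 0;
    let mut i = 0;
    while i < xs.len() {
        total += xs[i];
        i += 1;
    }
    total
}

fn main() {
    let xs = [1u64, 2, 3, 4, 5, 6, 7, 8];
    assert_eq!(sum_a(&xs), 36);
    assert_eq!(sum_b(&xs), 36);
    println!("both loop shapes sum to {}", sum_a(&xs));
}
```

Both compute the same result; the question is purely which control-flow shape LLVM's cost model ends up willing to vectorize.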

Rust version: 1.91.0 (also applies in 1.89, 1.90)


Labels: A-codegen (Area: Code generation), C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such), I-slow (Issue: Problems and improvements with respect to performance of generated code), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)
