
Conversation

@krtab
Contributor

@krtab krtab commented Jun 26, 2024

This PR aims to improve the performance of std::slice::binary_search.

EDIT: The proposed implementation changed so the rest of this comment is outdated. See #127007 (comment) for an up to date presentation of the PR.

It reduces the total instruction count for the u32 monomorphization, and, maybe more remarkably, removes 2 of the 12 instructions in the main loop (on x86).

It changes test_binary_search_implementation_details(), so it may warrant a crater run.

I will document it much more if this is shown to be interesting on benchmarks. Could we start with a timer run first?

Before the PR

```asm
 mov eax, 1
 test rsi, rsi
 je .LBB0_1
 mov rcx, rdx
 mov rdx, rsi
 mov ecx, dword ptr [rcx]
 xor esi, esi
 mov r8, rdx
.LBB0_3:
 shr rdx
 add rdx, rsi
 mov r9d, dword ptr [rdi + 4*rdx]
 cmp r9d, ecx
 je .LBB0_4
 lea r10, [rdx + 1]
 cmp r9d, ecx
 cmova r8, rdx
 cmovb rsi, r10
 mov rdx, r8
 sub rdx, rsi
 ja .LBB0_3
 mov rdx, rsi
 ret
.LBB0_1:
 xor edx, edx
 ret
.LBB0_4:
 xor eax, eax
 ret
```

After the PR

```asm
 mov ecx, dword ptr [rdx]
 xor eax, eax
 xor edx, edx
.LBB1_1:
 cmp rsi, 1
 jbe .LBB1_2
 mov r9, rsi
 shr r9
 lea r8, [r9 + rdx]
 sub rsi, r9
 cmp dword ptr [rdi + 4*r8], ecx
 cmovb rdx, r8
 cmova rsi, r9
 jne .LBB1_1
 mov rdx, r8
 ret
.LBB1_2:
 test rsi, rsi
 je .LBB1_3
 xor eax, eax
 cmp dword ptr [rdi + 4*rdx], ecx
 setne al
 adc rdx, 0
 ret
.LBB1_3:
 mov eax, 1
 ret
```
@rustbot
Collaborator

rustbot commented Jun 26, 2024

r? @m-ou-se

rustbot has assigned @m-ou-se.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 26, 2024
@krtab krtab force-pushed the improv_binary_search branch 2 times, most recently from 7aa251c to e4a10ae on June 26, 2024 at 21:34
@krtab krtab marked this pull request as draft June 26, 2024 21:48
@rust-log-analyzer

This comment has been minimized.

@krtab krtab force-pushed the improv_binary_search branch from e4a10ae to a06e95e on June 26, 2024 at 22:04
@rust-log-analyzer

This comment has been minimized.

@krtab
Contributor Author

krtab commented Jun 26, 2024

I've run benches with ./x bench core --test-args binary_search and so far this PR unfortunately shows no improvement. This is likely because the reduced instruction count per iteration does not compensate for the added number of iterations.

@krtab krtab force-pushed the improv_binary_search branch 2 times, most recently from 27237e5 to a545bb3 on June 26, 2024 at 23:40
@rust-log-analyzer

This comment has been minimized.

@krtab krtab force-pushed the improv_binary_search branch from a545bb3 to 2f5eec9 on June 27, 2024 at 08:34
@krtab
Copy link
Contributor Author

krtab commented Jun 27, 2024

A different strategy bore fruit! It relies on the fact that, except for ZSTs, slice.len() is less than isize::MAX, so (left + right) / 2 cannot overflow and is a valid way to compute the midpoint. By treating the ZST case separately, we can "revert" 3eb5bee and win some instructions.
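For illustration, here is a minimal sketch of this midpoint computation (not the PR's actual code; the function name is invented):

```rust
use std::cmp::Ordering;

// Minimal sketch of a binary search whose midpoint is `(left + right) / 2`.
// For non-ZST element types this cannot overflow: a slice occupies at most
// `isize::MAX` bytes, so `left + right <= 2 * len` fits in a `usize`.
// (Slices of ZSTs, whose length may exceed `isize::MAX`, need separate handling.)
fn binary_search_sketch<T: Ord>(slice: &[T], target: &T) -> Result<usize, usize> {
    let mut left = 0;
    let mut right = slice.len();
    while left < right {
        // left <= (left + right) / 2 < right, so the index is in bounds.
        let mid = (left + right) / 2;
        match slice[mid].cmp(target) {
            Ordering::Less => left = mid + 1,
            Ordering::Greater => right = mid,
            Ordering::Equal => return Ok(mid),
        }
    }
    Err(left)
}

fn main() {
    let v = [1u32, 3, 5, 7, 9];
    assert_eq!(binary_search_sketch(&v, &5), Ok(2));
    assert_eq!(binary_search_sketch(&v, &4), Err(2));
    println!("ok");
}
```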

ZST comparing to Equal special case

Assembly:

Before

```asm
 mov rdx, rsi
 xor eax, eax
 test rsi, rsi
 sete al
 shr rdx
 ret
```

After

```asm
 xor eax, eax
 test rsi, rsi
 sete al
 xor edx, edx
 ret
```

usize on x86_64

Assembly

Before

```asm
example::custom_binary_search_gt::hb56715903c41dc94:
 mov eax, 1
 test rsi, rsi
 je .LBB0_1
 mov rcx, rdx
 mov rdx, rsi
 mov rcx, qword ptr [rcx]
 xor esi, esi
 mov r8, rdx
.LBB0_3:
 shr rdx
 add rdx, rsi
 mov r9, qword ptr [rdi + 8*rdx]
 cmp r9, rcx
 je .LBB0_4
 lea r10, [rdx + 1]
 cmp r9, rcx
 cmova r8, rdx
 cmovb rsi, r10
 mov rdx, r8
 sub rdx, rsi
 ja .LBB0_3
 mov rdx, rsi
 ret
.LBB0_1:
 xor edx, edx
 ret
.LBB0_4:
 xor eax, eax
 ret
```

After

```asm
example::custom_binary_search_1::hc0743890bf3b6f85:
 mov rcx, qword ptr [rdx]
 xor eax, eax
 xor edx, edx
.LBB1_1:
 cmp rsi, rdx
 jbe .LBB1_2
 lea r8, [rsi + rdx]
 shr r8
 lea r9, [r8 + 1]
 cmp qword ptr [rdi + 8*r8], rcx
 cmovb rdx, r9
 cmova rsi, r8
 jne .LBB1_1
 mov rdx, r8
 ret
.LBB1_2:
 mov eax, 1
 ret
```

Benches

Before

```text
$ ./x bench core --test-args binary_search
benchmarks:
    slice::binary_search_l1             20.69/iter +/- 0.45
    slice::binary_search_l1_with_dups   12.17/iter +/- 0.06
    slice::binary_search_l1_worst_case   9.21/iter +/- 0.03
    slice::binary_search_l2             31.81/iter +/- 0.25
    slice::binary_search_l2_with_dups   21.05/iter +/- 0.29
    slice::binary_search_l2_worst_case  15.46/iter +/- 0.32
    slice::binary_search_l3             75.05/iter +/- 0.95
    slice::binary_search_l3_with_dups   53.66/iter +/- 0.89
    slice::binary_search_l3_worst_case  27.74/iter +/- 0.03
```

After

```text
$ ./x bench core --test-args binary_search
benchmarks:
    slice::binary_search_l1             20.19/iter +/- 3.60
    slice::binary_search_l1_with_dups   14.24/iter +/- 0.57
    slice::binary_search_l1_worst_case   8.77/iter +/- 3.00
    slice::binary_search_l2             29.81/iter +/- 0.64
    slice::binary_search_l2_with_dups   19.45/iter +/- 1.32
    slice::binary_search_l2_worst_case  14.08/iter +/- 0.65
    slice::binary_search_l3             70.80/iter +/- 6.43
    slice::binary_search_l3_with_dups   52.35/iter +/- 2.77
    slice::binary_search_l3_worst_case  23.62/iter +/- 2.73
```
@lqd
Member

lqd commented Jun 29, 2024

Could we start with a timer run first?

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 29, 2024
@bors
Collaborator

bors commented Jun 29, 2024

⌛ Trying commit 2f5eec9 with merge 3dc1b1e...

bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 29, 2024
Improve slice::binary_search_by
@bors
Collaborator

bors commented Jun 29, 2024

☀️ Try build successful - checks-actions
Build commit: 3dc1b1e (3dc1b1e098adb3f6866155993a91f1cad3513a9c)

@rust-timer

This comment has been minimized.

@rust-timer
Collaborator

Finished benchmarking commit (3dc1b1e): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.5% | [-0.5%, -0.5%] | 1 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results (primary 0.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.5% | [2.5%, 2.5%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -2.5% | [-2.5%, -2.5%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-2.5%, 2.5%] | 2 |

Cycles

Results (primary -2.6%, secondary 2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.1% | [2.1%, 2.1%] | 1 |
| Improvements ✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |

Binary size

Results (primary -0.0%, secondary -0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.2%] | 4 |
| Regressions ❌ (secondary) | 0.5% | [0.5%, 0.5%] | 1 |
| Improvements ✅ (primary) | -0.1% | [-0.2%, -0.1%] | 6 |
| Improvements ✅ (secondary) | -0.1% | [-0.2%, -0.0%] | 41 |
| All ❌✅ (primary) | -0.0% | [-0.2%, 0.2%] | 10 |

Bootstrap: 694.857s -> 695.931s (0.15%)
Artifact size: 324.45 MiB -> 324.42 MiB (-0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 29, 2024
@Mili-ssm

Mili-ssm commented Jun 29, 2024

Hi, I discovered this PR this morning; yesterday I wrote a post on Zulip with some ideas on how to improve the std binary search. In case it is worth something or useful here, the post is Proposal for Binary Search Algorithm Enhancing.

I am not sure whether it is desirable, better, or worse, but since the algorithm is under review maybe it is helpful.

@krtab krtab force-pushed the improv_binary_search branch from 2f5eec9 to 636412c on July 1, 2024 at 11:55
@krtab krtab marked this pull request as ready for review July 1, 2024 11:56
Except for ZSTs, `[T].len()` is less than `isize::MAX`. By treating the ZST case separately, we can "revert" 3eb5bee, remove the `size` local variable and win two instructions.
@krtab krtab force-pushed the improv_binary_search branch from 636412c to 4093a4d on July 1, 2024 at 11:59
@krtab
Contributor Author

krtab commented Jul 1, 2024

All benchmarks are good and I added comments, so this is ready for review.

Comment on lines +2820 to +2822

```rust
// SAFETY: We have that left < right, so
// 0 <= left <= mid < right <= self.len()
// and the indexing is in-bounds.
```
Contributor

Within the safety comment, there could be an explicit mention of where the mid relations come from.
As a reader, it's easier to evaluate the logic in the comment without needing to consult the line of code above at the same time.

Suggested change

```diff
-// SAFETY: We have that left < right, so
+// SAFETY: The while condition ensures left < right,
+// so `left <= (left + right) / 2 < right`. Thus,
 // 0 <= left <= mid < right <= self.len()
 // and the indexing is in-bounds.
```
@Miksel12

Miksel12 commented Jul 1, 2024

This binary search implementation by @orlp might also be interesting to look at: https://orlp.net/blog/bitwise-binary-search/

@orlp
Contributor

orlp commented Jul 1, 2024

@Miksel12 That article is mostly for educational purposes. I believe this to be the fastest pattern possible: https://mhdm.dev/posts/sb_lower_bound/#update (if you can get the compiler to generate cmovs).

@krtab
Contributor Author

krtab commented Jul 2, 2024

@Miksel12 That article is mostly for educational purposes. I believe this to be the fastest pattern possible: https://mhdm.dev/posts/sb_lower_bound/#update (if you can get the compiler to generate cmovs).

The massive difference is that this version does not have an early return for the equality case. Removing this early return from our implementation would also bring the instruction count down but I don't think this is what we want.

@orlp
Contributor

orlp commented Jul 2, 2024

@krtab I don't think the early return is valuable. It only comes into play if you're exceptionally lucky, and then it only saves O(log n) iterations in the best case. It adds an extra branch to every iteration every time you're not exceptionally lucky.

To be more precise, the expected number of comparisons saved on random data is 1/2 * (0 + 1/2 * (1 + 1/2 * (2 + ...))) = 1/2 + 1/4 + 1/8 + 1/16 + ... = 1. That's right, the early return on average for random data only saves a single comparison.
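That expectation can be checked numerically (illustrative snippet, not part of the PR; the helper name is invented):

```rust
// Partial sum of the series from the comment above:
// sum over k >= 0 of k / 2^(k + 1), which converges to 1.
fn expected_savings(terms: i32) -> f64 {
    (0..terms).map(|k| k as f64 / 2f64.powi(k + 1)).sum()
}

fn main() {
    let s = expected_savings(64);
    // The tail beyond 64 terms is negligibly small.
    assert!((s - 1.0).abs() < 1e-9);
    println!("expected comparisons saved ~ {s:.6}");
}
```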

For what it's worth, in my experience you almost always want partition_point and not binary_search.
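For readers unfamiliar with it, a short illustration of the standard `slice::partition_point`, which computes the insertion point directly and has no equality special case (illustrative snippet, not from the PR):

```rust
// `partition_point` returns the index of the first element for which
// the predicate is false, i.e. the lower bound / insertion point.
fn main() {
    let v = [1, 3, 3, 5, 9];
    // Insertion point for 4: first index with element >= 4.
    assert_eq!(v.partition_point(|&x| x < 4), 3);
    // First index with element >= 3: the start of the run of 3s.
    assert_eq!(v.partition_point(|&x| x < 3), 1);
    // Matches binary_search's Err index when the element is absent.
    assert_eq!(v.binary_search(&4), Err(3));
    println!("ok");
}
```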

@Amanieu
Member

Amanieu commented Jul 26, 2024

In #128254 I actually revert to the old algorithm which turns out to be much faster once you ensure that LLVM doesn't turn CMOV into branches.

@bors
Collaborator

bors commented Aug 2, 2024

☔ The latest upstream changes (presumably #128254) made this pull request unmergeable. Please resolve the merge conflicts.

@Amanieu
Member

Amanieu commented Sep 18, 2024

Superseded by #128254. However I still want to thank @krtab for your effort in looking into this.
