
Sorry, I have never fully understood the rules here. I have deleted all the duplicate posts I made; this is the first post about this issue. Please do not mark it as a duplicate of my other question (Reduce the number of executions by 3 times, but the execution efficiency is almost unchanged. In C): even though the code is somewhat similar, they ask very different questions. They are also two problems that I found on the same day. A similar post of mine was closed as a "misjudged" duplicate, perhaps because I did not explain the issue clearly. I really hope to get an answer, so I reposted it. I hope everyone can see the problem clearly, thank you very much!

In the following C code, I added an "if" statement to the loop of the first timed test, and the execution times are exactly the same. In theory, it should be slower. Even if branch prediction could make their performance almost the same, it actually becomes a lot faster. What is the principle behind this? I tried the clang and gcc compilers on macOS and Linux, with various optimization levels. To keep caching from skewing the result, I let the faster loop execute first, but the loop with the redundant code still executes faster.

If you think my description is not credible, please compile the following code on your own computer and run it. I hope someone can answer this question for me, thanks.

C code:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>

#define TLen 300000000
#define SLen 10

int main(int argc, const char * argv[]) {
    srandom((unsigned)time(NULL));

    // An array to increase the index;
    // the range of its elements is 1-256
    int rand_arr[128];
    for (int i = 0; i < 128; ++i)
        rand_arr[i] = random()%256+1;

    // A random text (very long); the range of its elements is 0-127
    char *tex = malloc((sizeof *tex) * TLen);
    for (int i = 0; i < TLen; ++i)
        tex[i] = random()%128;

    // A random string (very short)
    char *str = malloc((sizeof *str) * SLen);
    for (int i = 0; i < SLen; ++i)
        str[i] = random()%128;

    // The first testing
    clock_t start = clock();
    for (int i = 0; i < TLen; ) {
        if (!memcmp(tex+i, str, SLen))
            printf("yes!\n");
        i += rand_arr[tex[i]];
    }
    clock_t end = clock();
    printf("No.1: %lf s\n", ((double)(end - start)) / CLOCKS_PER_SEC);

    // The second testing
    start = clock();
    for (int i = 0; i < TLen; ) {
        i += rand_arr[tex[i]];
    }
    end = clock();
    printf("No.2: %lf s\n", ((double)(end - start)) / CLOCKS_PER_SEC);

    return 0;
}

I have run it hundreds of times, and the ratio is almost always like this. The following is a representative result of a test on Linux:

No.1: 0.110000 s
No.2: 0.140000 s
  • After the first test all the data is already in the cache. In fact the whole calculation is repeating and theoretically can be optimized out completely. Commented Sep 16, 2021 at 18:19
  • As in your previous question you aren't timing a long enough period for the differences to be meaningful. Function clock() is a blunt tool. Commented Sep 16, 2021 at 18:19
  • @EugeneSh. The data is already in cache before the first test. Commented Sep 16, 2021 at 18:21
  • @WeatherVane Sorry, the previous code had been changed (some unnecessary things were deleted) and I forgot to update this line; it has been fixed now. Commented Sep 16, 2021 at 18:32
  • Aside: Why use tex in str = malloc((sizeof *tex) * SLen);? Commented Sep 16, 2021 at 22:31

2 Answers


The key bottleneck is the load-use latency of cache misses while striding through tex[], as part of the dependency chain from old i to new i. Lookups from the small rand_arr[] will hit in cache and just add about 5 cycles of L1d load-use latency to the loop-carried dependency chain, making it even more unlikely for anything other than latency through i to be relevant to the overall speed of the loop. (Jérôme Richard's answer on your previous question discusses that latency dep chain and the random stride increment.)

memcmp(tex+i, str, SLen) might be acting as prefetching if tex+i is near the end of a cache line. If you compiled with -O0, you're calling glibc memcmp, so the SSE 16-byte or AVX2 32-byte load done in glibc memcmp crosses a cache-line boundary and pulls in the next line. (IIRC, glibc memcmp checks for and avoids page crossings, but not cache-line splits, because those are relatively cheap on modern CPUs. Single-step into it to see; step into the 2nd call to avoid the lazy dynamic linking stub, or compile with -fno-plt.)

Possibly touching some cache lines that would have been skipped by the random stride helps HW prefetch be more aggressive and detect it as a sequential access pattern. A 32-byte AVX load is half a cache line, so it's fairly likely to cross into the next one.
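As a quick sanity check of "fairly likely" (this snippet is mine, not part of the original answer): a 32-byte load starting at byte offset o within a 64-byte line spills into the next line whenever o > 32, i.e. for 31 of the 64 possible start offsets, so roughly half the time if the start offset is random.

#include <stdio.h>

int main(void) {
    int crossing = 0;
    for (int o = 0; o < 64; ++o)      // every possible start offset within a 64-byte line
        crossing += (o + 32 > 64);    // does the 32-byte load [o, o+32) spill past byte 63?
    printf("%d of 64 offsets cross into the next line (%.0f%%)\n",
           crossing, 100.0 * crossing / 64);   // prints 31 of 64 (48%)
    return 0;
}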

So a more-predictable sequence of cache-line requests seen by the L2 cache is a plausible explanation for the memcmp version being faster. (Intel CPUs put the main prefetchers in the L2.)

If you compile with optimization, that 10-byte memcmp inlines and in the inner loop only touches 8 bytes. (If they match, then it jumps to another block where it checks the last 2. But it's exceedingly unlikely that ever happens).
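In C terms, the inlined equality test has roughly this shape (my sketch of what the -O3 asm later in this answer does, not GCC's actual output and not code from the question): one 8-byte compare on the hot path, and the 2-byte tail check only in the rare case where the first 8 bytes match.

#include <stdint.h>
#include <string.h>

// Rough sketch of GCC's inlined 10-byte memcmp-for-equality: compare the
// first 8 bytes as one 64-bit load, and only if those match (exceedingly
// rare here) check the remaining 2 bytes.
static int matches10(const char *a, const char *b) {
    uint64_t a8, b8;
    memcpy(&a8, a, 8);            // one qword load each, like cmp QWORD PTR [r12], rax
    memcpy(&b8, b, 8);
    if (a8 != b8)
        return 0;                 // hot path: mismatch decided by the first 8 bytes
    uint16_t a2, b2;
    memcpy(&a2, a + 8, 2);        // cold path: check the last 2 bytes
    memcpy(&b2, b + 8, 2);
    return a2 == b2;
}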

This might be why -O0 is faster than -O3 on my system, with GCC11.1 on i7-6700k Skylake with DDR4-2666 DRAM, on Arch GNU/Linux. (Your question didn't bother to specify which compiler and options your numbers were from, or which hardware.)

$ gcc -O3 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.041710 s
No.2: 0.063072 s
No.1: 0.040457 s
No.2: 0.061166 s

 Performance counter stats for './a.out' (2 runs):

          1,843.71 msec task-clock                #    0.999 CPUs utilized            ( +-  0.14% )
                  0      context-switches         #    0.000 /sec
                  0      cpu-migrations           #    0.000 /sec
             73,300      page-faults              #   39.757 K/sec                    ( +-  0.00% )
      6,607,078,798      cycles                   #    3.584 GHz                      ( +-  0.06% )
     18,614,303,386      instructions             #    2.82  insn per cycle           ( +-  0.00% )
     19,407,689,229      uops_issued.any          #   10.526 G/sec                    ( +-  0.02% )
     21,970,261,576      uops_executed.thread     #   11.916 G/sec                    ( +-  0.02% )
      5,408,090,087      mem_load_retired.l1_hit  #    2.933 G/sec                    ( +-  0.00% )
          3,861,687      mem_load_retired.l1_miss #    2.095 M/sec                    ( +-  2.11% )

            1.84537 +- 0.00135 seconds time elapsed  ( +- 0.07% )

$ grep . /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
  ... same on all 8 logical cores

The perf counter data is basically meaningless; it's for the whole run time, not the timed regions (which are about 0.1 sec out of 1.8 sec), so the majority of the time here was spent in glibc random(3), which uses "a nonlinear additive feedback random number generator employing a default table of size 31 long integers", and in page faults on the big malloc region.

It's only interesting in terms of the delta between two builds, but the loops outside the timed region still contribute many extra uops so it's still not as interesting as one would like.


vs. gcc -O0: No.1 is faster, and No.2 is, as expected, a bit slower, with -O0 putting a store/reload into the dependency chain through i.

$ gcc -O0 array-stride.c
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r 2 ./a.out
No.1: 0.028402 s
No.2: 0.076405 s
No.1: 0.028079 s
No.2: 0.069492 s

 Performance counter stats for './a.out' (2 runs):

          1,979.57 msec task-clock                #    0.999 CPUs utilized            ( +-  0.04% )
                  0      context-switches         #    0.000 /sec
                  0      cpu-migrations           #    0.000 /sec
             66,656      page-faults              #   33.672 K/sec                    ( +-  0.00% )
      7,252,728,414      cycles                   #    3.664 GHz                      ( +-  0.02% )
     20,507,166,672      instructions             #    2.83  insn per cycle           ( +-  0.01% )
     22,268,130,378      uops_issued.any          #   11.249 G/sec                    ( +-  0.00% )
     25,117,638,171      uops_executed.thread     #   12.688 G/sec                    ( +-  0.00% )
      6,640,523,801      mem_load_retired.l1_hit  #    3.355 G/sec                    ( +-  0.01% )
          3,350,518      mem_load_retired.l1_miss #    1.693 M/sec                    ( +-  1.39% )

          1.9810591 +- 0.0000934 seconds time elapsed  ( +- 0.00% )

Remember this profiling is for the total run time, not the timed regions, so the 2.83 IPC is not for the timed region.


Out of order exec hiding cost of independent work

The actual throughput cost of running that macro-fused cmp/je uop doesn't add to any bottleneck, thanks to out-of-order exec. Nor would even a whole call memcmp@plt plus setting up the args. The bottleneck is latency, not the front-end or the load ports in the back end, and the OoO exec window is deep enough to hide the memcmp work. See also the standard resources on modern out-of-order CPUs. (Yes, it will take a lot of reading to wrap your head around them.)

The loop-carried dependency through i goes i -> tex[i] -> indirectly back into i for next iteration, monotonically increasing it. tex[] is too big to fit in cache, and HW prefetch won't keep up with those strides.

So DRAM bandwidth might be the limiting factor, or even DRAM latency if HW prefetch doesn't detect and lock on to the somewhat-sequential access pattern to consecutive or every-other cache line.

The actual branch predicts perfectly, so it can execute (and confirm the prediction) when the load result arrives, in parallel with the stuff waiting for the sign-extending byte load that started at the same place.


and the execution times are exactly the same. ... it actually becomes a lot faster.

Huh? Your times don't show exactly the same. Yes, it does become measurably faster with that memcmp.

In theory, it should be slower.

Only if your theory is too simplistic to model modern CPUs. Not slowing it down is easily explainable because there is spare out-of-order exec throughput to do that independent work while waiting for the load latency.

There may be other effects relevant to -O0 performance as well.


For reference, the with-memcmp inner loops from GCC on Godbolt, matching what I benchmarked on my desktop.

# gcc11.1 -O0 -fPIE
.L10:                                        # do{
        mov     eax, DWORD PTR -16[rbp]
        movsx   rdx, eax
        mov     rax, QWORD PTR -32[rbp]
        lea     rcx, [rdx+rax]
        mov     rax, QWORD PTR -40[rbp]
        mov     edx, 10
        mov     rsi, rax
        mov     rdi, rcx
        call    memcmp@PLT
        test    eax, eax
        jne     .L9                          # normally taken jump over puts
        lea     rax, .LC0[rip]
        mov     rdi, rax
        call    puts@PLT
.L9:
        mov     eax, DWORD PTR -16[rbp]
        movsx   rdx, eax
        mov     rax, QWORD PTR -32[rbp]
        add     rax, rdx                     # strange not to use an addressing mode, but ok
        movzx   eax, BYTE PTR [rax]
        movsx   eax, al                      # GCC -O0 is dumb,
        cdqe                                 # really dumb.
        mov     eax, DWORD PTR -576[rbp+rax*4]   # look up in rand_arr[]
        add     DWORD PTR -16[rbp], eax          # i += ...
.L8:
        cmp     DWORD PTR -16[rbp], 299999999
        jle     .L10                         # }while(i<300000000)

vs. the -O3 code:

# RBP holds char *tex at this point
.L10:                                        # do{
        movsx   r12, r13d                    # sign-extend i
        mov     rax, QWORD PTR [rbx]         # str[0..7] gets reloaded because alias analysis and the
                                             #  possible call to puts defeated the optimizer.
                                             #  Avoiding malloc for it may have helped.
        add     r12, rbp                     # i+tex to avoid indexed addressing modes later? Probably not optimal
        cmp     QWORD PTR [r12], rax         # first 8 bytes of the memcmp
        je      .L18                         # code at .L18 checks the next 2 and maybe does puts, before jumping back
.L5:
        movsx   rax, BYTE PTR [r12]          # sign-extending byte load to pointer width, of tex[i]
        add     r13d, DWORD PTR [rsp+rax*4]  # i += look up in rand_arr[]
        cmp     r13d, 299999999
        jle     .L10                         # }while(i < 300000000)

Since the je .L18 is never taken, this is 7 uops per iteration, so Skylake can comfortably issue it at under 2 cycles per iteration.

Even with L1d cache hits, the loop-carried dependency chain through i (R13D) is roughly:

  • movsx r12, r13d (sign-extend i): 1 cycle
  • add r12, rbp (form tex+i): 1 cycle
  • movsx rax, BYTE PTR [r12] (load tex[i]): about 5 cycles of L1d load-use latency
  • add r13d, DWORD PTR [rsp+rax*4] (load from rand_arr[] plus the add): about 5 + 1 cycles

So a total of about 13 cycles of latency in the best case with L1d hits, leaving tons of free "slots" in the front-end and idle back-end execution units, so that even a call to the actual glibc memcmp would be no big deal.

(Of course the -O0 code is noisier, so some of those free slots in the pipeline are already used up, but the dep chain is even longer because -O0 code keeps i in memory: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?)


CPU frequency in memory-bottlenecked loops, especially on Skylake

The init loops provide plenty of time for the CPU to get to max turbo before the timed region, and to pre-fault the allocated pages by touching them first. Idiomatic way of performance evaluation?
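If the init loops weren't there, you could get the same effect explicitly; here is a minimal sketch (my illustration, assuming a typical 4 KiB page size, not code from the original program) of pre-faulting a freshly malloc'd buffer before filling it and spinning briefly so the CPU has ramped up before the timed region.

#include <stddef.h>
#include <time.h>

// Touch every page of a freshly malloc'd buffer so the page faults happen
// before the timed region, not inside it (call this before filling the data).
static void prefault(char *buf, size_t len) {
    for (size_t i = 0; i < len; i += 4096)    // assumed 4 KiB pages
        buf[i] = 0;
}

// Spin briefly so the CPU has reached max turbo before timing starts.
static void warm_up_cpu(double seconds) {
    clock_t t0 = clock();
    while ((double)(clock() - t0) / CLOCKS_PER_SEC < seconds) {
        /* busy-wait; the repeated clock() calls are the work */
    }
}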

There might also be an effect of keeping CPU frequency higher if there's more work, less stalling, during those waits for incoming data from cache miss loads. See Slowing down CPU Frequency by imposing memory stress.

In the above results, I tested with EPP = balance_performance.

I also tested with performance, and -O0 (libc memcmp) is still faster than -O3 (inlined memcmp), but all 4 loops did speed up some (with/without memcmp, and optimized or not).

compiler / EPP setting        No.1 if(memcmp)   No.2 plain
-O0 / balance_performance     0.028079 s        0.069492 s
-O0 / performance             0.026898 s        0.053805 s
-O3 / balance_performance     0.040457 s        0.061166 s
-O3 / performance             0.035621 s        0.047475 s

The faster-at--O0 effect is still present and very significant even with EPP=performance, so we know it's not just a matter of clock speed. From previous experiments, performance keeps it pegged at max turbo even in memory-bound conditions where other settings would clock down from 4.2 to 2.7 GHz. So it's very likely that calling memcmp is helping trigger better prefetching to reduce average cache-miss latency.

I unfortunately didn't use a fixed seed for the PRNG so there may be some variation due to the randomness being good vs. bad for the prefetcher, in terms of patterns of whole cache lines accessed. And I just took one specific run for each (the 2nd one in the pair started by perf stat -r 2, so it should hopefully be even less affected by system fluctuation.)
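(For reproducibility, a fixed seed would pin down the exact tex[]/str/rand_arr contents across runs and builds; the constant below is arbitrary, my addition rather than anything from the original code.)

srandom(12345);   // arbitrary fixed seed instead of srandom((unsigned)time(NULL))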

Looks like performance made a bigger difference to the No.2 loop (with less going on in it), and also to the -O3 version (again with less going on in it), which is consistent with Skylake lowering clock speed when a core is pretty much only bottlenecking on memory, not running many other instructions, in EPP settings other than performance.


2 Comments

Also forgot to mention CPU frequency ramp up. The init loops give time for that to happen. Idiomatic way of performance evaluation?
Hmm, more work in the memory-bottleneck loop may actually be helping Skylake decide not to ramp down the CPU frequency during that portion, if you don't have EPP set to energy_performance_preference = performance (rather than balance_power or balance_performance). Slowing down CPU Frequency by imposing memory stress

Short answer, and pure speculation: in the first timed loop, the buffer pointed to by tex is still in cache after initialization. The printf does a lot of complicated stuff, including kernel calls, and those cache lines are then very likely refilled with other data.

The second timed loop now has to re-read the data from RAM.

Try to swap the loops, and/or delay the printf to after both measurements.

Oh, and the call to memcmp can read past the end of tex[], a potential buffer over-read.
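A sketch of the suggested restructuring (reusing the question's names; this is my illustration, not the original poster's code): count matches in the timed loop, print only after both loops have been timed, and skip the memcmp when fewer than SLen bytes remain so it cannot read past the end of tex[]. This would replace the two timed sections inside main().

long matches = 0;
clock_t start = clock();
for (int i = 0; i < TLen; ) {
    if (i + SLen <= TLen && !memcmp(tex + i, str, SLen))
        ++matches;                              // defer all printing until later
    i += rand_arr[tex[i]];
}
clock_t end = clock();
double t1 = (double)(end - start) / CLOCKS_PER_SEC;

start = clock();
for (int i = 0; i < TLen; ) {
    i += rand_arr[tex[i]];
}
end = clock();
double t2 = (double)(end - start) / CLOCKS_PER_SEC;

printf("No.1: %lf s (matches: %ld)\nNo.2: %lf s\n", t1, matches, t2);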

4 Comments

Very unlikely that there are cache hits in either case; tex[] is ~300MB vs. even a big Xeon only has about 50MiB of L3, and a desktop more like 8MiB. The last writes of random() results were at the end, but the first timed loop starts reading it at the start. Also, lazy dynamic linking of clock() does a whole bunch of stuff between init and the first timed loop. (Compile with -fno-plt to use early binding, so just the code for clock() itself has to run.)
You're right about memcmp potentially reading up to 9 bytes past the end of tex[], though; I noticed that too but there wasn't a good spot in my already-long answer to mention bugs that weren't part of the perf effect.
@PeterCordes, as I said, "Pure speculation", feel free to downvote.
Printing after both loops is probably better benchmark methodology in general so I won't actually downvote; reserving that for answers where the guesswork is based on misconceptions. On some OSes, I could even imagine waiting for a TTY write to finish might lead to CPU frequency dropping down to idle, or to migrating the process to another core if it wakes up the terminal handler on this one. (Linux with an X11 terminal emulator doesn't do that, but if anyone's testing on Windows, its console output might be a lot slower.)
