Return to Answer

Improved formatting

edited Apr 29 at 22:43

258.7k
21
370
830

Time of `cReverseDot[Real][v]]` divided by the time of `v.Reverse[v]`, with `v` not in cache. Green indicates `cReverseDot[]` was faster, and red indicates `Dot[]` + `Reverse[]` was faster. `cReverseDot[]` is usually faster on shorter vectors, and they're roughly tied on longer ones. The horizontal axis ticks are `Log2[]` of the size.

Time of cReverseDot[Real][v]] divided by the time of v.Reverse[v], with v not in cache. Green indicates cReverseDot[] was faster, and red indicates Dot[] + Reverse[] was faster. cReverseDot[] is usually faster on shorter vectors, and they're roughly tied on longer ones. The horizontal axis ticks are Log2[] of the size.

Time of `cReverseDot[Real][v]]` divided by the time of `v.Reverse[v]`, with `v` in cache (at least after the first timing `RepeatedTiming[]` makes). In this scenario, `v.Reverse[v]` is consistently slightly faster than `cReverseDot[]` for longer vectors. The spikes in the ratio around lengths 2^17 and 2^20 appear to be consistent from run to run. The horizontal axis ticks are `Log2[]` of the size.

Time of cReverseDot[Real][v]] divided by the time of v.Reverse[v], with v in cache (at least after the first timing RepeatedTiming[] makes). In this scenario, v.Reverse[v] is consistently slightly faster than cReverseDot[] for longer vectors. The spikes in the ratio around lengths 2^17 and 2^20 appear to be consistent from run to run. The horizontal axis ticks are Log2[] of the size.

Time of `cReverseDot[Real][v]]` divided by the time of `v.Reverse[v]`, with `v` not in cache. Green indicates `cReverseDot[]` was faster, and red indicates `Dot[]` + `Reverse[]` was faster. `cReverseDot[]` is usually faster on shorter vectors, and they're roughly tied on longer ones. The horizontal axis ticks are `Log2[]` of the size.

Time of `cReverseDot[Real][v]]` divided by the time of `v.Reverse[v]`, with `v` in cache (at least after the first timing `RepeatedTiming[]` makes). In this scenario, `v.Reverse[v]` is consistently slightly faster than `cReverseDot[]` for longer vectors. The spikes in the ratio around lengths 2^17 and 2^20 appear to be consistent from run to run. The horizontal axis ticks are `Log2[]` of the size.

Time of cReverseDot[Real][v]] divided by the time of v.Reverse[v], with v not in cache. Green indicates cReverseDot[] was faster, and red indicates Dot[] + Reverse[] was faster. cReverseDot[] is usually faster on shorter vectors, and they're roughly tied on longer ones. The horizontal axis ticks are Log2[] of the size.

Time of cReverseDot[Real][v]] divided by the time of v.Reverse[v], with v in cache (at least after the first timing RepeatedTiming[] makes). In this scenario, v.Reverse[v] is consistently slightly faster than cReverseDot[] for longer vectors. The spikes in the ratio around lengths 2^17 and 2^20 appear to be consistent from run to run. The horizontal axis ticks are Log2[] of the size.

Source Link

answered Apr 29 at 21:48

Michael E2

258.7k
21
370
830

Can I answer the first question?:

What's happening under the hood to explain all this?

Um, no. 2. Well, maybe. 3. IDK, see what you think...

There are many factors: packed arrays, BLAS, SIMD, FMA, vectorization thresholds, memory read & cache, Mma overheads...perhaps the OS/CPU -- plus how a run is timed. I cannot explain it all.

Much of the time in v.Reverse[v] is spent on Reverse[v], at least in a large packed v. Or to use a phrase I'm less comfortable with, the operation is memory bound -- or is it bandwidth? Doing clever things before passing the product off to BLAS adds overheads, and apparently they are sometimes significant. In particular, one expects the overheads to be less significant and the clever things more advantageous when the vector length is below vectorization thresholds, and the reverse if it's longer. Also, in the Compile[] environment, these overheads are much less, except when MainEvaluate[] is invoked.

Some tips:

Avoid copying tensors. This is the problem with Reverse[v] in v.Reverse[v]. State-of-the-art CPUs are optimized for Dot[]. And for copying memory, for that matter. But if they were optimized for reading an array backwards, the time for v.Reverse[v] would be cut nearly in half. (I think.)
Keep the array in cache. I don't know how to control that, but it makes a difference.
Minimize symbolic processing. Symbolic manipulation is a great strength, but it's slow. Even If[], Less[], etc. are a bit slower than I expect. Inside Compile[], they're much faster.
Use Compile[] to minimize overhead. Henrik takes maximum advantage of this in his answer. Not everything can be compiled, though.

Comparison of Henrik's cReverseDot[Real][v] and v.Reverse[v]

I present both uncached and cached timings. The uncached timing is tricky, and someone may have a better idea for measuring it. I need to generate a new vector at each iteration. Afterwards, I subtract the time it takes to generate the new vectors. Due to variability in the run-time as measured by AbsoluteTiming, there is some uncertainty in the results.

$Version (* "14.2.1 for Mac OS X ARM (64-bit) (March 16, 2025)" *) cReverseDot[Real]; (* precompile *) Table[ t2 = (Do[v = RandomReal[10, 2^k]; cReverseDot[Real][v];, Ceiling[1000/k]] // AbsoluteTiming) - (Do[v = RandomReal[10, 2^k];, Ceiling[1000/k]] // AbsoluteTiming) // First; t1 = (Do[v = RandomReal[10, 2^k]; v . Reverse[v];, Ceiling[1000/k]] // AbsoluteTiming) - (Do[v = RandomReal[10, 2^k];, Ceiling[1000/k]] // AbsoluteTiming) // First; t2/t1, {k, 20}] // ListLinePlot[#, PlotRange -> All, MeshFunctions -> {#2 &}, Mesh -> {{1}}, MeshShading -> {Darker@Green, Blend[{Magenta, Red}]}] &

Time of `cReverseDot[Real][v]]` divided by the time of `v.Reverse[v]`, with `v` not in cache. Green indicates `cReverseDot[]` was faster, and red indicates `Dot[]` + `Reverse[]` was faster. `cReverseDot[]` is usually faster on shorter vectors, and they're roughly tied on longer ones. The horizontal axis ticks are `Log2[]` of the size.

Table[ v = Table[RandomReal[10], 2^k]; First@Divide[ cReverseDot[Real][v]; // RepeatedTiming, v . Reverse[v]; // RepeatedTiming ], {k, 24}] // ListLinePlot[#, PlotRange -> All, MeshFunctions -> {#2 &}, Mesh -> {{1}}, MeshShading -> {Darker@Green, Blend[{Magenta, Red}]}] &

Time of `cReverseDot[Real][v]]` divided by the time of `v.Reverse[v]`, with `v` in cache (at least after the first timing `RepeatedTiming[]` makes). In this scenario, `v.Reverse[v]` is consistently slightly faster than `cReverseDot[]` for longer vectors. The spikes in the ratio around lengths 2^17 and 2^20 appear to be consistent from run to run. The horizontal axis ticks are `Log2[]` of the size.

Presumably, Henrik's code executes roughly half the FLOPs that v.Reverse[v] does. It also avoids the copying v that occurs in Reverse[v]. But this algorithmic advantage does not beat out the advantage of vectorization in longer vectors.

Q&A

How can you be sure when v is in the cache? I can't. Maybe someone can. The timing allows me, after the fact, to hazard the inference. I assume that when v is used in a calculation, then it is loaded into cache. If it is used again shortly thereafter, it may still be cached, and we win.
Which of the two methods of timing is the right one? I don't know. Maybe both. It depends whether in your program the data is cached. It depends on whether you want to consider the memory management as part of what needs to be timed. If not, it seems RepeatedTiming[] is more likely to return a timing less influenced by memory reads.

More cached/uncached comparisons

Differences in Reverse[], Plus[], and Times[]:

size = 2^20; Grid[{ v = RandomReal[10, size]; w = RandomReal[10, size]; Reverse[v]; // AbsoluteTiming // First // {"Not cached", Reverse, #} &, Reverse[v]; // AbsoluteTiming // First // {"Cached", Reverse, #} &, v = RandomReal[10, size]; w = RandomReal[10, size]; v + w; // AbsoluteTiming // First // {"Not cached", Plus, #} &, v = RandomReal[10, size]; w = RandomReal[10, size]; v*w; // AbsoluteTiming // First // {"Not cached", Times, #} &, v + w; // AbsoluteTiming // First // {"Cached", Plus, #} &, v*w; // AbsoluteTiming // First // {"Cached", Times, #} &, v + w; // RepeatedTiming // First // {"Cached, rep", Plus, #} &, v*w; // RepeatedTiming // First // {"Cached, rep", Times, #} & }, Alignment -> {Left}]