$\begingroup$

What is the most efficient way to compute v.Reverse[v] or ListConvolve[v,v]? In particular, since most of the products occur twice, could there be a 2x speed improvement? My attempt:

f1[v_] := v . Reverse[v]
f2[v_] := First@ListConvolve[v, v]

f3[v_] := f3i[v, Length[v]]
f3i[v_, s_?EvenQ] := With[{hs = s/2}, (2 Take[v, hs]) . Reverse[Take[v, -hs]]]
f3i[v_, s_?OddQ] := f3i[v, s - 1] + v[[(s + 1)/2]]^2

f4[v_] := f4i[v, Length[v]]
f4i[v_, s_?EvenQ] := With[{hs = s/2}, First@ListConvolve[2 Take[v, hs], Take[v, -hs]]]
f4i[v_, s_?OddQ] := f4i[v, s - 1] + v[[(s + 1)/2]]^2

First, a sanity check:

v = {{}, {a}, {a, b}, {a, b, c}, {a, b, c, d}};
{f1[#], f1[#] === f2[#] === f3[#] === f4[#]} & /@ v

{{0, True}, {a^2, True}, {2 a b, True}, {b^2 + 2 a c, True}, {2 b c + 2 a d, True}}

Timing on symbols:

v = Table[StringForm["a``", i], {i, 5000000}];
First@RepeatedTiming[f1[v]]
First@RepeatedTiming[f2[v]]
First@RepeatedTiming[f3[v]]
First@RepeatedTiming[f4[v]]

2.16298 $\quad$ (plain v.Reverse[v])

2.48454 $\quad$ (plain ListConvolve[v,v])

1.56501 $\quad$ (modified v.Reverse[v])

1.43643 $\quad$ (modified ListConvolve[v,v])

Not a 2x improvement, but an improvement nonetheless! Notice how plain v.Reverse[v] beats ListConvolve[v,v], but the opposite is true with the modified versions. Why?

Timing on integers:

v = Table[RandomInteger[100000], 5000000];
First@RepeatedTiming[f1[v]]
First@RepeatedTiming[f2[v]]
First@RepeatedTiming[f3[v]]
First@RepeatedTiming[f4[v]]

0.00252993 $\quad$ (plain v.Reverse[v])

0.563211 $\quad$ (plain ListConvolve[v,v])

0.00335379 $\quad$ (modified v.Reverse[v])

0.287567 $\quad$ (modified ListConvolve[v,v])

With integers, v.Reverse[v] dramatically beats ListConvolve[v,v], but trying to improve v.Reverse[v] just makes it worse; on the other hand, the sought 2x improvement on ListConvolve[v,v] seems achievable in this case.

Timing on reals:

v = Table[RandomReal[], 5000000];
First@RepeatedTiming[f1[v]]
First@RepeatedTiming[f2[v]]
First@RepeatedTiming[f3[v]]
First@RepeatedTiming[f4[v]]

0.00255105 $\quad$ (plain v.Reverse[v])

0.0221051 $\quad$ (plain ListConvolve[v,v])

0.00357253 $\quad$ (modified v.Reverse[v])

0.0133805 $\quad$ (modified ListConvolve[v,v])

We see that v.Reverse[v] runs at the same speed for integers and reals (and again the attempted improvement only makes it worse), but ListConvolve[v,v] is much faster on reals than on integers and can again be improved, although the improvement is no longer 2x.

Question/Challenge: What's happening under the hood to explain all this? Can you construct a variant of v.Reverse[v] that's twice as fast on integers and reals?

$\endgroup$
  • $\begingroup$ Try f4i with With[{hs = s/2}, 2 First@ListConvolve[Take[v, hs], Take[v, -hs]]]. Similarly for f3i. $\endgroup$ Commented Apr 27 at 18:36
  • $\begingroup$ Ah yes, pulling out the 2 increases performance! On symbolic expressions this introduces the typically insignificant distinction 2(bc+ad) versus 2bc+2ad, while on numeric expressions a possible change in roundoff doesn't matter, because the various versions above most likely already disagreed in that regard anyway. $\endgroup$ Commented May 7 at 17:07
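Written out in full, the first comment's suggestion amounts to the following (the names f3b and f4b are mine, not from the thread — a sketch of the factored-out-2 variants, untimed here):

```mathematica
(* Factor the constant 2 out of the dot product / convolution,
   per the comment above, so the kernel multiplies one array less. *)
f3b[v_] := f3bi[v, Length[v]]
f3bi[v_, s_?EvenQ] := With[{hs = s/2}, 2 (Take[v, hs] . Reverse[Take[v, -hs]])]
f3bi[v_, s_?OddQ] := f3bi[v, s - 1] + v[[(s + 1)/2]]^2

f4b[v_] := f4bi[v, Length[v]]
f4bi[v_, s_?EvenQ] := With[{hs = s/2}, 2 First@ListConvolve[Take[v, hs], Take[v, -hs]]]
f4bi[v_, s_?OddQ] := f4bi[v, s - 1] + v[[(s + 1)/2]]^2
```

On symbolic input these return 2(bc + ad)-style results rather than 2bc + 2ad, as the second comment notes.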

2 Answers

$\begingroup$

For standard types like Real, Integer, and Complex you can use Compile to create optimized versions.

Here is a variant that does that for all three types.

ClearAll[cReverseDot];
cReverseDot[type : Real | Integer | Complex] := cReverseDot[type] =
  With[{
    T = Blank[type],
    init = Switch[type, Integer, 0, Real, 0., Complex, 0. + 0. I],
    two = Switch[type, Integer, 2, Real, 2., Complex, 2.]
    },
   Compile[{{v, T, 1}},
    Module[{n, sum, k},
     n = Length[v];
     k = Quotient[n, 2];
     sum = If[n == 2 k, init,
       Compile`GetElement[v, k + 1] Compile`GetElement[v, k + 1]];
     Do[
      sum += two (Compile`GetElement[v, i] Compile`GetElement[v, n + 1 - i]),
      {i, k, 1, -1}
      ];
     sum
     ],
    CompilationTarget -> "C",
    RuntimeAttributes -> {Listable},
    Parallelization -> True,
    RuntimeOptions -> "Speed"
    ]
   ];

The first run in each Mathematica session will take longer because the function has to be compiled once. Afterwards it is often 10-20 times faster than ordinary Mathematica code.

Here is a usage example that highlights that the function is also Listable to some extent:

m = 1000000;
n = 10;
X = RandomReal[{-5, 5}, {m, n}];
Yc = cReverseDot[Real][X]; // AbsoluteTiming // First
Y1 = f1 /@ X; // AbsoluteTiming // First
Max[Abs[Y1 - Yc]]/Max[Abs[Y1]]

0.023744

0.593634

3.15487*10^-16

So cReverseDot is about 25 times faster when applied to many short vectors. It won't be better on very long vectors, though.

If the vector length is known already at compile time and if it is not too large, then one can improve on this as follows:

ClearAll[cReverseDotFixedSize];
cReverseDotFixedSize[type : Real | Integer | Complex, size_Integer] :=
 cReverseDotFixedSize[type, size] =
  With[{
    T = Blank[type],
    init = Switch[type, Integer, 0, Real, 0., Complex, 0. + 0. I],
    two = Switch[type, Integer, 2, Real, 2., Complex, 2.],
    k = Quotient[size, 2],
    n = size
    },
   If[OddQ[n],
    Compile[{{v, T, 1}},
     Module[{sum},
      sum = Compile`GetElement[v, k + 1] Compile`GetElement[v, k + 1];
      Do[
       sum += two (Compile`GetElement[v, i] Compile`GetElement[v, n + 1 - i]),
       {i, k, 1, -1}
       ];
      sum
      ],
     CompilationTarget -> "C", RuntimeAttributes -> {Listable},
     Parallelization -> True, RuntimeOptions -> "Speed"
     ],
    Compile[{{v, T, 1}},
     Module[{sum},
      sum = init;
      Do[
       sum += two (Compile`GetElement[v, i] Compile`GetElement[v, n + 1 - i]),
       {i, k, 1, -1}
       ];
      sum
      ],
     CompilationTarget -> "C", RuntimeAttributes -> {Listable},
     Parallelization -> True, RuntimeOptions -> "Speed"
     ]
    ]
   ];

Now

Yc = cReverseDot[Real][X]; // AbsoluteTiming // First
Yf = cReverseDotFixedSize[Real, n][X]; // AbsoluteTiming // First
Max[Abs[Yf - Yc]]/Max[Abs[Yc]]

0.030696

0.007259

0.

So the fixed-size version is about 4.2 times faster. This is because a fixed vector size allows the compiler to emit shorter code with fewer control constructs, and it also makes the code easier to vectorize.

$\endgroup$
  • $\begingroup$ It may interest the OP, and maybe others, that # . Reverse[#] & /@ X is quite a bit faster than f1 /@ X (auto-compilation). $\endgroup$ Commented Apr 27 at 17:16
  • $\begingroup$ Oh, indeed. I was not aware of that! =D $\endgroup$ Commented Apr 27 at 21:06
$\begingroup$

Can I answer the first question?:

What's happening under the hood to explain all this?

  1. Um, no. 2. Well, maybe. 3. IDK, see what you think...

There are many factors: packed arrays, BLAS, SIMD, FMA, vectorization thresholds, memory reads & cache, Mma overheads... perhaps the OS/CPU -- plus how a run is timed. I cannot explain it all.

Much of the time in v.Reverse[v] is spent on Reverse[v], at least in a large packed v. Or to use a phrase I'm less comfortable with, the operation is memory bound -- or is it bandwidth? Doing clever things before passing the product off to BLAS adds overheads, and apparently they are sometimes significant. In particular, one expects the overheads to be less significant and the clever things more advantageous when the vector length is below vectorization thresholds, and the reverse if it's longer. Also, in the Compile[] environment, these overheads are much less, except when MainEvaluate[] is invoked.
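A quick way to check the claim that Reverse[v] carries much of the cost is to time the reversal by itself (the exact numbers are machine-dependent, so I give none here):

```mathematica
v = RandomReal[1, 2^24];               (* large packed array *)
First@RepeatedTiming[Reverse[v];]      (* cost of the reversal/copy alone *)
First@RepeatedTiming[v . Reverse[v];]  (* reversal plus the BLAS dot *)
```

On typical hardware the first timing should be a substantial fraction of the second, consistent with the operation being limited by memory traffic rather than arithmetic.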

Some tips:

  • Avoid copying tensors. This is the problem with Reverse[v] in v.Reverse[v]. State-of-the-art CPUs are optimized for Dot[]. And for copying memory, for that matter. But if they were optimized for reading an array backwards, the time for v.Reverse[v] would be cut nearly in half. (I think.)
  • Keep the array in cache. I don't know how to control that, but it makes a difference.
  • Minimize symbolic processing. Symbolic manipulation is a great strength, but it's slow. Even If[], Less[], etc. are a bit slower than I expect. Inside Compile[], they're much faster.
  • Use Compile[] to minimize overhead. Henrik takes maximum advantage of this in his answer. Not everything can be compiled, though.

Comparison of Henrik's cReverseDot[Real][v] and v.Reverse[v]

I present both uncached and cached timings. The uncached timing is tricky, and someone may have a better idea for measuring it. I need to generate a new vector at each iteration. Afterwards, I subtract the time it takes to generate the new vectors. Due to variability in the run-time as measured by AbsoluteTiming, there is some uncertainty in the results.

$Version
(* "14.2.1 for Mac OS X ARM (64-bit) (March 16, 2025)" *)

cReverseDot[Real]; (* precompile *)
Table[
   t2 = (Do[v = RandomReal[10, 2^k]; cReverseDot[Real][v];, Ceiling[1000/k]] // AbsoluteTiming) -
        (Do[v = RandomReal[10, 2^k];, Ceiling[1000/k]] // AbsoluteTiming) // First;
   t1 = (Do[v = RandomReal[10, 2^k]; v . Reverse[v];, Ceiling[1000/k]] // AbsoluteTiming) -
        (Do[v = RandomReal[10, 2^k];, Ceiling[1000/k]] // AbsoluteTiming) // First;
   t2/t1, {k, 20}] //
  ListLinePlot[#, PlotRange -> All, MeshFunctions -> {#2 &}, Mesh -> {{1}},
    MeshShading -> {Darker@Green, Blend[{Magenta, Red}]}] &

[Plot: timing ratio vs. Log2 of vector length, uncached case.]

Time of cReverseDot[Real][v] divided by the time of v.Reverse[v], with v not in cache. Green indicates cReverseDot[] was faster, and red indicates Dot[] + Reverse[] was faster. cReverseDot[] is usually faster on shorter vectors, and they're roughly tied on longer ones. The horizontal axis ticks are Log2[] of the size.

Table[
   v = Table[RandomReal[10], 2^k];
   First@Divide[
     cReverseDot[Real][v]; // RepeatedTiming,
     v . Reverse[v]; // RepeatedTiming
     ], {k, 24}] //
  ListLinePlot[#, PlotRange -> All, MeshFunctions -> {#2 &}, Mesh -> {{1}},
    MeshShading -> {Darker@Green, Blend[{Magenta, Red}]}] &

[Plot: timing ratio vs. Log2 of vector length, cached case.]

Time of cReverseDot[Real][v] divided by the time of v.Reverse[v], with v in cache (at least after the first timing that RepeatedTiming[] makes). In this scenario, v.Reverse[v] is consistently slightly faster than cReverseDot[] for longer vectors. The spikes in the ratio around lengths 2^17 and 2^20 appear to be consistent from run to run. The horizontal axis ticks are Log2[] of the size.

Presumably, Henrik's code executes roughly half the FLOPs that v.Reverse[v] does. It also avoids the copy of v that Reverse[v] makes. But this algorithmic advantage does not beat out the advantage of vectorization on longer vectors.

Q&A

  1. How can you be sure when v is in the cache? I can't. Maybe someone can. The timing allows me, after the fact, to hazard the inference. I assume that when v is used in a calculation, then it is loaded into cache. If it is used again shortly thereafter, it may still be cached, and we win.

  2. Which of the two methods of timing is the right one? I don't know. Maybe both. It depends on whether the data is cached in your program, and on whether you want to count the memory management as part of what is being timed. If not, RepeatedTiming[] seems more likely to return a timing less influenced by memory reads.

More cached/uncached comparisons

Differences in Reverse[], Plus[], and Times[]:

size = 2^20;
Grid[{
   v = RandomReal[10, size]; w = RandomReal[10, size];
   Reverse[v]; // AbsoluteTiming // First // {"Not cached", Reverse, #} &,
   Reverse[v]; // AbsoluteTiming // First // {"Cached", Reverse, #} &,
   v = RandomReal[10, size]; w = RandomReal[10, size];
   v + w; // AbsoluteTiming // First // {"Not cached", Plus, #} &,
   v = RandomReal[10, size]; w = RandomReal[10, size];
   v*w; // AbsoluteTiming // First // {"Not cached", Times, #} &,
   v + w; // AbsoluteTiming // First // {"Cached", Plus, #} &,
   v*w; // AbsoluteTiming // First // {"Cached", Times, #} &,
   v + w; // RepeatedTiming // First // {"Cached, rep", Plus, #} &,
   v*w; // RepeatedTiming // First // {"Cached, rep", Times, #} &
   }, Alignment -> {Left}]
$\endgroup$
