Fast inverse square of double in C/C++

Question

Recently I was profiling a program in which the hotspot is definitely this

double d = somevalue(); double d2=d*d; double c = 1.0/d2 // HOT SPOT

The value d2 is not used after because I only need value c. Some time ago I've read about the Carmack method of fast inverse square root, this is obviously not the case but I'm wondering if a similar algorithms can help me computing 1/x^2.

I need quite accurate precision, I've checked that my program doesn't give correct results with gcc -ffast-math option. (g++-4.5)

@SigTerm: The compiler is already capable of doing that step for you, doing simple substitutions almost never has an impact on modern compilers. (A modern compiler will typically translate both into the same IR, more or less.) — Dietrich Epp
– Dietrich Epp, Commented Mar 16, 2012 at 12:06
@scibuff: The pow function is going to be slower than division. Actually, the exact same assembly code will be emitted for 1.0/(x*x) and pow(x, -2) when GCC uses -ffast-math. — Dietrich Epp
– Dietrich Epp, Commented Mar 16, 2012 at 12:09
It's all very nice, but has anyone even noticed the question isn't about inverse square roots but just about inverse squares? — user555045
– user555045, Commented Mar 16, 2012 at 12:18
The computation is inside a double nested for loop and it takes 50% of time. That double nested loops is the core of the program and it takes 80% of it (it's a pairwise potential field computation). — linello
– linello, Commented Mar 16, 2012 at 12:27

Dietrich Epp · Accepted Answer · 2012-03-16 13:41:19Z

The tricks for doing fast square roots and the like get their performance by sacrificing precision. (Well, most of them.)

Are you sure you need double precision? You can sacrifice precision easily enough:
```
double d = somevalue(); float c = 1.0f / ((float) d * (float) d); 
```
The 1.0f is absolutely mandatory in this case, if you use 1.0 instead you will get double precision.
Have you tried enabling "sloppy" math on your compiler? On GCC you can use -ffast-math, there are similar options for other compilers. The sloppy math may be more than good enough for your application. (Edit: I did not see any difference in the resulting assembly.)
If you are using GCC, have you considered using -mrecip? There is a "reciprocal estimate" function which only has about 12 bits of precision, but it is much faster. You can use the Newton-Raphson method to increase the precision of the result. The -mrecip option will cause the compiler to automatically generate the reciprocal estimate and Newton-Raphson steps for you, although you can always write the assembly yourself if you want to fine tune the performance-precision trade-off. (Newton-Raphson converges very quickly.) (Edit: I was unable to get GCC to generate RCPSS. See below.)

I found a blog post (source) discussing the exact problem you are going through, and the author's conclusion is that the techniques like the Carmack method are not competitive with the RCPSS instruction (which the -mrecip flag on GCC uses).

The reason why division can be so slow is because processors generally only have one division unit and it's often not pipelined. So, you can have a few multiplications in the pipe all executing simultaneously, but no division can be issued until the previous division finishes.

Tricks that don't work

Carmack's method: It is obsolete on modern processors, which have reciprocal estimation opcodes. For reciprocals, the best version I've seen only gives one bit of precision -- nothing compared to the 12 bits of RCPSS. I think it is a coincidence that the trick works so well for reciprocal square roots; a coincidence that is unlikely to be repeated.
Relabeling variables. As far as the compiler is concerned, there is very little difference between 1.0/(x*x) and double x2 = x*x; 1.0/x2. I would be surprised if you found a compiler that generates different code for the two versions with optimizations turned on even to the lowest level.
Using pow. The pow library function is a total monster. With GCC's -ffast-math turned off, the library call is fairly expensive. With GCC's -ffast-math turned on, you get the exact same assembly code for pow(x, -2) as you do for 1.0/(x*x), so there is no benefit.

Update

Here is an example of a Newton-Raphson approximation for the inverse square of a double-precision floating-point value.

static double invsq(double x) { double y; int i; __asm__ ( "cvtpd2ps %1, %0\n\t" "rcpss %0, %0\n\t" "cvtps2pd %0, %0" : "=x"(y) : "x"(x)); for (i = 0; i < RECIP_ITER; ++i) y *= 2 - x * y; return y * y; }

Unfortunately, with RECIP_ITER=1 benchmarks on my computer put it slightly slower (~5%) than the simple version 1.0/(x*x). It's faster (2x as fast) with zero iterations, but then you only get 12 bits of precision. I don't know if 12 bits is enough for you.

I think one of the problems here is that this is too small of a micro-optimization; at this scale the compiler writers are on nearly equal footing with the assembly hackers. Maybe if we had the bigger picture we could see a way to make it faster.

For example, you said that -ffast-math caused an undesirable loss of precision; this may indicate a numerical stability problem in the algorithm you are using. With the right choice of algorithm, many problems can be solved with float instead of double. (Of course, you may just need more than 24 bits. I don't know.)

I suspect the RCPSS method shines if you want to compute several of these in parallel.

Thanks, I noticed that pow function is very low, it's definitely on of the hotspots I'm removing from my code (when possible). It seems that -mrecip doesn't any optimization (I just enabled -O3)
Have you profiled that? I did actually write a little tester for Carmack's method comparing pow(x, -0.5f), 1.0f/sqrt(x) and Carmack's method, and the latter is always faster by quite a lot. With -ffast-math, the first two are equally fast, and without, pow is the slowest. (Using GCC 4.6.3, -O3 and -march=native.)
Yes my code is profile through cachegrind and visualized with Kcachegrind. is there that I saw the hotspot, but I didn't already convert my double to float before applying -mrecip. the problem is that here using ffast-math my results are very inaccurate.
@linello: Sorry, I was talking to Dietrich Epp, the post's author.
@KerrekSB: I'm trying to say that Carmack's method, while faster than naive methods like pow and 1.0/sqrt(x), is (1) slower than using a reciprocal estimate opcode for a given level of accuracy and (2) less useful for computing 1.0/(x*x) than 1.0/sqrt(x).

Kerrek SB · Accepted Answer · 2012-03-16 12:09:46Z

Yes, you can certainly try and work something out. Let me just give you some general ideas, you can fill in the details.

First, let's see why Carmack's root works:

We write x = M × 2^E in the usual way. Now recall that the IEEE float stores the exponent offset by a bias: If e denoted the exponent field, we have e = Bias + E ≥ 0. Rearranging, we get E = e − Bias.

Now for the inverse square root: x^−1/2 = M^-1/2 × 2^−E/2. The new exponent field is:

e' = Bias − E/2 = 3/2 Bias − e/2

With bit fiddling, we can get the value e/2 from e by shifting, and 3/2 Bias is just a constant.

Moreover, the mantissa M is stored as 1.0 + x with x < 1, and we can approximate M^-1/2 as 1 + x/2. Again, the fact that only x is stored in binary means that we get the division by two by simple bit shifting.

Now we look at x⁻²: this is equal to M⁻² × 2^{−2 E}, and we are looking for an exponent field:

e' = Bias − 2 E = 3 Bias − 2 e

Again, 3 Bias is just a constant, and you can get 2 e from e by bitshifting. As for the mantissa, you can approximate (1 + x)⁻² by 1 − 2 x, and so the problem reduces to obtaining 2 x from x.

Note that Carmack's magic floating point fiddling doesn't actually compute the result right aaway: Rather, it produces a remarkably accurate estimate, which is used as the starting point for a traditional, iterative computation. But because the estimate is so good, you only need very few rounds of subsequent iteration to get an acceptable result.

mcdowella · Accepted Answer · 2012-03-16 18:21:34Z

For your current program you have identified the hotspot - good. As an alternative to speeding up 1/d^2, you have the option of changing the program so that it does not compute 1/d^2 so often. Can you hoist it out of an inner loop? For how many different values of d do you compute 1/d^2? Could you pre-compute all the values you need and then look up the results? This is a bit cumbersome for 1/d^2, but if 1/d^2 is part of some larger chunk of code, it might be worthwhile applying this trick to that. You say that if you lower the precision, you don't get good enough answers. Is there any way you can rephrase the code, that might provide better behaviour? Numerical analysis is subtle enough that it might be worth trying a few things and seeing what happened.

Ideally, of course, you would find some optimised routine that draws on years of research - is there anything in lapack or linpack that you could link to?

Thank for your very clever answer, but unfortunately the 1/d^2 has to be computed and updated and updated inside the inner loop, so no further simple optimization can be done. The range of values d takes is then far too big to precompute in a simple way.

Collectives™ on Stack Overflow

Fast inverse square of double in C/C++

3 Answers 3

Tricks that don't work

Update

9 Comments

Comments

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Tricks that don't work

Update

9 Comments

Comments

1 Comment

Linked

Related