
I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that, of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken entirely by a single 64-bit integer division! And sadly none of the terms of that division are constant. Can I speed up the division itself? Is there any way I can remove it?

I think I need this division because, since I approximate atan2(y, x) with a 1D lookup table, I need to normalise the distance of the point represented by x,y to something like a unit circle or unit square (I chose a unit 'diamond', which is a unit square rotated by 45°; it gives pretty even precision across the positive quadrant). So the division finds (|y|-|x|) / (|y|+|x|). Note that the divisor is 32-bit while the numerator is a 32-bit number shifted left by 29 bits, so that the result of the division has 29 fractional bits. Also, using floating-point division is not an option, as this function is required not to use floating-point arithmetic.
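To illustrate the normalisation (my own floating-point sketch, for intuition only; the actual function below is all fixed point), the division maps any first-quadrant direction onto a single coordinate p in [0, 1] along the unit diamond:

```c
#include <math.h>

/* Sketch only: the real code uses fixed point throughout.
 * t = (|y|-|x|) / (|y|+|x|) is the one unavoidable division;
 * p = (1 - t) / 2 runs from 0 on the y axis to 1 on the x axis,
 * so it can index a 1D atan LUT for the whole quadrant. */
static double diamond_coord(double y, double x)
{
    double ya = fabs(y), xa = fabs(x);
    double t = (ya - xa) / (ya + xa);
    return (1.0 - t) / 2.0;
}
```

For instance, diamond_coord(1, 1) is exactly 0.5 (the 45° diagonal), while diamond_coord(0, 1) is 1.0 and diamond_coord(1, 0) is 0.0, so p decreases monotonically as the angle grows.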

Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:

    int32_t fpatan2(int32_t y, int32_t x)   // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
    {
        #include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
        const uint32_t outfmt = 32; // final output format in s0.outfmt
        const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp); // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
        const uint32_t halfof = 1UL<<(outfmt-1);    // represents 0.5 in the output format, which since it is in turns means half a circle
        const uint32_t pds=ds-lutsp;    // division shift and post-division shift
        uint32_t lutind, p, t, d;
        int32_t a, b, xa, ya, xs, ys, div, r;

        xs = x >> 31;       // equivalent of fabs()
        xa = (x^xs) - xs;
        ys = y >> 31;
        ya = (y^ys) - ys;
        d = ya+xa;

        if (d==0)           // if both y and x are 0 then they add up to 0 and we must return 0
            return 0;

        // the following does 0.5 * (1. - (y-x) / (y+x))
        // (y+x) is u1.31, (y-x) is s0.31, div is in s1.29
        div = ((int64_t) (ya-xa)<<ds) / d;  // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
        p = ((1UL<<ds) - div) >> 1;     // before shift the format is s2.29. position in u1.29
        lutind = p >> ish;          // index for the LUT
        t = (p & tmask) >> tbd;     // interpolator between two LUT entries
        a = fpatan_lut[lutind];
        b = fpatan_lut[lutind+1];
        r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip);  // linear interpolation of a and b by t in s0.32 format

        // Quadrants
        if (xs)             // if x was negative
            r = halfof - r;     // r = 0.5 - r
        r = (r^ys) - ys;        // if y was negative then r is negated

        return r;
    }
  • How did you determine the time taken by the division? That seems unexpectedly high. Commented Aug 30, 2014 at 17:36
  • For div = ((int64_t) (ya-xa)<<ds) / d, is it necessary to force it to int64? Otherwise it would be evaluated as int32 in presumably 25% of the time. Commented Aug 30, 2014 at 17:39
  • @MichelRouzic Agner Fog's tables. instlatx64.atw.hu is more complete. Commented Aug 30, 2014 at 18:26
  • Have you considered implementing division as reciprocation+multiply? You have the LZCNT instruction to normalize the divisor to the range [1, 2); You can then use a polynomial approximation to produce a reciprocal estimate, then you can use a few quadratic- or cubic-convergence Newton-Raphson iterations to increase your precision to full 64-bits. This might actually be faster depending on how well you can implement it. Commented Aug 30, 2014 at 18:43
  • @MichelRouzic I recently implemented integer division on a VLIW vector processor using only 8x8-bit multiplies. I'm not exactly at liberty to share the code, but for the initial estimate I used hal.archives-ouvertes.fr/docs/00/15/33/70/PDF/newcas07.pdf and the Newton-Raphson setup for quadratic and cubic convergence on the reciprocal given an estimate are in kestrel.soe.ucsc.edu/papers/new2.pdf. The initial approximation given in the Kestrel paper is cute (5 bits accurate), but I've found more value in using the 13-bit-accurate initial estimate above, avoiding two extra NR steps. Commented Aug 30, 2014 at 19:30
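For readers unfamiliar with the iteration mentioned in the comments above, here is a minimal sketch in plain double arithmetic. The linear seed coefficients are my own choice (a minimax fit for a divisor assumed normalised to [1, 2), e.g. via LZCNT), not taken from either linked paper:

```c
#include <math.h>

/* Newton-Raphson reciprocal refinement: r <- r * (2 - d * r).
 * Each step roughly squares the relative error, i.e. doubles the
 * number of correct bits. */
static double nr_recip(double d)        /* d assumed in [1, 2) */
{
    double r = 1.45711 - 0.5 * d;       /* linear seed, ~3-4 correct bits */
    r = r * (2.0 - d * r);              /* ~7 bits */
    r = r * (2.0 - d * r);              /* ~14 bits */
    r = r * (2.0 - d * r);              /* ~28 bits */
    return r;
}
```

A fixed-point version would replace each multiply with an integer multiply plus shift; the convergence behaviour is the same.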

4 Answers


Unfortunately a 70-cycle latency is typical for a 64-bit integer division on x86 CPUs. Floating-point division typically has about half the latency or less. The increased cost comes from the fact that modern CPUs only have dividers in their floating-point execution units (they're very expensive in terms of silicon area), so the integers need to be converted to floating point and back again. Just substituting a floating-point division in place of the integer one therefore isn't likely to help; you'll need to refactor your code to use floating point throughout to take advantage of the faster floating-point division.

If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.
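A sketch of that idea using the SSE intrinsics (my own illustration; note it would not satisfy the no-floating-point constraint stated in the question):

```c
#include <math.h>
#include <xmmintrin.h>

/* RCPSS returns a reciprocal estimate with relative error at most
 * 1.5 * 2^-12; one Newton-Raphson step refines it to roughly
 * single-precision accuracy, still much cheaper than a divide. */
static float fast_recip(float d)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(d)));
    return r * (2.0f - d * r);  /* one refinement step */
}
```

Division then becomes a multiply: y / x ≈ y * fast_recip(x).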


1 Comment

Thanks! Just for the record the whole function HAS to be fixed point with no floating point, regardless of performance.

Based on @Iwillnotexist Idonotexist's suggestion to use lzcnt, reciprocation and multiplication I implemented a division function that runs in about 23.3 cycles, with a pretty great precision of 1 part in 19 million using a 1.5 kB LUT. For example, in one of the worst cases, 1428769848 / 1080138864, you get 1.3227648959 instead of 1.3227649663.

I figured out an interesting technique while researching this. I was really struggling to think of something that could be fast and precise enough, as not even a quadratic approximation of 1/x on [0.5, 1.0) combined with an interpolated difference LUT would do. Then I had the idea of doing it the other way around: I made a lookup table that contains the quadratic coefficients fitting the curve on each short segment representing 1/128th of the [0.5, 1.0) range, which gives a very small error. Using the 7 most significant bits of what represents x in the [0.5, 1.0) range as a LUT index, I directly get the coefficients that work best for the segment that x falls into.
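The per-segment fit can be sketched in floating point like this (a hypothetical tablegen-style construction using three-point interpolation; the actual tablegen.exe may choose its coefficients differently):

```c
#include <math.h>

/* Fit c2*x^2 + c1*x + c0 to 1/x on one segment [a, b] by interpolating
 * at the endpoints and midpoint (Lagrange form expanded to monomials).
 * With segments 1/256 wide, i.e. 1/128th of [0.5, 1.0), the worst-case
 * error lands in the few-times-1e-8 range, consistent with the
 * precision figure quoted above. */
static void fit_recip_segment(double a, double b, double c[3])
{
    double m = 0.5 * (a + b);
    double fa = 1.0 / a, fm = 1.0 / m, fb = 1.0 / b;
    double d0 = (a - m) * (a - b);
    double d1 = (m - a) * (m - b);
    double d2 = (b - a) * (b - m);
    c[2] = fa / d0 + fm / d1 + fb / d2;
    c[1] = -(fa * (m + b) / d0 + fm * (a + b) / d1 + fb * (a + m) / d2);
    c[0] = fa * m * b / d0 + fm * a * b / d1 + fb * a * m / d2;
}

/* Scan one segment to find the worst absolute error of the fit. */
static double max_segment_error(double a, double b)
{
    double c[3], err = 0.0;
    for (int i = 0; i <= 1000; i++) {
        double x, e;
        if (i == 0)
            fit_recip_segment(a, b, c);
        x = a + (b - a) * i / 1000.0;
        e = fabs((c[2] * x + c[1]) * x + c[0] - 1.0 / x);
        if (e > err)
            err = e;
    }
    return err;
}
```

The real table would store these coefficients rounded to the chosen fixed-point format (prec=27 here), which adds only rounding-level error on top of the fit error.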

Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:

    #include "ffo_lut.h"

    static INLINE int32_t log2_ffo32(uint32_t x)    // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
    {
        int32_t y;

        y = x>>21;  if (y)  return ffo_lut[y]+21;
        y = x>>10;  if (y)  return ffo_lut[y]+10;
        return ffo_lut[x];
    }

    // Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
    // The caller must make sure not to divide by 0. Division by 0 causes a crash by negative index table lookup
    static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt)   // ~23.3 cycles, max error (by division) 53.39e-9
    {
        #include "fpdiv.h"  // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
        const int32_t *c;
        int32_t xa, xs, p, sh;
        uint32_t expon, frx, lutind;
        const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1);  // the shift for the index, the shift for 31-bit xa, the value of 0.5
        int64_t out;
        int64_t c0, c1, c2;

        // turn x into xa (|x|) and sign of x (xs)
        xs = x >> 31;
        xa = (x^xs) - xs;

        // decompose |x| into frx * 2^expon
        expon = log2_ffo32(xa);
        frx = (xa << (31-expon)) >> cfs;    // the fractional part is now in 0.27 format

        // lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
        lutind = (frx - half) >> ish;   // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
        lutind *= 3;            // 3 entries for each index
        c = &fpdiv_lut[lutind]; // c points to the correct c0, c1, c2
        c0 = c[0];
        c1 = c[1];
        c2 = c[2];
        p = (int64_t) frx * frx >> prec;    // x^2
        p = c2 * p >> prec;         // c2 * x^2
        p += c1 * frx >> prec;      // + c1 * x
        p += c0;                    // + c0, p = (1.0 , 2.0] in 2.27 format

        // apply the necessary bit shifts and reapply the original sign of x to make the final result
        sh = expon + prec - outfmt; // calculates the final needed shift
        out = (int64_t) y * p;      // format is s31 + 1.27 = s32.27
        if (sh >= 0)
            out >>= sh;
        else
            out <<= -sh;
        out = (out^xs) - xs;        // if x was negative then out is negated

        return out;
    }

I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.

As for the fpatan2() question the solution would be to replace this line:

div = ((int64_t) (ya-xa)<<ds) / d; 

with this one:

div = fpdiv(ya-xa, d, ds); 

2 Comments

Did you try implementing Netch's suggestion to get the compiler to do a 64b/32b => 32b division? Agner Fog's tables say idiv r32 on AMD Piledriver is still 13-40c latency, rather than 13-71c. But on Skylake, idiv r32 is 26c latency, one per 6c throughput. Haswell has not quite as good throughput, but still partially pipelined.
No, I didn't wish to use assembly.

Your time-hog instruction:

div = ((int64_t) (ya-xa)<<ds) / d; 

exposes at least two issues. The first is that you mask the builtin div function; but this is a minor point that may never be observed. The second is that, according to C language rules, both operands are first converted to a common type, which is int64_t, and then the division for this type is expanded into a CPU instruction which divides a 128-bit dividend by a 64-bit divisor(!). Extract from the assembly of a cut-down version of your function:

      21:	48 89 c2             	mov    %rax,%rdx
      24:	48 c1 fa 3f          	sar    $0x3f,%rdx        ## this is sign bit extension
      28:	48 f7 fe             	idiv   %rsi

Yep, this division requires about 70 cycles and can't be optimized (well, really it can, but e.g. the reciprocal-divisor approach requires a multiplication with a 192-bit product). But if you are sure this division can be done with a 64-bit dividend and a 32-bit divisor and it won't overflow (the quotient will fit into 32 bits; I agree, because ya-xa is always smaller in absolute value than ya+xa), it can be sped up using an explicit assembly request:

    uint64_t tmp_num = ((int64_t) (ya-xa))<<ds;
    asm("idivl %[d]"
        : [a] "=a" (div1)
        : "[a]" (tmp_num), "d" (tmp_num >> 32), [d] "q" (d)
        : "cc");

This is quick-and-dirty and should be carefully verified, but I hope the idea is clear. The resulting assembly now looks like:

      18:	48 98                	cltq
      1a:	48 c1 e0 1d          	shl    $0x1d,%rax
      1e:	48 89 c2             	mov    %rax,%rdx
      21:	48 c1 ea 20          	shr    $0x20,%rdx
      27:	f7 f9                	idiv   %ecx

This seems to be a huge advance, because a 64/32 division requires up to 25 clock cycles on the Core family, according to the Intel optimization manual, instead of the 70 you see for a 128/64 division.

More minor improvements can be added; e.g. the shifts can be done more economically in parallel:

    uint32_t diff = ya - xa;
    uint32_t lowpart = diff << 29;
    uint32_t highpart = diff >> 3;
    asm("idivl %[d]"
        : [a] "=a" (div1)
        : "[a]" (lowpart), "d" (highpart), [d] "q" (d)
        : "cc");

which results in:

      18:	89 d0                	mov    %edx,%eax
      1a:	c1 e0 1d             	shl    $0x1d,%eax
      1d:	c1 ea 03             	shr    $0x3,%edx
      22:	f7 f9                	idiv   %ecx

but this is a minor fix compared to the division-related one.

To conclude, I really doubt this routine is worth implementing in C. The language is quite uneconomical for this kind of integer arithmetic, forcing useless widenings and discarding high parts. The whole routine is worth moving to assembler.

Comments


Given an fpatan() implementation, you could simply implement fpatan2() in terms of that.

Assuming constants defined for pi and pi/2:

    int32_t fpatan2( int32_t y, int32_t x )
    {
        fixed theta ;

        if( x == 0 )
        {
            theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
        }
        else
        {
            theta = fpatan( y / x ) ;
            if( x < 0 )
            {
                theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
            }
        }

        return theta ;
    }

Note that fixed-point library implementations are easy to get very wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library under discussion makes the code much simpler; in most cases you can just replace the float or double keyword with fixed. It does not, however, have an atan2() implementation; the code above is adapted from my implementation for that library.

2 Comments

I think the y / x would mean that it's not going to perform any better, because it's still performing an integer division.
@Clifford y / x isn't going to work in fixed point: just imagine if x isn't dwarfed by y, the result of that division is going to be a single-digit integer. Also, all your approach does is deal with quadrants, and my code already does that in a fast and branchless way :). As for getting the implementation right, it isn't so hard; you can test your function billions of times against atan2(), which I did (I get a max error of 0.07667 arc seconds with a 4 kB LUT).
