I am using the SoftFloat library (http://www.jhauser.us/arithmetic/SoftFloat.html) to implement a single-precision division algorithm. I am trying to understand the reciprocal approximation function implemented as part of the SoftFloat library. Please see the code below. Could anyone explain how they came up with the LUT? It looks like a combination of a LUT and Newton-Raphson (NR) approximations, but a detailed explanation would definitely help.

/* Returns an approximation to the reciprocal of the number represented by
   `a', where `a' is interpreted as an unsigned fixed-point number with one
   integer bit and 31 fraction bits. The `a' input must be "normalized",
   meaning that its most-significant bit (bit 31) must be 1. Thus, if A is
   the value of the fixed-point interpretation of `a', then 1 <= A < 2. The
   returned value is interpreted as a pure unsigned fraction, having no
   integer bits and 32 fraction bits. The approximation returned is never
   greater than the true reciprocal 1/A, and it differs from the true
   reciprocal by at most 2.006 ulp (units in the last place). */
uint32_t softfloat_approxRecip32_1( uint32_t a )
{
    int index;
    uint16_t eps;
    static const uint16_t k0s[] = {
        0xFFC4, 0xF0BE, 0xE363, 0xD76F, 0xCCAD, 0xC2F0, 0xBA16, 0xB201,
        0xAA97, 0xA3C6, 0x9D7A, 0x97A6, 0x923C, 0x8D32, 0x887E, 0x8417
    };
    static const uint16_t k1s[] = {
        0xF0F1, 0xD62C, 0xBFA1, 0xAC77, 0x9C0A, 0x8DDB, 0x8185, 0x76BA,
        0x6D3B, 0x64D4, 0x5D5C, 0x56B1, 0x50B6, 0x4B55, 0x4679, 0x4211
    };
    uint16_t r0;
    uint32_t delta0;
    uint_fast32_t r;
    uint32_t sqrDelta0;

    index = a>>27 & 0xF;
    eps = (uint16_t) (a>>11);
    r0 = k0s[index] - ((k1s[index] * (uint_fast32_t) eps)>>20);
    delta0 = ~(uint_fast32_t) ((r0 * (uint_fast64_t) a)>>7);
    r = ((uint_fast32_t) r0<<16) + ((r0 * (uint_fast64_t) delta0)>>24);
    sqrDelta0 = ((uint_fast64_t) delta0 * delta0)>>32;
    r += ((uint32_t) r * (uint_fast64_t) sqrDelta0)>>48;
    return r;
}
  • What are the input and output specifications for softfloat_approxRecip32_1()? For an input of 0x40000000 I get an output of 0xab0f9c81 which doesn't make much sense to me in the context of a reciprocal computation (assuming either fixed-point or floating-point computation). That said, the usage of the two tables seems to suggest that piecewise linear approximations are used. Commented Sep 16, 2015 at 21:06
  • Plugging in a bunch of values, it seems like the input is an unsigned 1.31 fixed-point number, and the result is a 0.32 fixed-point number, that is, a purely fractional number. E.g., an input of 0xc0000000 (= 3/2) leads to an output of 0xaaaaaaaa (= 2/3). My limited experiments suggest that the function provides an underestimate of the true mathematical result, in that all results appear to be less than, or equal to, the mathematical result. Commented Sep 16, 2015 at 21:16
  • Have added comment on the code. Please see. Commented Sep 16, 2015 at 21:45
  • I am more interested on how they arrived at the LUT and internal scale factors. Commented Sep 16, 2015 at 21:47
  • Reverse engineering further, the k0s and k1s are the intercepts and slopes of linear approximations, each stored in 0.16 fixed-point format, that is, as purely fractional values. This gives r0. From r0 it computes r with one NR iteration r0 + r0 * (1 - r0 * a), with delta0 = (1 - r0 * a). Not sure where the sqrDelta0 comes in; it could be for the next term in a series expansion? Commented Sep 16, 2015 at 21:53

1 Answer

The initial approximation r0 is computed by piece-wise linear approximation, using sixteen intervals, from [1, 17/16) to [31/16, 2), selected by the four most significant fraction bits of the 1.31 fixed-point argument. The initial estimate is then refined using the generalized Newton iteration for the reciprocal, r_new = r_old + r_old * (1 - a * r_old) + r_old * (1 - a * r_old)^2 + ... + r_old * (1 - a * r_old)^k [see the paper by Liddicoat and Flynn]. delta0 is (1 - a * r0). The first three terms of the expansion are used: r = r0 + r0 * delta0 + r0 * delta0^2. This iteration has cubic convergence, tripling the number of correct bits in every iteration. In this implementation, the worst-case relative error in r0 is about 9.44e-4, while the worst-case relative error in the final result r is about 9.32e-10.

The scale factors in the fixed-point computation are chosen to maximize the accuracy of intermediate computations (by retaining as many bits as possible), and to have the binary point conveniently fall on a word boundary, as in the computation of delta0, where the leading 1 can therefore be omitted.

The code requires delta0 to be a positive quantity, therefore r0 must always be an underestimate of the mathematical result 1/a. The linear approximations for the intervals therefore cannot be minimax approximations. Instead, the slope between the function values 1/a at the endpoints of each interval is computed, and its absolute value, scaled by 2^16, is stored in k1s, meaning the array elements are 0.16 fixed-point numbers. Starting at the function value for the midpoint of each interval, the slope is then applied to find the intercept for the left endpoint of each interval. This value is likewise scaled by 2^16 and stored in k0s, which therefore also holds 0.16 fixed-point numbers.

Based on my analysis, it seems that rounding towards 0 is employed in the floating-point to fixed-point conversion for entries in k0s while rounding towards positive infinity is employed in the floating-point to fixed-point conversion for entries in k1s. The following program implements the algorithm outlined above and produces table entries identical to those used in the code in the question.

#include <stdlib.h>
#include <stdio.h>

int main (void)
{
    printf ("interval  k0    k1\n");
    for (int i = 0; i < 16; i++) {
        double x0 = 1.0 + i / 16.0;       // left endpoint of interval
        double x1 = 1.0 + (i + 1) / 16.0; // right endpoint of interval
        double f0 = 1.0 / x0;
        double f1 = 1.0 / x1;
        double df = f0 - f1;
        double sl = df * 16.0;            // slope across interval
        double mp = (x0 + x1) / 2.0;      // midpoint of interval
        double fm = 1.0 / mp;
        double ic = fm + df / 2.0;        // intercept at start of interval
        printf ("%5d     %04x  %04x\n", i,
                (int)(ic * 65536.0 - 0.9999), (int)(sl * 65536.0 + 0.9999));
    }
    return EXIT_SUCCESS;
}

The output of the above program should be as follows:

interval  k0    k1
    0     ffc4  f0f1
    1     f0be  d62c
    2     e363  bfa1
    3     d76f  ac77
    4     ccad  9c0a
    5     c2f0  8ddb
    6     ba16  8185
    7     b201  76ba
    8     aa97  6d3b
    9     a3c6  64d4
   10     9d7a  5d5c
   11     97a6  56b1
   12     923c  50b6
   13     8d32  4b55
   14     887e  4679
   15     8417  4211

6 Comments

Thanks. Do you know how the required precision is achieved? The input is a 24-bit significand scaled to 32 bits. A kth-order approximation would give a (k+1)x precision factor. Also, how is the error in the result not more than 2.006 ulp (as stated in the comment)?
Regarding "worst case relative error in r0 is about 9.44e-4, while the worst case relative error in the final result r is about 9.32e-10": please explain. Thanks
I am not sure what you are asking. Relative error is defined as (result - reference) / reference. When assessing the worst-case error of an approximation one is interested only in the magnitude of that number, so compute abs ((result - reference) / reference) for all possible inputs in [0x80000000, 0xffffffff] and determine the maximum of these values, giving the numbers in my answer. I used a double-precision computation of the reciprocal as the reference: 1.0 / (a * exp2 (-31.0)).
Please see below for the scaling in the code above; it looks like the binary points are not aligned. Could you check? Format: {bits before binary point, bits after binary point}. r0 = {1,15}; delta0 = {32}; (r0 * (uint64_t) delta0)>>24 = {1,15} x {15,39} >> 24 = {16,54} >> 24 = {34,30}; r0<<16 + (r0 * (uint64_t) delta0)>>24 = {1,31} + {34,30} // binary points not aligned??
Please check again. r0 does not use a 1.15 representation, but a 0.16 representation. The input a is in [1,2), so the reciprocal r0 must be in (0.5,1], but the approximation always underestimates, meaning it is strictly less than 1. Therefore r0 is a purely fractional number. With r0 scaled by 2^-16 and delta0 scaled by 2^-40, the product r0 * delta0 is scaled by 2^-56; shifting it right by 24 bits leaves a scaling of 2^-32, which lines up exactly with r0<<16 interpreted as a 0.32 number.