32-bit fixed point arithmetic 1x1 does not equal 1

Question

I'm implementing 32-bit signed integer fixed-point arithmetic. The scale is from 1 to -1, with INT32_MAX corresponding to 1. I'm not sure whether to make INT32_MIN or -INT32_MAX correspond to -1, but that's an aside for now.

I've made some operations to multiply and round, as follows:

#define mul(a, b) ((int64_t)(a) * (b))

#define round(x) (int32_t)((x + (1 << 30)) >> 31)

The product of two numbers can then be found using round(mul(a, b)).

The issue comes up when I check the identity. The main problem is that 1x1 is not 1. It's INT32_MAX-1. That's obviously not desired as I would like bit-accuracy. I suppose this would affect other nearby numbers so the fix isn't a case of just adding 1 if the operands are both INT32_MAX. Additionally, -1x-1 is not -1, 1x-1 is not -1, and -1x-1=-1. So none of the identities hold up.

Is there a simple fix to this, or is this just a symptom of using fixed point arithmetic?

x + (1 << 30) is very likely to cause an integer overflow bug. How do you protect against that? — Lundin
– Lundin, Commented Jun 17, 2021 at 6:45
" I'm not sure whether to make INT32_MIN or -INT32_MAX correspond to -1". It must be -INT32_MAX otherwise you will have a different 'scale' for positive and negative numbers and won't be able to do any operations that involve a positive and a negative number. — Weather Vane
– Weather Vane, Commented Jun 17, 2021 at 6:55
In your 1x1 example after multiplying 2147483647 * 2147483647 you need to divide by 2147483647 to realign. You can't do that by shifting. You can only to that if the 'scale' is a power of 2, so that 1 is represented by, for example, 2**30. Then the product of two positive vlaues could be ((int64_t)a * b) >> 30 with some rounding before shifting if you like. — Weather Vane
– Weather Vane, Commented Jun 17, 2021 at 6:59
The simple fix is to use a scale factor that is a power of 2. (Weather Vane already pointed that out, but it bears repeating, with emphasis.) So you have two choices: 1) Accept a range of [-1, +1) using 2**31 as the scale factor, or 2) use a range of [-2, +2) with a scale factor of 2**30. Those are half-open intervals as defined here. — user3386109
– user3386109, Commented Jun 17, 2021 at 8:06
I'd agree with others (that the scale should be a power of 2); but I'd go one step further and say that it is NOT fixed point maths unless the scale is a power of 2. The entire idea (and definition of) "fixed point" is that some bits are before the decimal point and some are after the decimal point, and the decimal point is at a fixed position between these groups of bits. — Brendan
– Brendan, Commented Jun 17, 2021 at 8:35

Eric Postpischil · Accepted Answer · 2021-06-17 16:38:05Z

In its general form, a fixed-point format represents a number x as an integer x•s. Commonly, s is a power of some base b, s = b^p. For example, we might store a number of dollars x as x•100, so $3.45 might be stored as 345. Here we can easily see why this is called a “fixed-point” format: The stored number conceptually has decimal point inserted as a fixed position, in this case two to the left of the rightmost digit: “345” is conceptually “3.45”. (This may also be called a radix point rather than a decimal point, allowing for cases when the base b is not ten. And p specifies where the radix-point is inserted, p base-b digits from the right.)

If you make INT_MAX represent 1, then you are implicitly saying s = INT_MAX. (And, since INT_MAX is not a power of any other integer, we have b = INT_MAX and p = 1.) Then −1 must be represented by −1•INT_MAX = -INT_MAX. It would not be represented by INT_MIN (except in archaic C implementations where INT_MIN = -INT_MAX).

Given s = INT_MAX, shifting by 31 bits is not a correct way to implementation multiplication. Given two numbers x and y with representations a and b, the representation of xy is computed by multiplying the representations a and b and dividing by s:

a represents x, so a = xs.
b represents y, so b = ys.
Then ab/s = xsys/s = xys, and xys represents xy.

Shifting by 31 divides by 2³¹, so that is not the same as dividing by INT_MAX. Also, division is generally slow in hardware. You may be better off choosing s = 2³⁰ instead of INT_MAX. Then you could shift by 30 bits.

When calculating ab/s, we often want to round. Adding ½s to the product before dividing is one method of rounding, but it is likely not what you want for negative products. You may want to consider adding −½s if the product is negative.

Since an arithmetic right shift is equivalent to division by 2**n with rounding to negative infinity, a round-to-nearest fixed-point multiplication would in fact always add 2**(n-1) prior to the right shift. This assumes that C's right-shift operator, when applied to a signed integer type, results in an arithmetic right shift, which is not guaranteed by the ISO-C standard but is the case for all C compilers I am familiar with.

Collectives™ on Stack Overflow

32-bit fixed point arithmetic 1x1 does not equal 1

1 Answer 1

1 Comment

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Related