
I was doing some homework problems from my textbook and had a few questions on floating point rounding / precision for certain arithmetic operations.

If I have casted doubles from an int like so:

int x = random(); double dx = (double) x; 

And let's say the variables y, z, dy, and dz follow the same format.

Then would operations like:

(dx + dy) + dz == dx + (dy + dz)
(dx * dy) * dz == dx * (dy * dz)

be associative? I know that with fractional values they would not be, because some precision is lost to rounding depending on which operands are added or multiplied first. However, since these doubles are cast from ints, I feel like precision would not be a problem and these operations could be associative?

And lastly, the textbook I'm using does not explain FP division at all, so I was wondering whether this statement is true, or at least how floating point division works in general:

dx / dx == dz / dz 

I looked this up online and read in some places that an operation like 3/3 can yield 0.999...9, but there wasn't enough information to explain how that happens or whether it varies with other division operations.

  • A good compiler should recognize dx/dx and not actually emit division instructions. Commented May 7, 2015 at 2:27
  • You can exactly represent any value up to 2^53 + 1 as a double. Beyond that you run into rounding errors, even for integer types. stackoverflow.com/a/1848762/141172 Commented May 7, 2015 at 2:30
  • You might remember from your grade school days that a number divided by itself is 1, so the comparison 'may' work. In general, however, floating point numbers should never be compared using '=='; instead, take the absolute difference and check whether it is less than some threshold. Commented May 7, 2015 at 3:28
  • (dx * dy) * dz == dx * (dy * dz) is a problem if the precision of a double is less than twice the precision of an int, which is often the case. (dx + dy) + dz == dx + (dy + dz) is unlikely to be a problem, as double precision is certainly more than int precision + 1. dx / dx == dz / dz has an obvious problem should dx == 0 or dz == 0. Commented May 7, 2015 at 4:04
  • When dx * dy and dy * dz are greater than 2^53, there may be a precision issue. With double dx = (double)INT_MAX; double dy = (double)(INT_MAX - 0x111111); double dz = (double)(INT_MAX - 0xabcd);, the comparison (dx * dy) * dz == dx * (dy * dz) is false. Commented Dec 1, 2021 at 7:24

2 Answers


Assuming int is at most 32 bits and double follows IEEE-754: a double can store integer values up to 2^53 exactly.


In the case of addition:

(dx + dy) + dz == dx + (dy + dz) 

Both sides of == are computed exactly, so addition here is associative.


While in the case of multiplication:

(dx * dy) * dz == dx * (dy * dz) 

It's possible that the products exceed 2^53, so the two sides are not guaranteed to be equal.


2 Comments

Just clarifying, so the reason the max a double can store is 2^53 is because of the 52-bit mantissa + the implied leading 1?
To be clear: double can store all integers exactly from -pow(2,53) to +pow(2,53), inclusive -- just like a 54-bit signed integer or int54_t.

You should understand that floating point numbers are typically internally represented as a sign bit, a fixed point mantissa (of 52 bits with an implied leading one for IEEE 64-bit doubles), and a binary exponent (11 bits for IEEE doubles). You can think of the exponent as the "quantum" of math units for a given value.

The addition should be associative if the sums all fit into the mantissa without the exponent going above 2^0 == 1. If random() is producing 32-bit integers, a sum such as (dx + dy) + dz will fit, and the addition will be associative.

In the case of multiplication, it's easy to see that the product of 2 32-bit numbers may go well over 53 bits, so the exponent may need to go above 1 for the mantissa to contain the magnitude of the result, so associativity fails.

For division, in the particular case of dx / dx, the compiler may replace the expression with a constant 1.0 (perhaps after a zero check).

2 Comments

The exponent also has a built-in offset (I'm thinking 256) to allow for both positive and negative exponents without having to consume a bit for the sign.
The exponent offset (usually called bias) is half the exponent range. For IEEE double with 11 bits of exponent, the bias is 1023. For IEEE single, it's 127.
