Convert from binary to floating point

Question

I'm doing some exercises for a university Computer Science course, and one of them is about converting an int array of 64 bits into its double-precision floating-point value.

Understanding the first bit, the sign +/-, is quite easy. The same goes for the exponent, since we know that the bias is 1023.

We are having problems with the significand. How can I calculate it?

In the end, I would like to obtain the real number that the bits represent.

If you could provide some example inputs and the expected outputs, this could help to reduce ambiguity (for example, I am not entirely clear on the exact nature of your input data).
– NPE, Commented May 17, 2012 at 12:23
The input data is just an array of bits populated like 1 0 1 0 ... and so on for 64 bits. Now I have to convert this array of bits into the floating-point number that has the corresponding binary sequence.
– wiredmark, Commented May 17, 2012 at 12:27

arthur · Accepted Answer · 2012-05-17 13:42:43Z

Computing the significand from the given 64 bits is quite easy.

According to the Wikipedia article on IEEE 754, the significand of a double has 53 bits of precision, but only 52 of them are stored: the fraction field, which is the last 52 of the 64 bits. A normal number is scaled so that its significand has the binary form 1.xxx..., so the leading bit is always 1.

Therefore there is no need to store the 53rd bit, because it always has the value one. That's why only 52 bits are stored for the significand instead of 53. The exception is an exponent field of all zeros (zero and subnormal numbers), where the leading bit is 0 instead.

Now, to compute it, you just need to take the bits of the significand, bit(1) to bit(52) (bit(0) is the implicit leading 1), give bit(i) the weight 2^-i, and add them up:

    #include <math.h>

    int index_signf = 1; // starting at 1, not 0
    int significand_length = 52;
    int byteArray[53]; // array containing the bits of the significand
    double significand_endValue = 0;

    // bit i of the fraction has the weight 2^-i
    for ( ; index_signf <= significand_length; index_signf++)
    {
        significand_endValue += byteArray[index_signf] * pow(2, -index_signf);
    }
    // add the implicit leading 1
    significand_endValue += 1;
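Written as a formula, the loop above evaluates the sum below. The symbols are notation introduced here for illustration: s is the sign bit, e is the 11-bit exponent field read as an unsigned integer, and b_i is bit(i) of the significand. This holds for normal numbers only (e neither all zeros nor all ones).

```latex
\[
m = 1 + \sum_{i=1}^{52} b_i \, 2^{-i},
\qquad
x = (-1)^{s} \cdot m \cdot 2^{\,e - 1023}
\]
```

For example, with s = 0, e = 1024, b_1 = 1 and all other b_i = 0, this gives m = 1.5 and x = 1.5 · 2 = 3.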

Now you just have to fill byteArray accordingly before computing it, using a function like this:

    // Fills significandBitsArray (53 ints, provided by the caller) from the 64 input bits.
    // Assumes array64bits[0] is the sign, [1..11] the exponent and [12..63] the fraction.
    // The caller passes the output array, because returning a pointer to a local array is invalid.
    void getSignificandBits(const int* array64bits, int* significandBitsArray){
        int i_significandBitsArray;
        // set the first bit = 1
        significandBitsArray[0] = 1;
        // fill it with the 52 stored bits, most significant first
        for (i_significandBitsArray = 1; i_significandBitsArray <= 52; i_significandBitsArray++)
            significandBitsArray[i_significandBitsArray] = array64bits[11 + i_significandBitsArray];
    }
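Putting sign, exponent and significand together, a minimal sketch of the whole conversion could look like the following. The function name `bits_to_double` and the bit layout are assumptions, not something stated in the question: index 0 is taken to be the sign bit, indices 1 to 11 the exponent (most significant bit first), and indices 12 to 63 the fraction. It also handles the two special exponent values, which the loop above does not cover.

```c
#include <math.h>

/* Hypothetical helper: bits[0] is the sign, bits[1..11] the biased
   exponent (most significant bit first), bits[12..63] the fraction. */
double bits_to_double(const int bits[64])
{
    int sign = bits[0] ? -1 : 1;

    /* read the 11 exponent bits as an unsigned integer */
    int exponent = 0;
    for (int i = 1; i <= 11; i++)
        exponent = exponent * 2 + bits[i];

    /* fraction = sum of bits[11 + i] * 2^-i for i = 1..52 */
    double fraction = 0.0;
    for (int i = 1; i <= 52; i++)
        fraction += bits[11 + i] * ldexp(1.0, -i);

    if (exponent == 2047)   /* all ones: infinity or NaN */
        return fraction == 0.0 ? sign * INFINITY : NAN;
    if (exponent == 0)      /* zero or subnormal: no implicit leading 1 */
        return sign * ldexp(fraction, -1022);

    /* normal number: implicit leading 1, bias 1023 */
    return sign * ldexp(1.0 + fraction, exponent - 1023);
}
```

`ldexp(x, n)` computes x · 2^n, so it avoids the general-purpose `pow` call for powers of two.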

Dervall · Accepted Answer · 2012-05-17 12:38:21Z

You could just load the bits into an unsigned integer of the same size as a double, take the address of that and cast it to a void* which you then cast to a double* and dereference.

Of course, this might be "cheating" if you really are supposed to parse the floating point standard, but this is how I would have solved the problem given the parameters you've stated so far.

4 Comments

wiredmark Over a year ago

I can't do that because I have an array of bits with each position occupied by 0 or 1. I read it as a string of bits, but my compiler doesn't. I was trying to convert it using the positions to represent the exponent and the sign, but I don't know how to convert the remaining bits of the significand.

Dervall Over a year ago

You can do that if you have the bit array. Just declare a long integer, and use the | and << operators to push each bit into its correct place.

Paul Hankin Over a year ago

Excuse my ignorance, but what's a bit array?

Dervall Over a year ago

There is really no such thing; he must have something like an array of integers with the values 0 and 1, or bytes or chars or whatever. Some array to represent either on or off states.
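The suggestion in the comments above, pushing each bit into an integer with | and <<, might be sketched as follows. The name `bits_to_double_packed` is hypothetical. The sketch assumes that bits[0] is the most significant (sign) bit, that `double` is a 64-bit IEEE 754 type, and that integers and doubles share the same byte order on the target machine. It uses `memcpy` rather than a pointer cast to copy the bit pattern.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper; assumes bits[0] is the most significant (sign) bit. */
double bits_to_double_packed(const int bits[64])
{
    /* shift the accumulated value left and append the next bit */
    uint64_t raw = 0;
    for (int i = 0; i < 64; i++)
        raw = (raw << 1) | (uint64_t)(bits[i] & 1);

    /* reinterpret the 64-bit pattern as a double */
    double result;
    memcpy(&result, &raw, sizeof result);
    return result;
}
```

A fixed-width `uint64_t` is used here because a plain `long` is only 32 bits wide on some platforms.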

Paul Hankin · Accepted Answer · 2012-05-17 12:56:02Z

If you have a byte representation of an object you can copy the bytes into the storage of a variable of the right type to convert it.

    #include <stdint.h>
    #include <string.h>

    double convert_to_double(uint64_t x) {
        double result;
        // copy the raw bytes of x into the double
        memcpy(&result, &x, sizeof(x));
        return result;
    }

You will often see code like *(double *)&x to do the conversion, but although this usually works in practice, it is undefined behavior in C because it violates the strict aliasing rule.
