1

Suppose I have some variable to double precision, x, that I read from a binary file. When printed this variable x = 384.6433096437924

I want to 'bump' it to quad precision.

integer, parameter :: qp = selected_real_kind(32,307) integer, parameter :: dp = selected_real_kind(15,307) real(kind=dp) :: x_double real(kind=qp) :: x_quad x_double = 384.6433096437924 x_quad = real(x_double, kind=dp) print *, x_double, x_quad, 384.64330964379240_qp 

This outputs,

384.6433096437924, 384.64330964379240681882947683334351, 384.643309643792400000000000000000009 

Which of the quad numbers is the 'best' - i.e. most accurate, consistent etc. value to use? The one with 'junk' or the one without?

I understand (I think!) the issues relating to the exact representation of a float on a computer, but cannot see which is the way to go.

I compile as gfortran -fdefault-real-8

1
  • 1
    Please check that your code is exact and that the results are copied exactly and tell us your compiler options. Please see the discussion below the answer. Your results are strange. See also stackoverflow.com/questions/16370206/… Commented May 22, 2018 at 15:37

1 Answer 1

1

The one with what you call “junk” has the least loss of information from the double value.

When 384.6433096437924 is converted to double-precision floating-point, the result is a value in a binary-based format, and that value is different from the original value. If you know the original value had 16 significant decimal digits, then printing the double with 16 decimal digits can produce the original value (in most cases), and that makes sense.

However, if the number has been through various computations, and there is no reason to believe it represents a number that has some exact number of decimal digits, then truncating or altering its value for display does not provide any useful information. It may make it easier for a human to read, but the entire computed value is the best information we have about the result, so printing the entire value conveys the most information.

Alternatively, if the original number can be converted directly to quad precision without going through double precision, then the error from the conversion will generally be less, and this method should be preferred.

Ultimately, which form is best for display depends on the purposes of the display. Do you want to preserve the most information for future use? Or do you want to make the result easily readable by a human?

Finally, the “junk” you show is strange. The result of converting 384.6433096437924 to IEEE-754 basic 64-bit binary floating-point should be 384.64330964379240640482748858630657196044921875, but you show 384.64330964379240681882947683334351. I do not see how that value arose.

Sign up to request clarification or add additional context in comments.

6 Comments

That value may have arosed because the literal 384.6433096437924 is single precision. stackoverflow.com/questions/6146005/… stackoverflow.com/questions/16370206/… But maybe OP's compiler uses double as the default.
@VladimirF: Hmm, IEEE-754 32-bit would have produced 384.643310546875. But there may be something going on with the selected_real_kind(15,307) that I do not understand. I know IEEE-754 reasonably well but have not followed FORTRAN for many years.
My results after executing OP's code in normally set gfortran are indeed 384.64331054687500 384.643310546875000000000000000000000 384.643309643792399999999999999999998
gfortran -fdefault-real-8 returns 384.64330964379241 384.643309643792406404827488586306572 384.643309643792399999999999999999998
@user1887919 That is a very important information, please edit your question and add it as I asked you before. Still, as you can see, I got somewhat different results. Please double-check everything and post the exact compiler options, the exact code and the compiler version. Again, as I asked you to do already.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.