Re: Rounding of the double



When in doubt, trust the representation. If I get involved in details of "how many
digits of precision", the first rule is that only the binary bits matter. Decimal
representations do not; they are only approximations of the true value that is stored. So
arguing about the precision of binary floating point by showing decimal numbers is usually
suspect.

The formal specification of floating point precision is usually expressed as ±1 LSB (Least
Significant Bit). So if you have a representation like IEEE, which has an implied 1 in
the high-order position, you have nominally 1 additional bit of precision, but it is
always expressed in binary.
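
To make "one LSB" concrete, here is a minimal sketch (not part of the original program) that measures the gap between a double and the next representable double above it. The helper name ulp_above is made up for illustration, and the code assumes a C++11 compiler and IEEE 754 64-bit doubles.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: the distance from x to the next representable
// double above it, i.e. the weight of one least significant bit at x.
// Assumes x is finite and that double is IEEE 754 binary64.
double ulp_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

Under those assumptions, ulp_above(1.0) is 2**-52, the same value as std::numeric_limits<double>::epsilon(), and ulp_above(4e-15) is 2**-100. The size of one LSB depends on the exponent, which is why a fixed count of decimal digits is a poor way to describe it.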

So rewriting the program below,

#include "stdafx.h"

// A double overlaid with its raw 64-bit pattern. Reading the member that was
// not written last relies on how Visual C++ treats unions.
union dint {
    double d;
    unsigned __int64 i;

    // Formats: sign, biased exponent, (unbiased exponent), 52-bit fraction in hex.
    CString HexString() const {
        CString s;
        s.Format(_T("%c%04I64u (%+04I64d) %013I64x"),
                 (i & 0x8000000000000000) ? _T('-') : _T('+'),
                 (i >> 52) & 0x00000000000007FF,
                 ((__int64)((i >> 52) & 0x00000000000007FF)) - 1023,
                 i & 0x000FFFFFFFFFFFFF);
        return s;
    }
};

int main()
{
    if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0))
    {
        _tprintf(_T("Fatal Error: MFC initialization failed\n"));
        return 1;
    }

    dint a1;
    a1.d = 4e-15;
    dint a2;
    a2.d = 1 + a1.d;
    dint a3;
    a3.d = 1 - a2.d;
    bool a4 = 1 == a2.d;

    // Cast each CString to LPCTSTR explicitly: passing a class object through
    // "..." is not portable, and %s has to match the TCHAR build setting.
    _tprintf(_T("A1 = %+.20e %016I64X %s\n"), a1.d, a1.i, (LPCTSTR)a1.HexString());
    _tprintf(_T("A2 = %+.20e %016I64X %s\n"), a2.d, a2.i, (LPCTSTR)a2.HexString());
    _tprintf(_T("A3 = %+.20e %016I64X %s\n"), a3.d, a3.i, (LPCTSTR)a3.HexString());
    _tprintf(_T("A4 = %s\n"), a4 ? _T("TRUE") : _T("FALSE"));
    return 0;
}

I get
A1 = +4.00000000000000030000e-015 3CF203AF9EE75616 +0975 (-048) 203af9ee75616
A2 = +1.00000000000000400000e+000 3FF0000000000012 +1023 (+000) 0000000000012
A3 = -3.99680288865056350000e-015 BCF2000000000000 -0975 (-048) 2000000000000
A4 = FALSE

That is, a1 is 0x1.203af9ee75616 x 2**-48
or
1.0010 0000 0011 1010 1111 1001 1110 1110 0111 0101 0110 0001 0110 x 2**-48

(53 bits of mantissa including the implied high-order 1-bit)

that is,
0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0010 0000 0011 1010 1111
1001 1110 1110 0111 0101 0110 0001 0110
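
The same field split can be sketched without MFC or a union, for anyone who wants to check the numbers with another compiler. This is only an illustration: DoubleFields and decode are made-up names, and the code assumes double is a 64-bit IEEE 754 value.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type holding the three fields of a 64-bit double.
struct DoubleFields {
    bool negative;            // sign bit
    int exponent;             // unbiased exponent (meaningful for normal numbers only)
    std::uint64_t fraction;   // the 52 stored fraction bits, without the implied 1
};

// Copies the bit pattern out with memcpy and masks off each field.
DoubleFields decode(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);

    DoubleFields f;
    f.negative = (bits >> 63) != 0;
    f.exponent = static_cast<int>((bits >> 52) & 0x7FF) - 1023;
    f.fraction = bits & 0x000FFFFFFFFFFFFFULL;
    return f;
}
```

With those assumptions, decode(4e-15) gives exponent -48 and fraction 0x203af9ee75616, and decode(1.0 + 4e-15) gives exponent 0 and fraction 0x12, matching the A1 and A2 lines above.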

I leave the rest as Exercises For The Readers.
joe



On Sat, 02 Jun 2007 22:35:38 GMT, MrAsm <mrasm@xxxxxxx> wrote:

On Sat, 02 Jun 2007 12:59:35 -0500, "Doug Harrison [MVP]"
<dsh@xxxxxxxx> wrote:


#include <stdio.h>

int main()
{
    double a1 = 4e-15;
    double a2 = 1+a1;
    double a3 = 1-a2;
    bool a4 = 1 == a2;

    printf("A1 = %.20e\n", a1);
    printf("A2 = %.20e\n", a2);
    printf("A3 = %.20e\n", a3);
    printf("A4 = %s\n", (a4) ? "TRUE" : "FALSE");
}

The output I get is (VC8, cl a.cpp):


A1 = 4.00000000000000030000e-015
     1 23456789012345xxxxxx
               |    |
              10   15

The digits I put an 'x' under are *not* significant IMHO; in fact, the
IEEE754 double format has a precision of 15 digits.

e.g.:
http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html

And you can see a spurious "3" in one of the 'x' positions (you wrote
a1=4e-15 in the C++ code, so the "3" is actually spurious).

A2 = 1.00000000000000400000e+000
     1 23456789012345xxxxxx
               |    |
              10   15

Now the 4 is spurious. In fact, the 4 falls past the 15th significant
digit (it is the 16th).


A4 = FALSE

bool a4 = 1 == a2;

But you get 'false' here because you are comparing floating point
numbers the *wrong* way. IMHO you *can't* use operator== to compare
floating point numbers for equality; I believe that you can only do
"fuzzy" compares with floating points, e.g.

|1 - a2| < tolerance
fabs( 1 - a2 ) < tolerance.
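
A minimal sketch of such a fuzzy compare follows. The function name nearly_equal and both tolerance parameters are made up for illustration; suitable tolerance values depend on the computation and are not given anywhere in this thread.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b differ by no more than abs_tol,
// or by no more than rel_tol times the larger of their magnitudes.
// Both tolerances are assumptions that the caller has to choose.
bool nearly_equal(double a, double b, double rel_tol, double abs_tol) {
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

For example, nearly_equal(1.0, 1.0 + 4e-15, 1e-12, 0.0) is true, while 1.0 == 1.0 + 4e-15 is false. The relative term scales the tolerance with the magnitude of the operands, and the absolute term covers values close to zero.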

BTW: I would very much like to know also what David Webber (who is a
mathematician and theoretical physicist, see
http://www.mozart.co.uk/information/author/authinfo.htm) thinks about that.
Thanks.

MrAsm
Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm