

  From: Bret Halford ([email protected])
  Subject: Re: float data type


  Newsgroups:
  comp.databases.sybase
  Date: 2000/05/31

The floating point datatype was designed to hold a wide range of values and
to allow fairly rapid arithmetic operations on them, at the expense of
absolute accuracy.

The exact nature of the imprecision inherent in floating point datatypes is a
source of much confusion to many people.  This paper attempts to explain the
nature of that imprecision.  Some aspects of actual floating point
implementations have been simplified or ignored (such as the final
two's-complement representation).

It should be noted that Sybase did not develop the floating point datatype;
it is a widely used IEEE standard.  C or C++ programs on the same platform as
SQL Server will demonstrate similar floating point behavior [see Question
5 in the Q&A section below].  There are two common standard types of floating
point numbers: 4-byte reals and 8-byte doubles.

Reals and doubles store values in a similar format: 1 sign bit, <x> exponent
bits, and <y> mantissa bits.  The only difference is that reals use smaller
exponents and mantissas.

-------------------------------------------------------------------------
       According to the IEEE standard for floating point:
Real datatypes have 1 sign bit, 8 exponent bits, and a 24-bit mantissa* (total 32).
Double datatypes have 1 sign bit, 11 exponent bits, and a 53-bit mantissa* (total 64).

Some platforms use different standards.  A double on a VAX, for instance, uses
a 56-bit mantissa*, an 8-bit exponent, and 1 sign bit (total 64).

*The left-most (leading) bit of the mantissa is implicit.  It is not actually
stored, as it is always on.
-------------------------------------------------------------------------

The mantissa is a binary representation of the number, each bit representing
a power of two.  There is an additional implicit bit at the beginning
(left-hand side) of the mantissa, which is always on.

The exponent indicates a power of two that is used to multiply (or shift)
the mantissa to represent larger or smaller values.  The first bit of the
mantissa represents the value 2^<exponent>, the second bit 2^<exponent-1>, etc.

After the mantissa bits needed to represent the whole-number part of the
number have been used, the fractional part of the number is represented with
the remaining bits (if any), which have values of negative powers of two.

For the sake of a simple demonstration, imagine an even smaller floating
point format with a 12-bit mantissa [including one implicit bit], one sign
bit, and 4 exponent bits (total 16).

#  [1]###########  ####
^   ^ ^            ^
|   | |            Exponent
|   | Mantissa
|   Implicit bit of Mantissa (always on)
Sign bit

To represent a number, first determine whether it is positive or negative,
and set the sign bit if it is negative.  Then determine the smallest power
of 2 that is larger than the number.  Subtract one from that power to
find the exponent of the implicit bit in the mantissa, and store that
exponent in the exponent field.  Subtract the value of the implicit bit
(2^exponent) from the number.  Next, determine whether 2^(exponent-1) is
larger than the remainder.  If it is not, set the next bit and subtract its
value from the remainder.  Then test 2^(exponent-2) against the new remainder
in the same way.  Repeat until you run out of mantissa bits.


For instance, to represent 123.3:

The number is positive, so the sign bit is set to 0:

0  [1]###########  ####

The smallest power of 2 that is larger than 123.3 is 128, or 2^7,
so the exponent is 7-1, or 6.  2^6 is 64, the value of the implicit mantissa bit.

0  [1]###########  0110

123.3 - 64 is 59.3.  2^5 is 32, which is smaller than 59.3, so the next bit is set

0  [1]1########## 0110

59.3-32 = 27.3.  2^4 is 16, which is smaller than 27.3, so the next bit is set

0  [1]11######### 0110

27.3 - 16 = 11.3.  2^3 is 8, which is smaller than 11.3, so the next bit is set

0  [1]111######## 0110

11.3 - 8 = 3.3.  2^2 is 4, which is larger than 3.3, so the next bit is not set

0  [1]1110####### 0110

3.3 - 0 = 3.3.  2^1 is 2, which is smaller than 3.3, so the next bit is set

0  [1]11101###### 0110

3.3 - 2 = 1.3.  2^0 is 1, which is smaller than 1.3, so the next bit is set

0  [1]111011##### 0110

1.3 - 1 = 0.3.  2^-1 is 0.5, which is larger than 0.3, so the next bit is not set

0  [1]1110110#### 0110

0.3 - 0 = 0.3.  2^-2 is 0.25, which is smaller than 0.3, so the next bit is set

0  [1]11101101### 0110

0.3 - 0.25 = 0.05.  2^-3 is 0.125, which is larger than 0.05, so the next bit is not set

0  [1]111011010## 0110

0.05 - 0 = 0.05.  2^-4 is 0.0625, which is larger than 0.05, so the next bit is not set

0  [1]1110110100# 0110

0.05 - 0 = 0.05.  2^-5 is 0.03125, which is smaller than 0.05, so the next bit is set

0  [1]11101101001 0110

This represents the actual value

64 + 32 + 16 + 8 + 2 + 1 + 0.25 + 0.03125 = 123.28125

123.3 - 123.28125 is an error of 0.01875

It may be possible to reduce the error by rounding up to the next larger
number that can be represented (i.e., by adding 2^-5).  This representation
would be

0 [1]11101101001 0110
+              1
0 [1]11101101010 0110

64 + 32 + 16 + 8 + 2 + 1 + 0.25 + 0.0625 = 123.3125

123.3 - 123.3125 is an error of -0.0125

This is the smaller error, so the value is rounded to

0 [1]11101101010 0110 (123.3125) as the final representation.


The standard "real" and "double" floating point formats work exactly the
same way, except that they have wider mantissas, which reduce the magnitude
of the potential error, and wider exponents, which extend the possible range
of the number.

         Some frequently asked questions:

----------------------------------------------------------------------
1) Why doesn't round() work right with floating point? There is
garbage in the sixth decimal place when I round a floating point, as in:

declare @x real
select @x = 123.31
select @x = round(@x,1)
select @x
go
-------------
   123.300003

A:  The decimal rounded value of 123.31 is 123.3, but the real datatype
cannot store 123.3 exactly.  The garbage is due to the inherent imprecision
of the floating point format.  There is also a display issue:  isql by default
displays floats with 6 digits after the decimal point.  Some front end
programs are more intelligent about this: they know how many digits of the
number will be accurate and truncate or round the display at that point.  You
should use the TSQL str() function for better control over the display of
floating point data [str() is documented under "String Functions" in the
manuals.]  For instance, we rounded to 1 decimal place, so there is no need
to display past one decimal place:

select str(@x,8,1)
go
--------
   123.3

(1 row affected)


-----------------------------------------------------------------------
2) So just how inaccurate are reals and doubles?

4-byte reals (24 significant bits) can store a number with a maximum error
of about (the number) * (2^-24)

8-byte doubles (53 significant bits) can store a number with a maximum error
of about (the number) * (2^-53)

As a rule of thumb, this means you can expect the
first 7 digits of a real to be correct, and the first 15 digits of
a double to be correct.  After that, you may start seeing signs of
inaccuracy or "garbage".


------------------------------------------------------------------------
3)  When I declare a column or variable to be of type float, there is an
optional [(precision)] specification.  What effect does this have?

If the precision is < 16, the server will use a 4-byte real.
If the precision is >= 16, the server will use an 8-byte double.
You can also explicitly tell the server to use type "real" or type
"double precision".  If you don't specify a precision and use "float",
the server will default to a double.

The (precision) specification otherwise has no effect.  The syntax may
seem somewhat pointless, but it is allowed for compatibility with DDL
developed for other systems that interpret (precision) differently.

------------------------------------------------------------------------

4)  So floating point only has problems storing fractions, right?

Nope.  You can see problems in whole numbers, too.

For instance, reals have 24 significant mantissa bits (23 stored plus the
implicit bit), so they will have problems with whole numbers that require
more than 24 bits to represent correctly.

The smallest value we see this for is 2^24+1.  Reals can store 2^24
with no problem (it only requires the implicit bit being on and the
exponent set to 24; all the other mantissa bits are zeroed), but 2^24+1
requires 23 zero bits and a final one bit following the implicit bit
(24 stored bits in all, with only 23 available).


1> select  power(2,24)-1, power(2,24), power(2,24)+1
2> go
----------- ----------- -----------
   16777215    16777216    16777217
(1 row affected)

1> create table float_test (x real, y real, z real)
2> go
1> insert float_test values (power(2,24)-1, power(2,24), power(2,24)+1)
2> go
(1 row affected)
1> select * from float_test
2> go
x                    y                    z
-------------------- -------------------- --------------------
     16777215.000000      16777216.000000      16777216.000000
(1 row affected)


Note that the closest representation of 2^24+1 in a real is equal to 2^24

-----------------------------------------------------------------------

5)  I don't see this behavior in my C/C++ program.  What's up?

You probably aren't looking hard enough.  In general, printf() in C and
cout in C++ do not print with enough precision to show the problem;
the imprecision is hidden by rounding done by the display process.

Try specifying a higher precision, as in these two sample programs:

=========================================
For C:
=========================================
#include <stdio.h>

/* Program to demonstrate floating point imprecision */
int main(void)
{
    float r = 123.3f;
    double d = 123.3;

    printf("As a %d-byte real:    123.3 is %48.24f \n", (int)sizeof(r), r);
    printf("As a %d-byte double:  123.3 is %48.24f \n", (int)sizeof(d), d);
    return 0;
}

Sample output on Solaris 2.5:

alliance1{bret}125: a.out
As a 4-byte real:    123.3 is                     123.300003051757812500000000
As a 8-byte double:  123.3 is                     123.299999999999997157829057


=========================================
For C++:
=========================================
#include <iostream>

// Program to demonstrate floating point inaccuracy.
int main()
{
    float y = 123.3f;
    std::cout << "123.3 as a float printed with increasing precision" << std::endl;
    std::cout << "-------------------------------------------------------------" << std::endl;
    for (int precision = 1; precision < 30; precision++)
    {
        std::cout.precision(precision);
        std::cout << precision << "   " << y << std::endl;
    }

    double x = 123.3;
    std::cout << std::endl;
    std::cout << "123.3 as a double, printed with increasing precision" << std::endl;
    std::cout << "-------------------------------------------------------------" << std::endl;
    for (int precision = 1; precision < 30; precision++)
    {
        std::cout.precision(precision);
        std::cout << precision << "   " << x << std::endl;
    }
    return 0;
}

Sample output on Solaris 2.5:

alliance1{bret}140: a.out
123.3 as a float printed with increasing precision
-------------------------------------------------------------
1   1e+02
2   1.2e+02
3   123
4   123.3
5   123.3
6   123.3
7   123.3
8   123.3
9   123.300003
10   123.3000031
11   123.30000305
12   123.300003052
13   123.3000030518
14   123.30000305176
15   123.300003051758
16   123.3000030517578
17   123.30000305175781
18   123.300003051757812
19   123.3000030517578125
20   123.3000030517578125
21   123.3000030517578125
22   123.3000030517578125
23   123.3000030517578125
24   123.3000030517578125
25   123.3000030517578125
26   123.3000030517578125
27   123.3000030517578125
28   123.3000030517578125
29   123.3000030517578125

123.3 as a double, printed with increasing precision
-------------------------------------------------------------
1   1e+02
2   1.2e+02
3   123
4   123.3
5   123.3
6   123.3
7   123.3
8   123.3
9   123.3
10   123.3
11   123.3
12   123.3
13   123.3
14   123.3
15   123.3
16   123.3
17   123.3
18   123.299999999999997
19   123.2999999999999972
20   123.29999999999999716
21   123.299999999999997158
22   123.2999999999999971578
23   123.29999999999999715783
24   123.299999999999997157829
25   123.2999999999999971578291
26   123.29999999999999715782906
27   123.299999999999997157829057
28   123.299999999999997157829057
29   123.29999999999999715782905696

---------------------------------------------------------------------
6)  Where can I find more information on floating points?

Many books on assembly language programming go into great detail.

A search on the World Wide Web for keywords "IEEE" and "floating"
will provide many documents of interest.

