Floating Point 101

FLOATING POINT 101

FLOATING POINT 100.999998

we are engineers

we are researchers

3.14159265358979 3238462643383279 5028841971693993 7510582097494459 2307816406286208
WE PLAY WITH NUMBERS ALL DAY LONG

well, sometimes even at night. (yawn).

So, what is a floating point?

A floating point is $\pm D_1.D_2 D_3 \cdots D_n \times B^e$

where $\pm$ is the sign, $D_1.D_2 D_3 \cdots D_n$ is the significand, $B$ is the base, and $e$ is the exponent.

A floating point represents $\pm (D_1 + D_2 \cdot B^{-1} + D_3 \cdot B^{-2} + \dots + D_n \cdot B^{-(n-1)}) \cdot B^e$
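The expansion can be checked numerically with a minimal sketch. The helper name `fp_value` and its parameter layout are assumptions made for illustration; they do not come from the slides or from any library.

```c
#include <stddef.h>

/* Evaluate sign * (D1 + D2*B^-1 + ... + Dn*B^-(n-1)) * B^e.
 * fp_value is a hypothetical helper, not part of any library. */
double fp_value(int sign, const int *digits, size_t n, int base, int exponent)
{
    double significand = 0.0;
    double weight = 1.0;                 /* B^0, then B^-1, B^-2, ... */
    for (size_t i = 0; i < n; ++i) {
        significand += digits[i] * weight;
        weight /= base;
    }

    double scale = 1.0;                  /* B^e by repeated multiply/divide */
    for (int k = 0; k < exponent; ++k) scale *= base;
    for (int k = 0; k > exponent; --k) scale /= base;

    return (sign < 0 ? -significand : significand) * scale;
}
```

Called with the digits 3, 1, 4, base 10 and exponent 0, it returns a value close to 3.14; with the digits 1, 0, 0, base 2 and exponent 2, it returns 4.0.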

For example $+3.14 \times 10^{0} = (3 + 1 \cdot 0.1 + 4 \cdot 0.01) \cdot 1 = 3.14$

The point can float! $+3.14 \times 10^{-1} = 0.314$

The point can float! $+3.14 \times 10^{+1} = 31.4$

What if B = 2? $+1.00 \times 2^{+2} = 4.0$

Like machines do. http://grouper.ieee.org/groups/754/

Normalization of floating point

Multiple representations: $+0.01 \times 2^{2} = 1.0$, $+0.10 \times 2^{1} = 1.0$, $+1.00 \times 2^{0} = 1.0$

Normalized representation: of these, only $+1.00 \times 2^{0}$ has a nonzero leading digit.

Normalized representation: $+(1.)000 \times 2^{0}$. The leading 1 is omitted, so there's room for an extra digit!

Excess-127 representation (exponent → stored value): -127 → 0, -126 → +1, …, -1 → +126, 0 → +127
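To see the excess-127 encoding on a real machine, the stored exponent field can be read straight out of a `float`. This is a sketch under the assumption that `float` is an IEEE 754 single-precision value (32 bits, 8 exponent bits); the helper names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Return the 8-bit stored (biased) exponent field of a float.
 * Assumes float is IEEE 754 binary32. */
unsigned stored_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);      /* well-defined way to read the bit pattern */
    return (bits >> 23) & 0xFFu;         /* skip the 23 significand bits */
}

/* Undo the excess-127 bias (meaningful for normalized numbers only). */
int true_exponent(float x)
{
    return (int)stored_exponent(x) - 127;
}
```

For 1.0f, which is $+1.0 \times 2^{0}$, the stored field is 127; for 4.0f the true exponent comes back as 2.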

#include <float.h>: FLT_MIN, FLT_MAX, ...
#include <math.h>: M_PI, M_E, NAN, INFINITY, ...

Why no exact representation for 0.1?

FLOATING POINT is used to represent REAL NUMBERS

FLOATING POINT denotes a (finite) subset of RATIONAL NUMBERS

0.1 cannot be expressed as a finite sum of powers of 2: $+\,??? \times 2^{??}$
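A small sketch makes the consequence visible. It assumes IEEE 754 `float` and `double`; the function names are hypothetical, and `volatile` is only there to stop the compiler from keeping extra precision in registers.

```c
/* Difference between the float nearest to 0.1 and the double nearest to 0.1.
 * Neither is exactly 0.1; the float is simply further away. */
double tenth_float_error(void)
{
    volatile float f = 0.1f;
    return (double)f - 0.1;
}

/* Returns 1 if 0.1 + 0.2 compares equal to 0.3 in double arithmetic, else 0. */
int sum_equals_three_tenths(void)
{
    volatile double a = 0.1, b = 0.2;
    volatile double sum = a + b;
    return sum == 0.3;
}
```

Under those assumptions the first function returns a small nonzero value (on the order of 1e-9), and the second returns 0.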

It's also a matter of precision. With two digits after the implicit leading 1:

$+(1.)00 \times 2^{0} = 1$
$+(1.)01 \times 2^{0} = 1.25$
$+(1.)10 \times 2^{0} = 1.5$
$+(1.)11 \times 2^{0} = 1.75$

$\pi/2 \approx 1.5708$ falls between 1.5 and 1.75, so it cannot be represented.

Not just a matter of precision or base... With the next exponent the spacing changes:

$+(1.)00 \times 2^{1} = 2.0$
$+(1.)01 \times 2^{1} = 2.5$
$+(1.)10 \times 2^{1} = 3.0$

The representable values so far are 1, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0: the gap between neighbours doubles from 0.25 to 0.5 when the exponent goes from 0 to 1.
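The same widening of the gaps can be measured on a real `float`. The sketch below assumes IEEE 754 single precision and a positive finite argument; `gap_above` is a hypothetical helper that steps to the next bit pattern.

```c
#include <stdint.h>
#include <string.h>

/* Gap between x and the next larger float, for positive finite x.
 * Assumes float is IEEE 754 binary32; gap_above is a hypothetical helper. */
float gap_above(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += 1;                           /* next bit pattern = next larger float */
    float next;
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

Between 1 and 2 every gap is the same (it equals FLT_EPSILON), and at 2.0f the gap is twice as large, just like on the toy number line.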

Like death and taxes, rounding errors are a fact of life. http://wiki.octave.org/FAQ

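Operands that differ greatly: $+110 \times 2^{1} + 100 \times 2^{-2}$. Aligned to $\times 2^{1}$, the digits of the smaller operand no longer fit in the significand, so the result is $110 \times 2^{1}$.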
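The same absorption happens with real single-precision numbers. This is a sketch assuming IEEE 754 `float`; `add_as_float` is a made-up helper whose only job is to force the sum back into a `float`.

```c
/* Add two floats and store the result back in a float.
 * volatile keeps the compiler from carrying extra precision in registers. */
float add_as_float(float big, float small)
{
    volatile float sum = big + small;
    return sum;
}
```

Near 1.0e8 neighbouring floats are 8 apart, so `add_as_float(1.0e8f, 1.0f)` gives back 1.0e8f unchanged, while adding 8.0f does move the result.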

Operands that are really close: $+111 \times 2^{1} - 110 \times 2^{1} = 001 \times 2^{1}$

Operands that are really close: $+111 \times 2^{1} - 110 \times 2^{1} = 100 \times 2^{-2}$
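Subtracting close operands is harmless in itself, but it exposes rounding errors made earlier. A minimal sketch, assuming IEEE 754 `float` (the function name is hypothetical):

```c
/* Compute (1 + x) - 1 in single precision.
 * Mathematically this is x; in floating point, 1 + x is rounded first,
 * and the subtraction then exposes that rounding error. */
float one_plus_x_minus_one(float x)
{
    volatile float y = 1.0f + x;         /* rounded to a float close to 1 */
    volatile float r = y - 1.0f;         /* exact, but the damage is already done */
    return r;
}
```

For x = 1.0e-7f the result is FLT_EPSILON (about 1.19e-7), which is off by more than 15% from the x that went in.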

Fixed point representation: $+100.001010_2 = 2^{2} + 2^{-3} + 2^{-5} = 4.15625$
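In code, a fixed point number is often just an integer with an agreed position for the point. The sketch below assumes 6 fractional bits to match the example; `fixed_t`, `FRAC_BITS` and the function names are hypothetical.

```c
#include <stdint.h>

/* A hypothetical fixed-point format with 6 fractional bits:
 * the stored integer counts units of 2^-6. */
#define FRAC_BITS 6

typedef int32_t fixed_t;

double fixed_to_double(fixed_t v)
{
    return (double)v / (1 << FRAC_BITS);
}

fixed_t fixed_add(fixed_t a, fixed_t b)
{
    return a + b;                        /* plain integer addition, no rounding */
}
```

The bit pattern 100001010 (266 in decimal) then stands for 266 / 64 = 4.15625. The point never moves, so the spacing between values is the same everywhere, at the price of a much smaller range.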

WHAT'S THE POINT WITH FLOATING POINT?

FP ARITHMETIC IS FAST. Embedded in HW.

FP REPRESENTS A WIDE RANGE. Single precision: up to ~$10^{+38}$.

HE APPROVES FP

Anyway, errors are still there.

Okay, what about: increasing the number of digits, using decimal representations, estimating errors, thinking before you type?

More digits, please!
double (52 stored significand bits, 53 with the implicit leading 1)
long double (up to 112 stored significand bits, depending on the platform)
arbitrary precision*
* language support needed

Use decimal representations!
decimal (C# only)
BigDecimal (Java only)
std::decimal (C++, coming soon)*
* after IEEE-754 2008

Estimate the error of your algo: rel_err = fabs(f - fp) / fabs(f)
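As a function, the estimate might look like the sketch below. It reads `f` as the reference (exact or higher-precision) value and `fp` as the floating point result, which is an assumption about the slide's notation; the absolute value in the denominator keeps the result non-negative for negative references.

```c
#include <math.h>

/* Relative error of a computed value fp with respect to a reference f.
 * The reference must be nonzero. */
double rel_err(double f, double fp)
{
    return fabs(f - fp) / fabs(f);
}
```

For example, `rel_err(4.0, 3.0)` is 0.25, and so is `rel_err(-4.0, -3.0)`.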

Use float to represent time
float time = 0.0f;
while (true)
    time += 0.20;
This is BAD. And you should feel BAD.
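How bad? The sketch below runs the loop above for a fixed number of steps and compares it with one common alternative: count the steps in an integer and convert once. The alternative is a suggestion, not something the slides prescribe, and both function names are hypothetical. IEEE 754 `float` is assumed.

```c
/* Accumulate n steps of 0.20 in a float, as in the loop above. */
float accumulated_time(long n)
{
    volatile float time = 0.0f;          /* volatile: force rounding to float each step */
    for (long i = 0; i < n; ++i)
        time += 0.20;
    return time;
}

/* Alternative sketch: keep an exact integer tick count and convert once. */
double ticked_time(long n)
{
    return (double)n * 0.20;
}
```

After a million steps the two disagree by far more than one second, because once `time` gets large each 0.20 is rounded to the nearest value the float can still resolve.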

Compare float numbers
(a == b)
fabs(a - b) <= FLT_EPSILON
fabs(a - b) <= max(fabs(a), fabs(b)) * pc
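The last, relative form can be written as a small function. The sketch reads `pc` as a relative tolerance, which is an assumption about the slide; `nearly_equal` and `rel_tol` are names chosen for this example.

```c
#include <math.h>

/* Relative comparison: a and b count as equal if they differ by at most
 * rel_tol times the larger magnitude. rel_tol plays the role of pc. */
int nearly_equal(float a, float b, float rel_tol)
{
    float diff = fabsf(a - b);
    float fa = fabsf(a), fb = fabsf(b);
    float largest = fa > fb ? fa : fb;
    return diff <= largest * rel_tol;
}
```

Two neighbouring floats near 1.0e8 differ by 8, which is far above FLT_EPSILON, yet `nearly_equal(1.0e8f, 1.0e8f + 8.0f, 1.0e-6f)` accepts them. A fixed absolute threshold cannot do that for large and small magnitudes at the same time.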

There is no silver bullet.

Use libraries (when available).

Vector addition (naive)
float t[SIZE];
float result = 0.0f;
for (int i = 0; i < SIZE; ++i)
    result += t[i];
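To get a feeling for how far the naive loop can drift, it can be compared against Kahan (compensated) summation. This is a swapped-in textbook technique for illustration, not the library approach the slides go on to mention. The sketch assumes IEEE 754 `float` and must not be compiled with options such as `-ffast-math` that allow the compensation to be optimized away.

```c
#include <stddef.h>

/* Straightforward accumulation, as in the naive loop. */
float naive_sum(const float *t, size_t n)
{
    volatile float result = 0.0f;        /* volatile: round to float at every step */
    for (size_t i = 0; i < n; ++i)
        result += t[i];
    return result;
}

/* Kahan summation: c carries the low-order bits lost by each addition. */
float kahan_sum(const float *t, size_t n)
{
    volatile float sum = 0.0f, c = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float y = t[i] - c;              /* apply the correction from the last step */
        float s = sum + y;               /* low-order bits of y may be lost here */
        c = (s - sum) - y;               /* recover what was lost */
        sum = s;
    }
    return sum;
}
```

With one million elements all equal to 0.1f, the naive sum ends up hundreds away from 100000, while the compensated sum stays within a small fraction of 1.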

GNU GSL TO THE RESCUE

that's all folks!
@lorisfichera – https://kid-a.github.com
References and source code available at https://github.com/kid-a/floating-point-seminar
Credits
Font: Yanone Kaffeesatz (http://www.yanone.de/typedesign/kaffeesatz/)
