Floating Point Representation
Computer arithmetic differs fundamentally from exact mathematical arithmetic due to finite precision. Understanding floating point representation is crucial for analyzing numerical algorithms and their stability properties.
A floating point number system is characterized by:
- Base $\beta$ (typically 2 or 10)
- Precision $p$ (number of significant digits)
- Exponent range $[e_{\min}, e_{\max}]$ where $e_{\min} \le e \le e_{\max}$
A normalized floating point number has the form $x = \pm (d_0.d_1 d_2 \cdots d_{p-1})_\beta \times \beta^e$, where $e$ is the exponent and $d_0 \neq 0$ for normalized numbers, with $0 \le d_i \le \beta - 1$ for all $i$.
The IEEE 754 standard defines two primary formats: single precision (binary32) with $\beta = 2$, $p = 24$, and double precision (binary64) with $\beta = 2$, $p = 53$. The standard also specifies special values including $\pm\infty$, NaN (Not a Number), and subnormal numbers for gradual underflow.
In double precision floating point:
- 1 sign bit
- 11 exponent bits (biased by 1023)
- 52 fraction bits (plus 1 implicit leading bit)
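The three fields above can be inspected directly by reinterpreting a double's bytes as an integer. A minimal sketch using only the Python standard library (the helper name `double_bits` is illustrative, not a library function):

```python
import struct

def double_bits(x: float) -> tuple[int, int, int]:
    """Unpack a binary64 value into (sign, biased exponent, fraction) fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 sign bit
    exponent = (bits >> 52) & 0x7FF    # 11 exponent bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 explicit fraction bits
    return sign, exponent, fraction

sign, exponent, fraction = double_bits(1.0)
# 1.0 = +1.0 * 2^0, so the exponent field stores exactly the bias 1023
print(sign, exponent - 1023, fraction)  # → 0 0 0
```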
The number $0.1$ cannot be represented exactly. Its binary expansion is:
$$0.1 = (0.0\overline{0011})_2 = 0.000110011001100\ldots_2$$
The stored approximation after rounding to 53 bits introduces a relative error of approximately $5.6 \times 10^{-17}$.
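This rounding error can be observed exactly: constructing a `Decimal` from the float `0.1` reveals the precise value the hardware actually stores.

```python
from decimal import Decimal

stored = Decimal(0.1)     # exact value of the rounded binary64 approximation
exact = Decimal("0.1")    # the true decimal value

rel_err = float(abs(stored - exact) / exact)
print(stored)    # → 0.1000000000000000055511151231257827021181583404541015625
print(rel_err)   # roughly 5.6e-17
```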
Machine Epsilon: The gap between 1 and the next representable floating point number is $\epsilon_{\text{mach}} = \beta^{1-p}$. For IEEE 754 double precision, $\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$. This quantity bounds the relative rounding error: if $x$ is within the range of $F$, then $\mathrm{fl}(x) = x(1 + \delta)$ where $|\delta| \le u = \epsilon_{\text{mach}}/2 = 2^{-53}$.
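Machine epsilon can be computed empirically by halving a candidate gap until $1 + \epsilon/2$ rounds back to $1$; the result should match the value the runtime reports. A short sketch:

```python
import sys

# Halve eps until adding eps/2 to 1.0 no longer changes it
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                     # → 2.220446049250313e-16, i.e. 2^-52
print(sys.float_info.epsilon)  # same value, reported by the runtime
```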
The finite precision of floating point systems introduces two critical phenomena: overflow when $|x| > x_{\max}$ and underflow when $0 < |x| < x_{\min}$. Modern systems handle these through special values ($\pm\infty$) and subnormal numbers respectively, though subnormals sacrifice precision for range.
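Both phenomena are easy to trigger from Python, where `sys.float_info` exposes $x_{\max}$ and the smallest normal number:

```python
import math
import sys

big = sys.float_info.max    # largest finite double, about 1.8e308
tiny = sys.float_info.min   # smallest normal double, about 2.2e-308

print(big * 2)        # → inf   (overflow to the special value infinity)
print(tiny / 2 > 0)   # → True  (gradual underflow into the subnormal range)
print(5e-324 / 2)     # → 0.0   (below the smallest subnormal; flushes to zero)
```

Note that `tiny / 2` is nonzero only because of subnormals: without gradual underflow it would flush directly to zero.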
Understanding these limitations is essential for designing robust numerical algorithms. Simple algebraic identities like associativity of addition, $(a + b) + c = a + (b + c)$, fail in floating point arithmetic, requiring careful algorithm design to minimize accumulation of rounding errors through techniques like Kahan summation and compensated arithmetic.
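Kahan summation illustrates this design principle concretely: a second accumulator recovers the low-order bits that naive addition discards at each step. A self-contained sketch:

```python
def kahan_sum(values):
    """Compensated summation: carry the low-order bits lost at each step."""
    total = 0.0
    c = 0.0                     # running compensation for lost low-order bits
    for v in values:
        y = v - c               # apply the correction from the previous step
        t = total + y           # high-order part of the new sum
        c = (t - total) - y     # algebraically 0; in floats, the lost part
        total = t
    return total

xs = [0.1] * 10
print(sum(xs))        # → 0.9999999999999999  (naive sum accumulates error)
print(kahan_sum(xs))  # → 1.0
```

The key line is `c = (t - total) - y`: in exact arithmetic it is identically zero, but in floating point it captures exactly the rounding error of `total + y`, which is then fed back into the next iteration.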