
Floating Point Representation

Computer arithmetic differs fundamentally from exact mathematical arithmetic due to finite precision. Understanding floating point representation is crucial for analyzing numerical algorithms and their stability properties.

Definition: Floating Point Number System

A floating point number system $\mathbb{F}(\beta, p, L, U)$ is characterized by:

  • Base $\beta \geq 2$ (typically 2 or 10)
  • Precision $p \geq 1$ (number of significant digits)
  • Exponent range $[L, U]$ where $L < 0 < U$

A normalized floating point number has the form
$$x = \pm \beta^e \times (d_0.d_1d_2\cdots d_{p-1})_\beta,$$
where $e \in [L, U]$ is the exponent, $0 \leq d_i < \beta$ for all $i$, and $d_0 \neq 0$ for normalized numbers.
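To make the definition concrete, the following Python sketch enumerates every positive normalized number of a toy system; the parameters $\beta = 2$, $p = 3$, $L = -1$, $U = 1$ are chosen here for readability and are not from the text.

```python
# Enumerate the positive normalized numbers of a toy system F(beta=2, p=3, L=-1, U=1).
beta, p, L, U = 2, 3, -1, 1

values = set()
for e in range(L, U + 1):
    # d0 != 0 forces the significand into [1, beta),
    # i.e. integer significands m in [beta^(p-1), beta^p).
    for m in range(beta ** (p - 1), beta ** p):
        significand = m / beta ** (p - 1)   # d0.d1...d_{p-1} as a real number
        values.add(significand * beta ** e)

print(sorted(values))
```

Printing the sorted values shows the characteristic non-uniform spacing: the gap between consecutive numbers doubles at each power of $\beta$.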

The IEEE 754 standard defines two primary formats: single precision (binary32) with $\beta = 2$, $p = 24$, and double precision (binary64) with $\beta = 2$, $p = 53$. The standard also specifies special values including $\pm\infty$, NaN (Not a Number), and subnormal numbers for gradual underflow.
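As a quick check of these parameters, Python's standard library reports the characteristics of the platform's `float` type; the commented values assume an IEEE 754 binary64 platform.

```python
import sys

# binary64 parameters as reported by the Python runtime (IEEE 754 platforms).
# mant_dig counts the implicit leading bit, so p = 53.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.radix)      # 2  (the base beta)
print(sys.float_info.max_exp)    # 1024
print(sys.float_info.min_exp)    # -1021

# Special values mandated by the standard.
print(float('inf'), float('-inf'), float('nan'))
```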

Example: IEEE 754 Double Precision

In double precision floating point:

  • 1 sign bit
  • 11 exponent bits (biased by 1023)
  • 52 fraction bits (plus 1 implicit leading bit)

The number $x = 0.1_{10}$ cannot be represented exactly. Its binary expansion is
$$0.1 = (0.0\overline{0011})_2 = 2^{-4} \times (1.1001100110011\ldots)_2.$$

The stored approximation after rounding to 53 bits introduces a relative error of approximately $5.5 \times 10^{-17}$.
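The rounding can be observed directly in Python: converting the stored double to `Decimal` prints it exactly, exposing the digits introduced by rounding. This is an illustrative sketch, not part of the original example.

```python
from decimal import Decimal

# Decimal(0.1) converts the stored double exactly, exposing the rounding.
print(Decimal(0.1))   # 0.1000000000000000055511151231257827021181583404541015625
print((0.1).hex())    # 0x1.999999999999ap-4  (note the rounded-up final digit)

# Relative error of the stored approximation versus the exact value 1/10.
exact = Decimal(1) / Decimal(10)
rel_err = abs(Decimal(0.1) - exact) / exact
print(rel_err)        # about 5.55e-17, matching the figure above
```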

Remark

Machine Epsilon: The gap between 1 and the next representable floating point number is $\epsilon_{\text{mach}} = \beta^{1-p}$. For IEEE 754 double precision, $\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$. This quantity bounds the relative rounding error: if $x \in \mathbb{R}$ is within the range of $\mathbb{F}$, then $\text{fl}(x) = x(1 + \delta)$ with $|\delta| \leq \epsilon_{\text{mach}}$; under round-to-nearest the sharper bound $|\delta| \leq \epsilon_{\text{mach}}/2$ (the unit roundoff) holds.
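The gap can be found empirically by halving a candidate until adding half of it to 1 no longer changes the result; this standard sketch is ours, not from the source.

```python
import sys

# Halve eps until 1.0 + eps/2 rounds back to 1.0; the loop exits at the gap.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                             # 2.220446049250313e-16
print(eps == 2.0 ** -52)               # True
print(eps == sys.float_info.epsilon)   # True
```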

The finite precision of floating point systems introduces two critical phenomena: overflow when $|x| > \beta^U$ and underflow when $0 < |x| < \beta^L$. Modern systems handle these through special values (infinity) and subnormal numbers respectively, though subnormals sacrifice precision for range.
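Both phenomena are easy to trigger in double precision; the following Python demonstration (a sketch under IEEE 754 binary64 assumptions) uses `sys.float_info` for the format's limits.

```python
import sys

# Overflow: exceeding the largest finite double produces infinity.
largest = sys.float_info.max          # about 1.798e308
print(largest * 2)                    # inf

# Gradual underflow: subnormals extend the range below the smallest
# normalized number (2**-1022) at the cost of lost significant bits.
smallest_normal = sys.float_info.min  # 2**-1022
tiny = 2.0 ** -1074                   # smallest positive subnormal
print(0 < tiny < smallest_normal)     # True
print(tiny / 2)                       # 0.0 -- underflows completely
```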

Understanding these limitations is essential for designing robust numerical algorithms. Simple algebraic identities like $a + (b + c) = (a + b) + c$ fail in floating point arithmetic, requiring careful algorithm design to minimize accumulation of rounding errors through techniques like Kahan summation and compensated arithmetic.
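Since the text names Kahan summation, here is a minimal Python sketch of it; the function name `kahan_sum` is ours, not from the source.

```python
def kahan_sum(xs):
    """Compensated (Kahan) summation: carries rounding error forward."""
    total = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - c            # apply the correction from the previous step
        t = total + y        # low-order bits of y may be lost here...
        c = (t - total) - y  # ...but can be recovered algebraically
        total = t
    return total

xs = [0.1] * 10
print(sum(xs))               # 0.9999999999999999 -- naive accumulation drifts
print(kahan_sum(xs))         # 1.0 -- the correctly rounded result
```

Note that the compensation step relies on the subtractions being evaluated exactly as written; an optimizing compiler with unsafe floating point flags could reassociate `(t - total) - y` to zero, but a Python interpreter evaluates it faithfully.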