
Fundamental Theorem of Rounding

The fundamental theorem of rounding establishes the basic error bound for representing real numbers in floating point arithmetic, forming the foundation for all subsequent error analysis.

Theorem (Fundamental Theorem of Rounding)

Let $\mathbb{F}(\beta, p, L, U)$ be a floating point system with round-to-nearest rounding. For any real number $x$ satisfying $\beta^L \leq |x| \leq \beta^U$, the floating point representation $\text{fl}(x)$ satisfies $\text{fl}(x) = x(1 + \delta)$ where $|\delta| \leq \frac{1}{2}\beta^{1-p} = \frac{1}{2}\epsilon_{\text{mach}}$.

Furthermore, for any basic arithmetic operation $\circ \in \{+, -, \times, \div\}$: $\text{fl}(x \circ y) = (x \circ y)(1 + \delta)$ where $|\delta| \leq \epsilon_{\text{mach}}$, assuming no overflow or underflow occurs.

This theorem provides the standard model for floating point arithmetic analysis. The bound $\frac{1}{2}\epsilon_{\text{mach}}$ reflects round-to-nearest rounding; directed rounding modes (round toward zero, round toward $\pm\infty$) satisfy only the weaker bound $|\delta| \leq \epsilon_{\text{mach}}$.
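The representation bound is easy to verify empirically in IEEE 754 double precision ($\beta = 2$, $p = 53$), where $\epsilon_{\text{mach}} = 2^{-52}$. A minimal sketch, using Python's `fractions` module for exact reference arithmetic (the test value $x = 1/10$ is illustrative; any value in range works):

```python
import sys
from fractions import Fraction

eps_mach = sys.float_info.epsilon  # beta^(1-p) = 2**-52 for double precision

# Exact relative error delta of rounding x = 1/10 to the nearest double.
x_exact = Fraction(1, 10)
fl_x = float(x_exact)                          # round-to-nearest representation
delta = (Fraction(fl_x) - x_exact) / x_exact   # exact rational arithmetic
rel_err = abs(delta)

# The theorem guarantees |delta| <= (1/2) * eps_mach.
assert rel_err <= Fraction(eps_mach) / 2
print(float(rel_err), eps_mach / 2)
```

For $x = 1/10$ the actual relative error is $2^{-54}$, comfortably inside the guaranteed bound of $2^{-53}$.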

Remark

The multiplicative error model $(1 + \delta)$ is crucial for error analysis because it naturally leads to relative error bounds. In contrast, an additive error model $\text{fl}(x) = x + \epsilon$ would require tracking absolute errors, which is less informative for numbers of vastly different magnitudes.

The theorem's power lies in its composability. For a sequence of $n$ operations, if each introduces error $(1 + \delta_i)$ with $|\delta_i| \leq \epsilon_{\text{mach}}$, the accumulated error is $(1 + \delta_1)(1 + \delta_2)\cdots(1 + \delta_n) = 1 + \theta_n$, where first-order analysis gives $|\theta_n| \leq n\epsilon_{\text{mach}}$ provided $n\epsilon_{\text{mach}} \ll 1$.
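This accumulation can be observed directly. A sketch under illustrative assumptions (summing the decimal value $0.1$, which is not exactly representable in binary, $n = 10{,}000$ times), again using exact rational arithmetic as the reference:

```python
import sys
from fractions import Fraction

eps_mach = sys.float_info.epsilon
n = 10_000

# Accumulate n additions of 0.1 in double precision; each addition
# introduces a rounding error of at most eps_mach/2 relative.
s_float = 0.0
for _ in range(n):
    s_float += 0.1

# Exact result, treating 0.1 as the true (unrounded) summand, so the
# measured error includes both representation and accumulation error.
s_exact = n * Fraction(1, 10)
rel_err = abs(Fraction(s_float) - s_exact) / s_exact

# First-order composability bound: n rounding errors of at most eps_mach each.
assert rel_err <= n * Fraction(eps_mach)
print(float(rel_err), n * eps_mach)
```

In practice the observed error sits well below the worst-case $n\epsilon_{\text{mach}}$ bound, since individual rounding errors partially cancel rather than all aligning in one direction.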

Example (Application to Inner Product)

Computing $s = \sum_{i=1}^n x_i y_i$ in floating point:

  1. Each product $x_i y_i$ carries relative error $\leq \epsilon_{\text{mach}}$
  2. Summing the $n$ products takes $n-1$ additions, accumulating relative error $\leq (n-1)\epsilon_{\text{mach}}$
  3. Total relative error: $|\text{fl}(s) - s|/|s| \leq n\epsilon_{\text{mach}} + O(\epsilon_{\text{mach}}^2)$, assuming no cancellation ($\sum_i |x_i y_i| \approx |s|$)

For IEEE 754 double precision with $n = 10^6$, this gives relative error $\approx 2.22 \times 10^{-10}$, losing about 6 decimal digits from machine precision.
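The inner product bound can be checked numerically. A sketch with illustrative random data, kept positive so that no cancellation occurs and the first-order bound applies to $|s|$ directly:

```python
import random
import sys
from fractions import Fraction

eps_mach = sys.float_info.epsilon
random.seed(0)
n = 1000

# Positive data, so sum|x_i y_i| = |s| and the relative bound holds for s.
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]

# Naive floating point inner product: each term passes through one
# multiplication and up to n-1 additions.
s_float = 0.0
for xi, yi in zip(xs, ys):
    s_float += xi * yi

# Exact inner product over the (already rounded) double inputs.
s_exact = sum(Fraction(xi) * Fraction(yi) for xi, yi in zip(xs, ys))
rel_err = abs(Fraction(s_float) - s_exact) / s_exact

assert rel_err <= n * Fraction(eps_mach)  # first-order bound from the example
print(float(rel_err), n * eps_mach)
```

Typical observed errors grow more like $\sqrt{n}\,\epsilon_{\text{mach}}$ due to statistical cancellation of rounding errors, so the $n\epsilon_{\text{mach}}$ bound is pessimistic but never violated.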

The theorem assumes correct rounding, which IEEE 754 guarantees for the basic operations. Extended precision arithmetic and fused multiply-add (FMA) can achieve tighter bounds: FMA computes $x \cdot y + z$ with only a single rounding error rather than two.
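The difference between one rounding and two can be made visible with a contrived but instructive choice of inputs (the specific values below are illustrative): the intermediate product $x \cdot y = 1 - 2^{-60}$ rounds to exactly $1.0$, so the naive evaluation cancels to zero, while a single-rounding FMA would return the exact answer $-2^{-60}$.

```python
from fractions import Fraction

# All three values are exactly representable as doubles (p = 53).
x, y, z = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0

# Naive evaluation rounds twice: x*y = 1 - 2**-60 rounds up to 1.0,
# so the subsequent addition cancels to exactly 0.0.
naive = x * y + z

# Exact value of x*y + z over the given doubles, via rational arithmetic.
exact = Fraction(x) * Fraction(y) + Fraction(z)   # equals -2**-60

print(naive, float(exact))
```

On Python 3.13 and later, `math.fma(x, y, z)` performs the fused operation and returns $-2^{-60}$ exactly, since that result is representable; the naive two-rounding evaluation loses all significant digits here.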

Understanding this theorem enables rigorous error analysis: backward error analysis shows that an algorithm's computed result is the exact result for slightly perturbed inputs, with the perturbation bounded by a small multiple of $\epsilon_{\text{mach}}$. Forward error analysis then uses condition numbers to bound the output error in terms of such input perturbations. Together, these techniques form the theoretical foundation of numerical stability analysis.