Proof: Fundamental Theorem of Rounding

We prove the fundamental theorem establishing the basic error bound for floating point representation and arithmetic operations.

Theorem (Fundamental Theorem of Rounding)

Statement: For $x \in \mathbb{R}$ with $\beta^L \leq |x| \leq \beta^U$, the floating point representation satisfies $\text{fl}(x) = x(1 + \delta)$ where $|\delta| \leq \frac{1}{2}\epsilon_{\text{mach}} = \frac{1}{2}\beta^{1-p}$.

Part 1: Representation Error

Without loss of generality, assume $x > 0$. Write $x$ in normalized form: $x = \beta^e \times (d_0.d_1d_2d_3\ldots)_\beta$ where $d_0 \neq 0$ and $e \in [L, U]$.

The floating point representation truncates after $p$ digits: $\text{fl}(x) = \beta^e \times (d_0.d_1\ldots d_{p-1})_\beta$ for round-toward-zero. The absolute error is: $|x - \text{fl}(x)| = \beta^e \times (0.0\ldots 0d_pd_{p+1}\ldots)_\beta = \beta^e \cdot \beta^{1-p} \cdot (0.d_pd_{p+1}\ldots)_\beta \leq \beta^e \cdot \beta^{1-p}$ since $(0.d_pd_{p+1}\ldots)_\beta < 1$.

Since $|x| \geq \beta^e \cdot 1$ (the leading digit satisfies $d_0 \geq 1$), the relative error under truncation is: $\frac{|x - \text{fl}(x)|}{|x|} \leq \frac{\beta^e \cdot \beta^{1-p}}{\beta^e \cdot 1} = \beta^{1-p} = \epsilon_{\text{mach}}$
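The truncation bound can be spot-checked numerically. The sketch below is an illustration of my own, not part of the proof: it builds a toy base-$\beta$, $p$-digit round-toward-zero operator using Python's exact `Fraction` arithmetic (ignoring exponent-range limits) and verifies the relative error stays below $\beta^{1-p} = \epsilon_{\text{mach}}$.

```python
from fractions import Fraction

def fl_trunc(x, beta=10, p=4):
    """Toy round-toward-zero to p base-beta significant digits.

    Assumes x > 0 and ignores exponent-range limits."""
    m, e = Fraction(x), 0
    while m >= beta:                  # normalize so that 1 <= m < beta
        m /= beta
        e += 1
    while m < 1:
        m *= beta
        e -= 1
    # Keep p significant digits by truncating m * beta**(p-1) to an integer.
    return Fraction(int(m * beta ** (p - 1))) * Fraction(beta) ** (e - (p - 1))

beta, p = 10, 4
eps_mach = Fraction(beta) ** (1 - p)              # beta^(1-p)
for x in [Fraction(31415926, 10**7), Fraction(27, 7), Fraction(999999, 1000)]:
    rel = abs(x - fl_trunc(x, beta, p)) / x
    assert rel < eps_mach                         # truncation: |delta| < eps_mach
```

For example, with $\beta = 10$ and $p = 4$ the value $3.1415926$ truncates to $3.141$, a relative error of about $1.9 \times 10^{-4}$, within the bound $10^{-3}$.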

For round-to-nearest, we round to the closest representable number. The maximum absolute error is half the gap between consecutive floating point numbers with exponent $e$: $|x - \text{fl}(x)| \leq \frac{1}{2}\beta^e \cdot \beta^{1-p}$

Since $x \geq \beta^e \cdot 1$, the relative error satisfies: $\frac{|x - \text{fl}(x)|}{|x|} \leq \frac{\beta^e \cdot \beta^{1-p}/2}{\beta^e \cdot 1} = \frac{1}{2}\beta^{1-p} = \frac{1}{2}\epsilon_{\text{mach}}$

Thus $\text{fl}(x) = x(1 + \delta)$ where $|\delta| \leq \frac{1}{2}\epsilon_{\text{mach}}$.
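In IEEE binary64 ($\beta = 2$, $p = 53$, so $\frac{1}{2}\epsilon_{\text{mach}} = 2^{-53}$), conversion to `float` rounds to nearest, so the Part 1 bound can be checked exactly with `Fraction`. A sketch under those assumptions; the seed and sample ranges are arbitrary choices of mine:

```python
from fractions import Fraction
import random

# binary64: beta = 2, p = 53, so (1/2) * eps_mach = 2**-53.
u = Fraction(1, 2**53)

random.seed(0)                                   # arbitrary fixed seed
for _ in range(1000):
    # An exact rational that is generally not representable in binary64.
    x = Fraction(random.randint(1, 10**12), random.randint(1, 10**12))
    flx = Fraction(float(x))                     # float() rounds to nearest
    delta = (flx - x) / x                        # fl(x) = x * (1 + delta)
    assert abs(delta) <= u
```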

Part 2: Arithmetic Operations

Consider multiplication: $z = x \times y$. In exact arithmetic: $z = \beta^{e_x + e_y} \times (d_0^x.d_1^x\ldots)_\beta \cdot (d_0^y.d_1^y\ldots)_\beta$

The exact product mantissa has up to $2p$ significant digits. Rounding to $p$ digits: $\text{fl}(z) = z(1 + \delta_{\text{round}})$ where $|\delta_{\text{round}}| \leq \frac{1}{2}\epsilon_{\text{mach}}$ by Part 1.

Since the product mantissa lies in $[1, \beta^2)$, the result may need normalization (shifting the exponent up by at most 1) before rounding. A single rounding of the exact normalized product gives $|\delta| \leq \frac{1}{2}\epsilon_{\text{mach}}$; accounting for a potential additional intermediate rounding, we use the conservative model $\text{fl}(x \times y) = (x \times y)(1 + \delta)$ where $|\delta| \leq \epsilon_{\text{mach}}$.
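A small decimal illustration of the double-width product (my own toy example with $\beta = 10$, $p = 4$, not from the proof): the exact product of two 4-digit significands can need $2p = 8$ significant digits, and one round-to-nearest step back to $p$ digits stays within $\frac{1}{2}\epsilon_{\text{mach}}$.

```python
from fractions import Fraction

# Two p = 4 digit decimal significands (beta = 10).
a = b = Fraction(9999, 1000)                 # 9.999
exact = a * b                                # 99.980001: 2p = 8 significant digits
assert exact == Fraction(99980001, 10**6)

# Round back to 4 significant digits (round-to-nearest): 99.98.
rounded = Fraction(round(exact * 100), 100)
delta = abs(rounded - exact) / exact
assert delta <= Fraction(1, 2) * Fraction(10) ** (1 - 4)   # (1/2) * eps_mach
```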

For addition $z = x + y$, assume $|x| \geq |y|$ and write: $y = \beta^{e_x} \times (d_0^y.d_1^y\ldots)_\beta \times \beta^{e_y - e_x}$

Aligning exponents requires shifting $y$, then adding mantissas. The sum is: $z = \beta^{e_x} \times [(d_0^x.d_1^x\ldots)_\beta + (d_0^y.d_1^y\ldots)_\beta \times \beta^{e_y - e_x}]$

Rounding the result to $p$ digits introduces relative error $\leq \frac{1}{2}\epsilon_{\text{mach}}$; allowing additionally for rounding of the shifted digits of $y$ during alignment, we get: $\text{fl}(x + y) = (x + y)(1 + \delta)$ where $|\delta| \leq \epsilon_{\text{mach}}$.

Subtraction and division follow similar arguments.
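The model for all four operations can be checked empirically in IEEE binary64, where each hardware operation rounds the exact result once, so in fact $|\delta| \leq 2^{-53} = \frac{1}{2}\epsilon_{\text{mach}} \leq \epsilon_{\text{mach}}$. A sketch using exact `Fraction` arithmetic; the seed and sample ranges are arbitrary choices of mine:

```python
from fractions import Fraction
import operator
import random

u = Fraction(1, 2**53)        # binary64 unit roundoff; u = (1/2) * eps_mach

random.seed(1)                # arbitrary fixed seed
for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    for _ in range(500):
        x = random.uniform(1e-3, 1e3)          # inputs are exact binary64 values
        y = random.uniform(1e-3, 1e3)
        exact = op(Fraction(x), Fraction(y))   # infinite-precision result
        computed = Fraction(op(x, y))          # one correctly rounded hardware op
        if exact != 0:
            delta = (computed - exact) / exact
            assert abs(delta) <= u             # fl(x op y) = (x op y)(1 + delta)
```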

Remark

Tightness: The bound is asymptotically tight. For example, $x = 1 + \frac{1}{2}\beta^{1-p}$ lies exactly halfway between the consecutive representable numbers $1$ and $1 + \beta^{1-p}$, so round-to-nearest moves it by $\frac{1}{2}\beta^{1-p}$, giving relative error $\frac{1}{2}\beta^{1-p}/(1 + \frac{1}{2}\beta^{1-p}) \approx \frac{1}{2}\epsilon_{\text{mach}}$ for small $\epsilon_{\text{mach}}$.
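The near-worst case is easy to exhibit in binary64, where the midpoint between $1$ and its successor $1 + 2^{-52}$ is $1 + 2^{-53}$; ties-to-even rounds it down to $1$. A sketch of my own under those assumptions:

```python
from fractions import Fraction

# binary64: eps_mach = 2**-52, so the round-to-nearest bound is u = 2**-53.
u = Fraction(1, 2**53)

# Midpoint between the neighbours 1 and 1 + 2**-52; ties-to-even rounds it to 1.
x = 1 + Fraction(1, 2**53)
flx = Fraction(float(x))
assert flx == 1

delta = abs(flx - x) / x                 # equals u / (1 + u)
assert u * (1 - u) < delta < u           # the bound u is approached, not exceeded
```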

This proof establishes the foundation for all floating point error analysis. The model $\text{fl}(x \circ y) = (x \circ y)(1 + \delta)$ with $|\delta| \leq \epsilon_{\text{mach}}$ enables compositional reasoning about algorithm accuracy through systematic tracking of accumulated rounding errors.
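As one illustration of such compositional tracking (my own sketch, assuming binary64 and arbitrary random data): recursive summation of $n$ terms composes $n-1$ factors $(1 + \delta_i)$, giving a first-order relative error bound of about $(n-1)u$ with $u = 2^{-53}$.

```python
from fractions import Fraction
import random

u = Fraction(1, 2**53)                    # binary64 unit roundoff
random.seed(2)                            # arbitrary fixed seed
xs = [random.uniform(0.0, 1.0) for _ in range(1000)]

s = 0.0
for v in xs:
    s += v                                # each += contributes a (1 + delta_i) factor

exact = sum(Fraction(v) for v in xs)      # inputs are exact binary64 values
rel = abs(Fraction(s) - exact) / exact

# Composing n-1 factors (1 + delta_i) gives, to first order, rel <= (n-1)*u.
assert rel <= (len(xs) - 1) * u
```

In practice the error on random data is far below this worst-case bound, since individual rounding errors tend to cancel rather than accumulate in one direction.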