Proof: Fundamental Theorem of Rounding

We prove the fundamental theorem establishing the basic error bound for floating point representation and arithmetic operations.

Theorem (Fundamental Theorem of Rounding)

Statement: For $x \in \mathbb{R}$ with $\beta^L \leq |x| \leq \beta^U$, the floating point representation satisfies $\text{fl}(x) = x(1 + \delta)$ where $|\delta| \leq \frac{1}{2}\epsilon_{\text{mach}} = \frac{1}{2}\beta^{1-p}$.

Part 1: Representation Error

Without loss of generality, assume $x > 0$. Write $x$ in normalized form: $x = \beta^e \times (d_0.d_1d_2d_3\ldots)_\beta$ where $d_0 \neq 0$ and $e \in [L, U]$.

The floating point representation truncates after $p$ digits: $\text{fl}(x) = \beta^e \times (d_0.d_1\ldots d_{p-1})_\beta$ for round-toward-zero. The absolute error is: $|x - \text{fl}(x)| = \beta^e \times (0.0\ldots 0d_pd_{p+1}\ldots)_\beta = \beta^e \cdot \beta^{1-p} \cdot (0.d_pd_{p+1}\ldots)_\beta \leq \beta^e \cdot \beta^{1-p}$ since $(0.d_pd_{p+1}\ldots)_\beta < 1$.

Since $|x| \geq \beta^e \cdot 1$ (the leading digit satisfies $d_0 \geq 1$), the relative error under truncation is: $\frac{|x - \text{fl}(x)|}{|x|} \leq \frac{\beta^e \cdot \beta^{1-p}}{\beta^e \cdot 1} = \beta^{1-p} = \epsilon_{\text{mach}}$
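The truncation bound can be spot-checked numerically. The sketch below is an illustration of my own, not part of the proof: it builds a toy base-$\beta$, $p$-digit round-toward-zero operator using Python's exact `Fraction` arithmetic (ignoring exponent-range limits) and verifies the relative error stays below $\beta^{1-p} = \epsilon_{\text{mach}}$.

```python
from fractions import Fraction

def fl_trunc(x, beta=10, p=4):
    """Toy round-toward-zero to p base-beta significant digits.

    Assumes x > 0 and ignores exponent-range limits."""
    m, e = Fraction(x), 0
    while m >= beta:                  # normalize so that 1 <= m < beta
        m /= beta
        e += 1
    while m < 1:
        m *= beta
        e -= 1
    # Keep p significant digits by truncating m * beta**(p-1) to an integer.
    return Fraction(int(m * beta ** (p - 1))) * Fraction(beta) ** (e - (p - 1))

beta, p = 10, 4
eps_mach = Fraction(beta) ** (1 - p)              # beta^(1-p)
for x in [Fraction(31415926, 10**7), Fraction(27, 7), Fraction(999999, 1000)]:
    rel = abs(x - fl_trunc(x, beta, p)) / x
    assert rel < eps_mach                         # truncation: |delta| < eps_mach
```

For example, with $\beta = 10$ and $p = 4$ the value $3.1415926$ truncates to $3.141$, a relative error of about $1.9 \times 10^{-4}$, within the bound $10^{-3}$.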

For round-to-nearest, we round to the closest representable number. The maximum absolute error is half the gap between consecutive floating point numbers with exponent $e$: $|x - \text{fl}(x)| \leq \frac{1}{2}\beta^e \cdot \beta^{1-p}$

Since $x \geq \beta^e \cdot 1$, the relative error satisfies: $\frac{|x - \text{fl}(x)|}{|x|} \leq \frac{\beta^e \cdot \beta^{1-p}/2}{\beta^e \cdot 1} = \frac{1}{2}\beta^{1-p} = \frac{1}{2}\epsilon_{\text{mach}}$

Thus $\text{fl}(x) = x(1 + \delta)$ where $|\delta| \leq \frac{1}{2}\epsilon_{\text{mach}}$.
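In IEEE binary64 ($\beta = 2$, $p = 53$, so $\frac{1}{2}\epsilon_{\text{mach}} = 2^{-53}$), conversion to `float` rounds to nearest, so the Part 1 bound can be checked exactly with `Fraction`. A sketch under those assumptions; the seed and sample ranges are arbitrary choices of mine:

```python
from fractions import Fraction
import random

# binary64: beta = 2, p = 53, so (1/2) * eps_mach = 2**-53.
u = Fraction(1, 2**53)

random.seed(0)                                   # arbitrary fixed seed
for _ in range(1000):
    # An exact rational that is generally not representable in binary64.
    x = Fraction(random.randint(1, 10**12), random.randint(1, 10**12))
    flx = Fraction(float(x))                     # float() rounds to nearest
    delta = (flx - x) / x                        # fl(x) = x * (1 + delta)
    assert abs(delta) <= u
```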

Part 2: Arithmetic Operations

Consider multiplication: $z = x \times y$. In exact arithmetic: $z = \beta^{e_x + e_y} \times (d_0^x.d_1^x\ldots)_\beta \cdot (d_0^y.d_1^y\ldots)_\beta$

The exact product mantissa has up to $2p$ significant digits. Rounding to $p$ digits: $\text{fl}(z) = z(1 + \delta_{\text{round}})$ where $|\delta_{\text{round}}| \leq \frac{1}{2}\epsilon_{\text{mach}}$ by Part 1.

Since the product mantissa lies in $[1, \beta^2)$, the result may need normalization (shifting the exponent up by at most 1) before rounding. A single rounding of the exact normalized product gives $|\delta| \leq \frac{1}{2}\epsilon_{\text{mach}}$; accounting for a potential additional intermediate rounding, we use the conservative model $\text{fl}(x \times y) = (x \times y)(1 + \delta)$ where $|\delta| \leq \epsilon_{\text{mach}}$.
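A small decimal illustration of the double-width product (my own toy example with $\beta = 10$, $p = 4$, not from the proof): the exact product of two 4-digit significands can need $2p = 8$ significant digits, and one round-to-nearest step back to $p$ digits stays within $\frac{1}{2}\epsilon_{\text{mach}}$.

```python
from fractions import Fraction

# Two p = 4 digit decimal significands (beta = 10).
a = b = Fraction(9999, 1000)                 # 9.999
exact = a * b                                # 99.980001: 2p = 8 significant digits
assert exact == Fraction(99980001, 10**6)

# Round back to 4 significant digits (round-to-nearest): 99.98.
rounded = Fraction(round(exact * 100), 100)
delta = abs(rounded - exact) / exact
assert delta <= Fraction(1, 2) * Fraction(10) ** (1 - 4)   # (1/2) * eps_mach
```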

For addition $z = x + y$, assume $|x| \geq |y|$ and write: $y = \beta^{e_x} \times (d_0^y.d_1^y\ldots)_\beta \times \beta^{e_y - e_x}$

Aligning exponents requires shifting $y$, then adding mantissas. The sum is: $z = \beta^{e_x} \times [(d_0^x.d_1^x\ldots)_\beta + (d_0^y.d_1^y\ldots)_\beta \times \beta^{e_y - e_x}]$

Rounding the result to $p$ digits introduces relative error $\leq \frac{1}{2}\epsilon_{\text{mach}}$; allowing additionally for rounding of the shifted digits of $y$ during alignment, we get: $\text{fl}(x + y) = (x + y)(1 + \delta)$ where $|\delta| \leq \epsilon_{\text{mach}}$.

Subtraction and division follow similar arguments.
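The model for all four operations can be checked empirically in IEEE binary64, where each hardware operation rounds the exact result once, so in fact $|\delta| \leq 2^{-53} = \frac{1}{2}\epsilon_{\text{mach}} \leq \epsilon_{\text{mach}}$. A sketch using exact `Fraction` arithmetic; the seed and sample ranges are arbitrary choices of mine:

```python
from fractions import Fraction
import operator
import random

u = Fraction(1, 2**53)        # binary64 unit roundoff; u = (1/2) * eps_mach

random.seed(1)                # arbitrary fixed seed
for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    for _ in range(500):
        x = random.uniform(1e-3, 1e3)          # inputs are exact binary64 values
        y = random.uniform(1e-3, 1e3)
        exact = op(Fraction(x), Fraction(y))   # infinite-precision result
        computed = Fraction(op(x, y))          # one correctly rounded hardware op
        if exact != 0:
            delta = (computed - exact) / exact
            assert abs(delta) <= u             # fl(x op y) = (x op y)(1 + delta)
```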

Remark

Tightness: The bound is asymptotically tight. For example, $x = 1 + \frac{1}{2}\beta^{1-p}$ lies exactly halfway between the consecutive representable numbers $1$ and $1 + \beta^{1-p}$, so round-to-nearest moves it by $\frac{1}{2}\beta^{1-p}$, giving relative error $\frac{1}{2}\beta^{1-p}/(1 + \frac{1}{2}\beta^{1-p}) \approx \frac{1}{2}\epsilon_{\text{mach}}$ for small $\epsilon_{\text{mach}}$.
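The near-worst case is easy to exhibit in binary64, where the midpoint between $1$ and its successor $1 + 2^{-52}$ is $1 + 2^{-53}$; ties-to-even rounds it down to $1$. A sketch of my own under those assumptions:

```python
from fractions import Fraction

# binary64: eps_mach = 2**-52, so the round-to-nearest bound is u = 2**-53.
u = Fraction(1, 2**53)

# Midpoint between the neighbours 1 and 1 + 2**-52; ties-to-even rounds it to 1.
x = 1 + Fraction(1, 2**53)
flx = Fraction(float(x))
assert flx == 1

delta = abs(flx - x) / x                 # equals u / (1 + u)
assert u * (1 - u) < delta < u           # the bound u is approached, not exceeded
```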

This proof establishes the foundation for all floating point error analysis. The model $\text{fl}(x \circ y) = (x \circ y)(1 + \delta)$ with $|\delta| \leq \epsilon_{\text{mach}}$ enables compositional reasoning about algorithm accuracy through systematic tracking of accumulated rounding errors.
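As one illustration of such compositional tracking (my own sketch, assuming binary64 and arbitrary random data): recursive summation of $n$ terms composes $n-1$ factors $(1 + \delta_i)$, giving a first-order relative error bound of about $(n-1)u$ with $u = 2^{-53}$.

```python
from fractions import Fraction
import random

u = Fraction(1, 2**53)                    # binary64 unit roundoff
random.seed(2)                            # arbitrary fixed seed
xs = [random.uniform(0.0, 1.0) for _ in range(1000)]

s = 0.0
for v in xs:
    s += v                                # each += contributes a (1 + delta_i) factor

exact = sum(Fraction(v) for v in xs)      # inputs are exact binary64 values
rel = abs(Fraction(s) - exact) / exact

# Composing n-1 factors (1 + delta_i) gives, to first order, rel <= (n-1)*u.
assert rel <= (len(xs) - 1) * u
```

In practice the error on random data is far below this worst-case bound, since individual rounding errors tend to cancel rather than accumulate in one direction.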