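Statement: For $x \in \mathbb{R}$ with $\beta^L \le |x| \le \beta^U$, the floating point representation satisfies $\mathrm{fl}(x) = x(1+\delta)$ where $|\delta| \le \tfrac{1}{2}\epsilon_{\text{mach}} = \tfrac{1}{2}\beta^{1-p}$.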
Part 1: Representation Error
Without loss of generality, assume $x > 0$. Write $x$ in normalized form:
$$x = \beta^e \times (d_0.d_1 d_2 d_3 \ldots)_\beta$$
where $d_0 \ne 0$ and $e \in [L, U]$.
Under round-toward-zero, the floating point representation truncates after $p$ digits:
$$\mathrm{fl}(x) = \beta^e \times (d_0.d_1 \ldots d_{p-1})_\beta$$
The absolute error is:
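$$|x - \mathrm{fl}(x)| = \beta^e \times (0.00\ldots0\,d_p d_{p+1}\ldots)_\beta = \beta^e \times \beta^{1-p} \times (0.d_p d_{p+1}\ldots)_\beta \le \beta^e \times \beta^{1-p}$$
since the first discarded digit $d_p$ sits in position $\beta^{-p}$ and $(0.d_p d_{p+1}\ldots)_\beta < 1$.
Because $d_0 \ge 1$, we have $|x| \ge \beta^e \cdot 1$, so the relative error is:
$$\frac{|x - \mathrm{fl}(x)|}{|x|} \le \frac{\beta^e \cdot \beta^{1-p}}{\beta^e \cdot 1} = \beta^{1-p} = \epsilon_{\text{mach}}$$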
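The truncation case can be checked numerically. The following is a minimal sketch, assuming a toy decimal system with $\beta = 10$ and $p = 4$ (values chosen only for illustration) and using Python's `decimal` module with `ROUND_DOWN` to model chopping. The helper name `chop_rel_error` and the sample inputs are hypothetical. It measures the relative error exactly with `Fraction` and compares it with $\beta^{1-p}$.

```python
from decimal import Context, Decimal, ROUND_DOWN
from fractions import Fraction

BETA, P = 10, 4  # assumed toy system: base 10, 4 significant digits
CHOP = Context(prec=P, rounding=ROUND_DOWN)  # keeps the first P digits, discards the rest
EPS_MACH = Fraction(1, BETA ** (P - 1))      # beta^(1-p) = 10^-3

def chop_rel_error(text: str) -> Fraction:
    """Exact relative error |x - fl(x)| / |x| when x is chopped to P digits."""
    x = Fraction(Decimal(text))                 # exact value of the input
    fl_x = Fraction(CHOP.create_decimal(text))  # chopped value, converted exactly
    return abs(x - fl_x) / abs(x)

# Illustrative inputs; "1.0009999" is picked to land close to the bound.
samples = ["1.0009999", "1.9999999", "3.14159265", "123456.789", "0.000271828"]
errors = {s: chop_rel_error(s) for s in samples}
worst = max(errors.values())

for s, err in errors.items():
    print(f"{s:>12}  relative error = {float(err):.3e}")
print("worst case below beta^(1-p):", worst <= EPS_MACH)
```

In this sketch the input `1.0009999` loses almost a full unit in the last kept digit, so its relative error exceeds $\tfrac{1}{2}\beta^{1-p}$ while staying below $\beta^{1-p}$.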
For round-to-nearest, we round to the closest representable number. The maximum absolute error is half the gap between consecutive floating point numbers:
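$$|x - \mathrm{fl}(x)| \le \tfrac{1}{2}\,\beta^e \cdot \beta^{1-p}$$
Since $x \ge \beta^e \cdot 1$, the relative error satisfies:
$$\frac{|x - \mathrm{fl}(x)|}{|x|} \le \frac{\beta^e \cdot \beta^{1-p}/2}{\beta^e \cdot 1} = \tfrac{1}{2}\beta^{1-p} = \tfrac{1}{2}\epsilon_{\text{mach}}$$
Thus $\mathrm{fl}(x) = x(1+\delta)$ where $|\delta| \le \tfrac{1}{2}\epsilon_{\text{mach}}$.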
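The round-to-nearest bound can be illustrated on real hardware. This is a sketch under the assumption that Python's `float` is an IEEE 754 binary64 number ($\beta = 2$, $p = 53$) and that converting a `Fraction` to `float` rounds to the nearest representable value, as CPython does. The helper name `rel_error` and the sample values are hypothetical.

```python
import sys
from fractions import Fraction

BETA = sys.float_info.radix           # 2 on IEEE 754 binary64 platforms
P = sys.float_info.mant_dig           # 53 on IEEE 754 binary64 platforms
EPS_MACH = Fraction(BETA) ** (1 - P)  # beta^(1-p), kept as an exact rational

def rel_error(x: Fraction) -> Fraction:
    """Exact |x - fl(x)| / |x|, where fl(x) is x rounded to a float."""
    fl_x = Fraction(float(x))  # round to a float, then recover that float exactly
    return abs(x - fl_x) / abs(x)

# Illustrative inputs; the last one lies halfway between two adjacent floats.
samples = [
    Fraction(1, 3),
    Fraction(1, 10),
    Fraction(22, 7),
    Fraction(10**20 + 1, 3),
    1 + Fraction(1, 2**53),
]
deltas = [rel_error(x) for x in samples]

for x, d in zip(samples, deltas):
    print(f"x = {float(x):.17g}  |delta| / eps_mach = {float(d / EPS_MACH):.4f}")
print("all within eps_mach / 2:", all(d <= EPS_MACH / 2 for d in deltas))
```

The ratio printed for each sample is $|\delta| / \epsilon_{\text{mach}}$, so the bound corresponds to values no larger than 0.5.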
Part 2: Arithmetic Operations
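Consider multiplication of two floating point numbers: $z = x \times y$. In exact arithmetic:
$$z = \beta^{e_x + e_y} \times (d_0^x.d_1^x \ldots)_\beta\,(d_0^y.d_1^y \ldots)_\beta$$
The exact product of two $p$-digit mantissas has up to $2p$ significant digits. Rounding to $p$ digits:
$$\mathrm{fl}(z) = z(1+\delta_{\text{round}})$$
where $|\delta_{\text{round}}| \le \tfrac{1}{2}\epsilon_{\text{mach}}$ by Part 1.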
Since the result may need normalization (shifting the exponent by at most 1), and the normalized result is then rounded, we get:
$$\mathrm{fl}(x \times y) = (x \times y)(1+\delta)$$
where $|\delta| \le \epsilon_{\text{mach}}$, accounting for potential additional rounding.
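The multiplication bound can be spot-checked in the same spirit. This is a minimal sketch, assuming IEEE 754 binary64 `float` values and operands drawn from a range where neither overflow nor underflow occurs. The helper name `mul_delta`, the seed, and the sampling range are arbitrary illustrative choices. Each computed product is compared with the exact rational product.

```python
import random
import sys
from fractions import Fraction

EPS_MACH = Fraction(sys.float_info.epsilon)  # beta^(1-p) for the platform float

def mul_delta(x: float, y: float) -> Fraction:
    """Exact delta such that fl(x*y) = (x*y)(1 + delta), for nonzero x*y."""
    exact = Fraction(x) * Fraction(y)  # product with no rounding
    computed = Fraction(x * y)         # hardware product, converted exactly
    return (computed - exact) / exact

rng = random.Random(0)  # fixed seed so the run is repeatable
pairs = [(rng.uniform(1.0, 1000.0), rng.uniform(1.0, 1000.0)) for _ in range(1000)]
worst = max(abs(mul_delta(x, y)) for x, y in pairs)

print(f"worst |delta| / eps_mach over {len(pairs)} products: {float(worst / EPS_MACH):.4f}")
print("within eps_mach:", worst <= EPS_MACH)
```

On hardware that rounds each product once to nearest, the observed ratio would be expected to stay at or below 0.5, which is consistent with the looser bound $\epsilon_{\text{mach}}$ stated above.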
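For addition $z = x + y$, assume $|x| \ge |y|$ and write:
$$y = \beta^{e_x} \times (d_0^y.d_1^y \ldots)_\beta \times \beta^{e_y - e_x}$$
Aligning exponents requires shifting $y$, then adding mantissas. The sum is:
$$z = \beta^{e_x} \times \left[(d_0^x.d_1^x \ldots)_\beta + (d_0^y.d_1^y \ldots)_\beta \times \beta^{e_y - e_x}\right]$$
Rounding the result to $p$ digits introduces error $\le \tfrac{1}{2}\epsilon_{\text{mach}}$ relative to the sum, giving:
$$\mathrm{fl}(x + y) = (x + y)(1+\delta)$$
where $|\delta| \le \epsilon_{\text{mach}}$.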
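The addition bound can be examined the same way. This is a sketch assuming IEEE 754 binary64 `float` values. The helper name `add_delta` and the operand pairs are hypothetical, picked to cover widely different exponents and near-cancellation. Note that $\delta$ is measured relative to the exact sum of the two floating point operands, not relative to the operands themselves.

```python
import sys
from fractions import Fraction

EPS_MACH = Fraction(sys.float_info.epsilon)  # beta^(1-p) for the platform float

def add_delta(x: float, y: float) -> Fraction:
    """Exact delta such that fl(x+y) = (x+y)(1 + delta), for nonzero x+y."""
    exact = Fraction(x) + Fraction(y)  # sum with no rounding
    computed = Fraction(x + y)         # hardware sum, converted exactly
    return (computed - exact) / exact

# Illustrative operand pairs: similar magnitudes, large exponent gaps, cancellation.
pairs = [(0.1, 0.2), (1.0, 1e-16), (1e16, 1.0), (1.0, -0.9999999), (3.5, 2.25)]
deltas = [add_delta(x, y) for x, y in pairs]

for (x, y), d in zip(pairs, deltas):
    print(f"{x!r} + {y!r}: |delta| / eps_mach = {float(abs(d) / EPS_MACH):.4f}")
print("all within eps_mach:", all(abs(d) <= EPS_MACH for d in deltas))
```

The pair `(1.0, 1e-16)` shows the effect of exponent alignment: the smaller operand is shifted entirely past the last kept digit, yet the error relative to the sum remains within the bound.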
Subtraction and division follow similar arguments.