
Floating Point Representation

Computer arithmetic differs fundamentally from exact mathematical arithmetic due to finite precision. Understanding floating point representation is crucial for analyzing numerical algorithms and their stability properties.

Definition: Floating Point Number System

A floating point number system $\mathbb{F}(\beta, p, L, U)$ is characterized by:

  • Base $\beta \geq 2$ (typically 2 or 10)
  • Precision $p \geq 1$ (number of significant digits)
  • Exponent range $[L, U]$ where $L < 0 < U$

A normalized floating point number has the form
$$x = \pm \beta^e \times (d_0.d_1d_2\cdots d_{p-1})_\beta,$$
where $e \in [L, U]$ is the exponent, $0 \leq d_i < \beta$ for all $i$, and $d_0 \neq 0$ for normalized numbers.
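To make the definition concrete, the following Python sketch enumerates every positive normalized number of a toy system; the parameters $\beta = 2$, $p = 3$, $L = -1$, $U = 1$ are chosen here for readability and are not from the text.

```python
# Enumerate the positive normalized numbers of a toy system F(beta=2, p=3, L=-1, U=1).
beta, p, L, U = 2, 3, -1, 1

values = set()
for e in range(L, U + 1):
    # d0 != 0 forces the significand into [1, beta),
    # i.e. integer significands m in [beta^(p-1), beta^p).
    for m in range(beta ** (p - 1), beta ** p):
        significand = m / beta ** (p - 1)   # d0.d1...d_{p-1} as a real number
        values.add(significand * beta ** e)

print(sorted(values))
```

Printing the sorted values shows the characteristic non-uniform spacing: the gap between consecutive numbers doubles at each power of $\beta$.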

The IEEE 754 standard defines two primary formats: single precision (binary32) with $\beta = 2$, $p = 24$, and double precision (binary64) with $\beta = 2$, $p = 53$. The standard also specifies special values including $\pm\infty$, NaN (Not a Number), and subnormal numbers for gradual underflow.
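As a quick check of these parameters, Python's standard library reports the characteristics of the platform's `float` type; the commented values assume an IEEE 754 binary64 platform.

```python
import sys

# binary64 parameters as reported by the Python runtime (IEEE 754 platforms).
# mant_dig counts the implicit leading bit, so p = 53.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.radix)      # 2  (the base beta)
print(sys.float_info.max_exp)    # 1024
print(sys.float_info.min_exp)    # -1021

# Special values mandated by the standard.
print(float('inf'), float('-inf'), float('nan'))
```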

Example: IEEE 754 Double Precision

In double precision floating point:

  • 1 sign bit
  • 11 exponent bits (biased by 1023)
  • 52 fraction bits (plus 1 implicit leading bit)

The number $x = 0.1_{10}$ cannot be represented exactly. Its binary expansion is
$$0.1 = (0.0\overline{0011})_2 = 2^{-4} \times (1.1001100110011\ldots)_2.$$

The stored approximation after rounding to 53 bits introduces a relative error of approximately $5.5 \times 10^{-17}$.
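The rounding can be observed directly in Python: converting the stored double to `Decimal` prints it exactly, exposing the digits introduced by rounding. This is an illustrative sketch, not part of the original example.

```python
from decimal import Decimal

# Decimal(0.1) converts the stored double exactly, exposing the rounding.
print(Decimal(0.1))   # 0.1000000000000000055511151231257827021181583404541015625
print((0.1).hex())    # 0x1.999999999999ap-4  (note the rounded-up final digit)

# Relative error of the stored approximation versus the exact value 1/10.
exact = Decimal(1) / Decimal(10)
rel_err = abs(Decimal(0.1) - exact) / exact
print(rel_err)        # about 5.55e-17, matching the figure above
```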

Remark

Machine Epsilon: The gap between 1 and the next representable floating point number is $\epsilon_{\text{mach}} = \beta^{1-p}$. For IEEE 754 double precision, $\epsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}$. This quantity bounds the relative rounding error: if $x \in \mathbb{R}$ is within the range of $\mathbb{F}$, then $\text{fl}(x) = x(1 + \delta)$ with $|\delta| \leq \epsilon_{\text{mach}}$; under round-to-nearest the sharper bound $|\delta| \leq \epsilon_{\text{mach}}/2$ (the unit roundoff) holds.
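The gap can be found empirically by halving a candidate until adding half of it to 1 no longer changes the result; this standard sketch is ours, not from the source.

```python
import sys

# Halve eps until 1.0 + eps/2 rounds back to 1.0; the loop exits at the gap.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                             # 2.220446049250313e-16
print(eps == 2.0 ** -52)               # True
print(eps == sys.float_info.epsilon)   # True
```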

The finite precision of floating point systems introduces two critical phenomena: overflow when $|x| > \beta^U$ and underflow when $0 < |x| < \beta^L$. Modern systems handle these through special values (infinity) and subnormal numbers respectively, though subnormals sacrifice precision for range.
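Both phenomena are easy to trigger in double precision; the following Python demonstration (a sketch under IEEE 754 binary64 assumptions) uses `sys.float_info` for the format's limits.

```python
import sys

# Overflow: exceeding the largest finite double produces infinity.
largest = sys.float_info.max          # about 1.798e308
print(largest * 2)                    # inf

# Gradual underflow: subnormals extend the range below the smallest
# normalized number (2**-1022) at the cost of lost significant bits.
smallest_normal = sys.float_info.min  # 2**-1022
tiny = 2.0 ** -1074                   # smallest positive subnormal
print(0 < tiny < smallest_normal)     # True
print(tiny / 2)                       # 0.0 -- underflows completely
```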

Understanding these limitations is essential for designing robust numerical algorithms. Simple algebraic identities like $a + (b + c) = (a + b) + c$ fail in floating point arithmetic, requiring careful algorithm design to minimize accumulation of rounding errors through techniques like Kahan summation and compensated arithmetic.
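Since the text names Kahan summation, here is a minimal Python sketch of it; the function name `kahan_sum` is ours, not from the source.

```python
def kahan_sum(xs):
    """Compensated (Kahan) summation: carries rounding error forward."""
    total = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - c            # apply the correction from the previous step
        t = total + y        # low-order bits of y may be lost here...
        c = (t - total) - y  # ...but can be recovered algebraically
        total = t
    return total

xs = [0.1] * 10
print(sum(xs))               # 0.9999999999999999 -- naive accumulation drifts
print(kahan_sum(xs))         # 1.0 -- the correctly rounded result
```

Note that the compensation step relies on the subtractions being evaluated exactly as written; an optimizing compiler with unsafe floating point flags could reassociate `(t - total) - y` to zero, but a Python interpreter evaluates it faithfully.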