
Convergence Theorem for Markov Chains

The convergence theorem describes the long-term behavior of the transition probabilities $p_{ij}^{(n)}$ for irreducible, aperiodic, positive recurrent Markov chains. It states that the distribution of the chain converges to the unique stationary distribution, regardless of the initial state.


The main result

Theorem 1.1 (Convergence to stationarity)

Let $(X_n)_{n \geq 0}$ be an irreducible, aperiodic, positive recurrent Markov chain on a countable state space $S$ with unique stationary distribution $\pi$. Then for all states $i, j \in S$:

$$\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j.$$

Moreover, when $S$ is finite, the convergence is uniform in $i$: for any $\varepsilon > 0$, there exists $N$ such that for all $n \geq N$ and all $i \in S$:

$$|p_{ij}^{(n)} - \pi_j| < \varepsilon.$$

This theorem says that no matter where the chain starts, the probability of being in state $j$ after $n$ steps converges to $\pi_j$ as $n \to \infty$. The initial distribution is "forgotten" asymptotically.
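As a quick numerical illustration, one can watch the rows of $P^n$ collapse onto the stationary distribution. The transition matrix below is a hypothetical example chosen for illustration, not one from the text:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1);
# the chain is irreducible and aperiodic since all entries are positive.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi /= pi.sum()

# Every row of P^n approaches pi, so the starting state i is "forgotten".
Pn = np.linalg.matrix_power(P, 50)
print(np.abs(Pn - pi).max())  # very small
```

After 50 steps all three rows agree with $\pi$ to many decimal places, exactly as the theorem predicts.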

Remark (Why aperiodicity is required)

Aperiodicity ensures that $p_{ij}^{(n)}$ does not oscillate as $n \to \infty$. For periodic chains, the sequence $\{p_{ij}^{(n)}\}$ may fail to converge (e.g., the cyclic chain on $\{0, 1\}$), but the CesΓ ro averages $\frac{1}{n}\sum_{k=0}^{n-1} p_{ij}^{(k)}$ still converge to $\pi_j$.


Proof outline

The proof combines several key ingredients:

Step 1: Coupling argument. Construct two independent copies of the chain, $(X_n)$ and $(Y_n)$, starting from states $i$ and $i'$ respectively. By irreducibility and aperiodicity, the product chain $(X_n, Y_n)$ is irreducible and recurrent, so the two copies meet in finite time with probability one. From the first meeting time onward, run the two chains together; by the Markov property they then agree forever.
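A minimal simulation of this step, on a hypothetical two-state chain (the matrix below is illustrative, not from the text): run two independent copies from different starting states and record when they first meet.

```python
import random

# Hypothetical irreducible, aperiodic 2-state chain.
P = [[0.7, 0.3], [0.4, 0.6]]

def step(state, rng):
    """One transition of the chain from the given state."""
    return 0 if rng.random() < P[state][0] else 1

def coupling_time(x0, y0, rng, horizon=10_000):
    """Run two independent copies until they first meet."""
    x, y = x0, y0
    for n in range(1, horizon + 1):
        x, y = step(x, rng), step(y, rng)
        if x == y:
            return n  # after meeting, the copies can be run together forever
    return None

rng = random.Random(0)
times = [coupling_time(0, 1, rng) for _ in range(1000)]
print(all(t is not None for t in times))  # every pair of copies met
```

In every trial the two copies meet quickly, which is the event the coupling proof relies on.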

Step 2: Total variation distance. Define the total variation distance between distributions $\mu$ and $\nu$ as

$$\|\mu - \nu\|_{TV} = \frac{1}{2} \sum_{j \in S} |\mu_j - \nu_j|.$$

Then, by the coupling inequality,

$$\|\mathbb{P}(X_n \in \cdot \mid X_0 = i) - \pi\|_{TV} \leq \mathbb{P}(X_n \neq Y_n),$$

where YnY_n is an independent copy starting from the stationary distribution.
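The total variation distance is straightforward to compute for finite chains. The sketch below uses the two-state chain analyzed later in the examples, with illustrative values $\alpha = 0.3$, $\beta = 0.4$:

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance: half the L1 distance between distributions."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

# Two-state chain with illustrative parameters.
alpha, beta = 0.3, 0.4
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)

# The TV distance from stationarity shrinks geometrically with n.
dists = [tv_distance(np.linalg.matrix_power(P, n)[0], pi) for n in range(1, 6)]
print(dists)  # strictly decreasing toward 0
```

For this chain the distance is exactly $|\lambda_2|^n \cdot \alpha/(\alpha+\beta)$ with $\lambda_2 = 1 - \alpha - \beta$, so the printed values fall by a factor of $0.3$ each step.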

Step 3: Exponential decay. On a finite state space (or under an additional Doeblin-type minorization condition), there exists $\rho < 1$ such that

$$\|\mathbb{P}(X_n \in \cdot \mid X_0 = i) - \pi\|_{TV} \leq C \rho^n$$

for some constant $C$. This is the geometric ergodicity property. The rate $\rho$ depends on the spectral gap of the transition matrix.

Step 4: Pointwise convergence. Since the total variation distance controls pointwise differences,

$$|p_{ij}^{(n)} - \pi_j| \leq 2 \|\mathbb{P}(X_n \in \cdot \mid X_0 = i) - \pi\|_{TV} \to 0.$$

Remark (Finite state space)

For irreducible, aperiodic chains on a finite state space, the convergence is always geometrically fast: there exist constants $C$ and $\rho < 1$ such that

$$\max_{i,j} |p_{ij}^{(n)} - \pi_j| \leq C \rho^n.$$

The rate $\rho$ is the second-largest eigenvalue of $P$ in absolute value.


Examples

Example (Two-state chain)

For the transition matrix

$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix},$$

the eigenvalues are $\lambda_1 = 1$ and $\lambda_2 = 1 - \alpha - \beta$. The stationary distribution is $\pi = (\beta/(\alpha+\beta), \alpha/(\alpha+\beta))$. We have

$$P^n = \frac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ \beta & \alpha \end{pmatrix} + \frac{(1-\alpha-\beta)^n}{\alpha+\beta} \begin{pmatrix} \alpha & -\alpha \\ -\beta & \beta \end{pmatrix}.$$

Thus,

$$p_{ij}^{(n)} = \pi_j + O((1-\alpha-\beta)^n).$$

The convergence rate is $\rho = |1 - \alpha - \beta|$.
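The closed form for $P^n$ is easy to verify numerically. The sketch below uses illustrative values $\alpha = 0.2$, $\beta = 0.5$:

```python
import numpy as np

# Two-state chain with illustrative parameters.
alpha, beta = 0.2, 0.5
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])

# The two matrices appearing in the closed form for P^n.
A = np.array([[beta, alpha], [beta, alpha]]) / (alpha + beta)
B = np.array([[alpha, -alpha], [-beta, beta]]) / (alpha + beta)

# Compare the closed form against direct matrix powers.
for n in range(6):
    closed_form = A + (1 - alpha - beta) ** n * B
    assert np.allclose(np.linalg.matrix_power(P, n), closed_form)
print("closed form matches matrix powers")
```

The first matrix has both rows equal to $\pi$, and the second term decays like $(1-\alpha-\beta)^n$, which is exactly the $O((1-\alpha-\beta)^n)$ error above.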

Example (Lazy random walk)

A lazy random walk on a graph $G$ stays at the current vertex with probability $1/2$ and moves to a uniformly random neighbor with probability $1/2$. This modification ensures aperiodicity: $p_{ii}^{(n)} \geq (1/2)^n > 0$ for all $n$, so every state is aperiodic.

For the lazy walk on the cycle $\mathbb{Z}/N\mathbb{Z}$, the stationary distribution is uniform: $\pi_i = 1/N$. The convergence rate is governed by the second eigenvalue of the transition matrix, which is $(1 + \cos(2\pi/N))/2 \approx 1 - \pi^2/N^2$ for large $N$.
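The second eigenvalue of the lazy cycle walk can be checked directly; the matrix here is built from the definition above (stay with probability $1/2$, step to either neighbor with probability $1/4$ each):

```python
import numpy as np

# Lazy random walk on the cycle Z/NZ.
N = 50
P = np.zeros((N, N))
for i in range(N):
    P[i, i] = 0.5
    P[i, (i + 1) % N] = 0.25
    P[i, (i - 1) % N] = 0.25

# P is symmetric here, so eigvalsh applies; sort eigenvalues descending.
eigvals = np.sort(np.linalg.eigvalsh(P))[::-1]
lambda2 = eigvals[1]
approx = 1 - np.pi**2 / N**2
print(lambda2, approx)  # nearly equal for this N
```

For $N = 50$ the exact second eigenvalue and the approximation $1 - \pi^2/N^2$ agree to about five decimal places.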


Mixing time

Definition 1.1 (Mixing time)

The mixing time of a Markov chain is the smallest $t$ such that

$$\max_{i \in S} \|\mathbb{P}(X_t \in \cdot \mid X_0 = i) - \pi\|_{TV} \leq \frac{1}{4}.$$

Equivalently, after time $t$, the distribution is within total variation distance $1/4$ of the stationary distribution, regardless of the starting state.

The mixing time quantifies how long it takes for the chain to "forget" its initial condition.
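For a finite chain, the definition translates directly into code: take matrix powers until the worst-case row is within $1/4$ of $\pi$ in total variation. The slow two-state chain below (with illustrative $\alpha = \beta = 0.1$) is an assumed example:

```python
import numpy as np

def mixing_time(P, pi, eps=0.25, t_max=10_000):
    """Smallest t with max_i TV(P^t[i, :], pi) <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        # TV distance of each row from pi, take the worst starting state.
        if 0.5 * np.abs(Pt - pi).sum(axis=1).max() <= eps:
            return t
    return None

# Hypothetical two-state chain: small alpha, beta means slow mixing.
alpha = beta = 0.1
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
pi = np.array([0.5, 0.5])
print(mixing_time(P, pi))  # -> 4
```

Here the TV distance from state $0$ is $\tfrac{1}{2}(0.8)^t$, which first drops below $1/4$ at $t = 4$.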

Example (Mixing time of random walk on the hypercube)

The hypercube $\{0,1\}^d$ has $2^d$ vertices. The lazy random walk on the hypercube stays put with probability $1/2$ and otherwise flips a uniformly random coordinate (the non-lazy walk is periodic, since every step changes the parity of the number of ones). The stationary distribution is uniform: $\pi(x) = 2^{-d}$.

The mixing time is $O(d \log d)$: after $c\, d \log d$ steps (for a suitable constant $c$), the distribution is close to uniform. This is a foundational result in the theory of random walks on graphs.


Rate of convergence and the spectral gap

Definition 1.2 (Spectral gap)

For a finite-state reversible Markov chain with eigenvalues $1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_n \geq -1$, the spectral gap is

$$\gamma = 1 - \lambda_2.$$

A larger spectral gap implies faster convergence to stationarity.

Theorem 1.2 (Spectral gap and mixing time)

For a reversible Markov chain on a finite state space, let $\gamma_* = 1 - \max_{k \geq 2} |\lambda_k|$ denote the absolute spectral gap (for a lazy chain all eigenvalues are nonnegative, so $\gamma_* = \gamma$). The mixing time $t_{\text{mix}}$ satisfies

$$\left(\frac{1}{\gamma_*} - 1\right) \log 2 \;\leq\; t_{\text{mix}} \;\leq\; \frac{1}{\gamma_*} \log\left(\frac{4}{\pi_{\min}}\right),$$

where $\pi_{\min} = \min_{i \in S} \pi_i$.

This theorem connects the algebraic property (spectral gap) to the probabilistic property (mixing time). Techniques from spectral graph theory and linear algebra provide powerful tools for bounding mixing times.

Example (Random walk on the complete graph)

On the complete graph $K_n$ (all pairs of vertices connected), a random walk moves to a uniformly random neighbor at each step. The stationary distribution is uniform: $\pi_i = 1/n$. The nontrivial eigenvalues are all equal to $-1/(n-1)$, so the absolute spectral gap is $\gamma_* = 1 - 1/(n-1)$, and mixing is essentially immediate: after a single step the total variation distance to uniform is already $1/n$.
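A short check of the one-step claim, for an illustrative value $n = 10$:

```python
import numpy as np

# Random walk on the complete graph K_n: from any vertex, move to a
# uniformly random one of the other n - 1 vertices.
n = 10
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)
pi = np.full(n, 1 / n)

# Total variation distance to uniform after a single step.
tv_one_step = 0.5 * np.abs(P[0] - pi).sum()
print(tv_one_step)  # equals 1/n = 0.1
```

The distance is $\tfrac{1}{2}\big(\tfrac{1}{n} + (n-1)\cdot\tfrac{1}{n(n-1)}\big) = \tfrac{1}{n}$, already below $1/4$ for $n \geq 4$.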


Periodic chains and CesΓ ro convergence

Theorem 1.3 (CesΓ ro ergodic theorem)

For any irreducible, positive recurrent Markov chain (possibly periodic), the CesΓ ro averages converge:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} p_{ij}^{(k)} = \pi_j.$$

This weaker form of convergence holds even for periodic chains. However, for practical applications (e.g., MCMC), we usually modify the chain to be aperiodic (e.g., by adding a self-loop with positive probability).

Example (Deterministic cycle)

On the cycle $\{0, 1, 2\}$ with deterministic transitions $0 \to 1 \to 2 \to 0$, the stationary distribution is uniform: $\pi = (1/3, 1/3, 1/3)$. However,

$$p_{00}^{(n)} = \begin{cases} 1 & \text{if } n \equiv 0 \pmod{3}, \\ 0 & \text{otherwise}. \end{cases}$$

The sequence oscillates, but

$$\frac{1}{n} \sum_{k=0}^{n-1} p_{00}^{(k)} \to \frac{1}{3}.$$
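This CesΓ ro behavior is easy to observe numerically for the deterministic 3-cycle:

```python
import numpy as np

# Deterministic cycle 0 -> 1 -> 2 -> 0: p00^(n) oscillates between 0 and 1,
# but its Cesaro average converges to 1/3.
P = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])

n = 3000
p00 = [np.linalg.matrix_power(P, k)[0, 0] for k in range(n)]
cesaro = sum(p00) / n
print(cesaro)  # 1/3 (n is a multiple of 3, so the average is exact)
```

The sequence $p_{00}^{(k)}$ is $1, 0, 0, 1, 0, 0, \dots$, so exactly one third of the terms are $1$ and the running average settles at $1/3$.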


Summary

The convergence theorem describes the asymptotic behavior of Markov chains:

  • Irreducible + aperiodic + positive recurrent: $p_{ij}^{(n)} \to \pi_j$ as $n \to \infty$.
  • Geometric convergence: For finite state spaces, the rate is exponential, governed by the spectral gap.
  • Mixing time: Quantifies how long it takes to reach near-stationarity.
  • CesΓ ro convergence: Weaker result that holds even for periodic chains.

These results are central to applications in MCMC, statistical physics (Gibbs measures), and randomized algorithms (e.g., approximate counting).