
Convergence Theorem for Markov Chains

The convergence theorem describes the long-term behavior of the transition probabilities $p_{ij}^{(n)}$ for irreducible, aperiodic, positive recurrent Markov chains. It states that the distribution of the chain converges to the unique stationary distribution, regardless of the initial state.


The main result

Theorem 1.1 (Convergence to stationarity)

Let $(X_n)_{n \geq 0}$ be an irreducible, aperiodic, positive recurrent Markov chain on a countable state space $S$ with unique stationary distribution $\pi$. Then for all states $i, j \in S$:

$$\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j.$$

Moreover, when $S$ is finite, the convergence is uniform in $i$: for any $\varepsilon > 0$, there exists $N$ such that for all $n \geq N$ and all $i \in S$:

$$|p_{ij}^{(n)} - \pi_j| < \varepsilon.$$

This theorem says that no matter where the chain starts, the probability of being in state $j$ after $n$ steps converges to $\pi_j$ as $n \to \infty$. The initial distribution is "forgotten" asymptotically.
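As a quick numerical illustration, one can watch the rows of $P^n$ collapse onto the stationary distribution. The transition matrix below is a hypothetical example chosen for illustration, not one from the text:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1);
# the chain is irreducible and aperiodic since all entries are positive.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi /= pi.sum()

# Every row of P^n approaches pi, so the starting state i is "forgotten".
Pn = np.linalg.matrix_power(P, 50)
print(np.abs(Pn - pi).max())  # very small
```

After 50 steps all three rows agree with $\pi$ to many decimal places, exactly as the theorem predicts.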

Remark (Why aperiodicity is required)

Aperiodicity ensures that $p_{ij}^{(n)}$ does not oscillate as $n \to \infty$. For periodic chains, the sequence $\{p_{ij}^{(n)}\}$ may fail to converge (e.g., the cyclic chain on $\{0, 1\}$), but the CesΓ ro averages $\frac{1}{n}\sum_{k=0}^{n-1} p_{ij}^{(k)}$ still converge to $\pi_j$.


Proof outline

The proof combines several key ingredients:

Step 1: Coupling argument. Construct two independent copies of the chain, $(X_n)$ and $(Y_n)$, starting from states $i$ and $i'$ respectively. By irreducibility and aperiodicity, the product chain $(X_n, Y_n)$ is irreducible and recurrent, so the two copies meet in finite time with probability one. From the first meeting time onward, run the two chains together; by the Markov property they then agree forever.
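A minimal simulation of this step, on a hypothetical two-state chain (the matrix below is illustrative, not from the text): run two independent copies from different starting states and record when they first meet.

```python
import random

# Hypothetical irreducible, aperiodic 2-state chain.
P = [[0.7, 0.3], [0.4, 0.6]]

def step(state, rng):
    """One transition of the chain from the given state."""
    return 0 if rng.random() < P[state][0] else 1

def coupling_time(x0, y0, rng, horizon=10_000):
    """Run two independent copies until they first meet."""
    x, y = x0, y0
    for n in range(1, horizon + 1):
        x, y = step(x, rng), step(y, rng)
        if x == y:
            return n  # after meeting, the copies can be run together forever
    return None

rng = random.Random(0)
times = [coupling_time(0, 1, rng) for _ in range(1000)]
print(all(t is not None for t in times))  # every pair of copies met
```

In every trial the two copies meet quickly, which is the event the coupling proof relies on.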

Step 2: Total variation distance. Define the total variation distance between distributions $\mu$ and $\nu$ as

$$\|\mu - \nu\|_{TV} = \frac{1}{2} \sum_{j \in S} |\mu_j - \nu_j|.$$

Then, by the coupling inequality,

$$\|\mathbb{P}(X_n \in \cdot \mid X_0 = i) - \pi\|_{TV} \leq \mathbb{P}(X_n \neq Y_n),$$

where YnY_n is an independent copy starting from the stationary distribution.
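The total variation distance is straightforward to compute for finite chains. The sketch below uses the two-state chain analyzed later in the examples, with illustrative values $\alpha = 0.3$, $\beta = 0.4$:

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance: half the L1 distance between distributions."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

# Two-state chain with illustrative parameters.
alpha, beta = 0.3, 0.4
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)

# The TV distance from stationarity shrinks geometrically with n.
dists = [tv_distance(np.linalg.matrix_power(P, n)[0], pi) for n in range(1, 6)]
print(dists)  # strictly decreasing toward 0
```

For this chain the distance is exactly $|\lambda_2|^n \cdot \alpha/(\alpha+\beta)$ with $\lambda_2 = 1 - \alpha - \beta$, so the printed values fall by a factor of $0.3$ each step.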

Step 3: Exponential decay. On a finite state space (or under an additional Doeblin-type minorization condition), there exists $\rho < 1$ such that

$$\|\mathbb{P}(X_n \in \cdot \mid X_0 = i) - \pi\|_{TV} \leq C \rho^n$$

for some constant $C$. This is the geometric ergodicity property. The rate $\rho$ depends on the spectral gap of the transition matrix.

Step 4: Pointwise convergence. Since the total variation distance controls pointwise differences,

$$|p_{ij}^{(n)} - \pi_j| \leq 2 \|\mathbb{P}(X_n \in \cdot \mid X_0 = i) - \pi\|_{TV} \to 0.$$

Remark (Finite state space)

For irreducible, aperiodic chains on a finite state space, the convergence is always geometrically fast: there exist constants $C$ and $\rho < 1$ such that

$$\max_{i,j} |p_{ij}^{(n)} - \pi_j| \leq C \rho^n.$$

The rate $\rho$ is the second-largest eigenvalue of $P$ in absolute value.


Examples

Example (Two-state chain)

For the transition matrix

$$P = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix},$$

the eigenvalues are $\lambda_1 = 1$ and $\lambda_2 = 1 - \alpha - \beta$. The stationary distribution is $\pi = (\beta/(\alpha+\beta), \alpha/(\alpha+\beta))$. We have

$$P^n = \frac{1}{\alpha+\beta} \begin{pmatrix} \beta & \alpha \\ \beta & \alpha \end{pmatrix} + \frac{(1-\alpha-\beta)^n}{\alpha+\beta} \begin{pmatrix} \alpha & -\alpha \\ -\beta & \beta \end{pmatrix}.$$

Thus,

$$p_{ij}^{(n)} = \pi_j + O((1-\alpha-\beta)^n).$$

The convergence rate is $\rho = |1 - \alpha - \beta|$.
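The closed form for $P^n$ is easy to verify numerically. The sketch below uses illustrative values $\alpha = 0.2$, $\beta = 0.5$:

```python
import numpy as np

# Two-state chain with illustrative parameters.
alpha, beta = 0.2, 0.5
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])

# The two matrices appearing in the closed form for P^n.
A = np.array([[beta, alpha], [beta, alpha]]) / (alpha + beta)
B = np.array([[alpha, -alpha], [-beta, beta]]) / (alpha + beta)

# Compare the closed form against direct matrix powers.
for n in range(6):
    closed_form = A + (1 - alpha - beta) ** n * B
    assert np.allclose(np.linalg.matrix_power(P, n), closed_form)
print("closed form matches matrix powers")
```

The first matrix has both rows equal to $\pi$, and the second term decays like $(1-\alpha-\beta)^n$, which is exactly the $O((1-\alpha-\beta)^n)$ error above.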

Example (Lazy random walk)

A lazy random walk on a graph $G$ stays at the current vertex with probability $1/2$ and moves to a uniformly random neighbor with probability $1/2$. This modification ensures aperiodicity: $p_{ii}^{(n)} \geq (1/2)^n > 0$ for all $n$, so every state is aperiodic.

For the lazy walk on the cycle $\mathbb{Z}/N\mathbb{Z}$, the stationary distribution is uniform: $\pi_i = 1/N$. The convergence rate is governed by the second eigenvalue of the transition matrix, which is $(1 + \cos(2\pi/N))/2 \approx 1 - \pi^2/N^2$ for large $N$.
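The second eigenvalue of the lazy cycle walk can be checked directly; the matrix here is built from the definition above (stay with probability $1/2$, step to either neighbor with probability $1/4$ each):

```python
import numpy as np

# Lazy random walk on the cycle Z/NZ.
N = 50
P = np.zeros((N, N))
for i in range(N):
    P[i, i] = 0.5
    P[i, (i + 1) % N] = 0.25
    P[i, (i - 1) % N] = 0.25

# P is symmetric here, so eigvalsh applies; sort eigenvalues descending.
eigvals = np.sort(np.linalg.eigvalsh(P))[::-1]
lambda2 = eigvals[1]
approx = 1 - np.pi**2 / N**2
print(lambda2, approx)  # nearly equal for this N
```

For $N = 50$ the exact second eigenvalue and the approximation $1 - \pi^2/N^2$ agree to about five decimal places.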


Mixing time

Definition 1.1 (Mixing time)

The mixing time of a Markov chain is the smallest $t$ such that

$$\max_{i \in S} \|\mathbb{P}(X_t \in \cdot \mid X_0 = i) - \pi\|_{TV} \leq \frac{1}{4}.$$

Equivalently, after time $t$, the distribution is within total variation distance $1/4$ of the stationary distribution, regardless of the starting state.

The mixing time quantifies how long it takes for the chain to "forget" its initial condition.
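For a finite chain, the definition translates directly into code: take matrix powers until the worst-case row is within $1/4$ of $\pi$ in total variation. The slow two-state chain below (with illustrative $\alpha = \beta = 0.1$) is an assumed example:

```python
import numpy as np

def mixing_time(P, pi, eps=0.25, t_max=10_000):
    """Smallest t with max_i TV(P^t[i, :], pi) <= eps."""
    Pt = np.eye(len(pi))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        # TV distance of each row from pi, take the worst starting state.
        if 0.5 * np.abs(Pt - pi).sum(axis=1).max() <= eps:
            return t
    return None

# Hypothetical two-state chain: small alpha, beta means slow mixing.
alpha = beta = 0.1
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
pi = np.array([0.5, 0.5])
print(mixing_time(P, pi))  # -> 4
```

Here the TV distance from state $0$ is $\tfrac{1}{2}(0.8)^t$, which first drops below $1/4$ at $t = 4$.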

Example (Mixing time of random walk on the hypercube)

The hypercube $\{0,1\}^d$ has $2^d$ vertices. The lazy random walk on the hypercube stays put with probability $1/2$ and otherwise flips a uniformly random coordinate (the non-lazy walk is periodic, since every step changes the parity of the number of ones). The stationary distribution is uniform: $\pi(x) = 2^{-d}$.

The mixing time is $O(d \log d)$: after $c\, d \log d$ steps (for a suitable constant $c$), the distribution is close to uniform. This is a foundational result in the theory of random walks on graphs.


Rate of convergence and the spectral gap

Definition 1.2 (Spectral gap)

For a finite-state reversible Markov chain with eigenvalues $1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_n \geq -1$, the spectral gap is

$$\gamma = 1 - \lambda_2.$$

A larger spectral gap implies faster convergence to stationarity.

Theorem 1.2 (Spectral gap and mixing time)

For a reversible Markov chain on a finite state space, let $\gamma_* = 1 - \max_{k \geq 2} |\lambda_k|$ denote the absolute spectral gap (for a lazy chain all eigenvalues are nonnegative, so $\gamma_* = \gamma$). The mixing time $t_{\text{mix}}$ satisfies

$$\left(\frac{1}{\gamma_*} - 1\right) \log 2 \;\leq\; t_{\text{mix}} \;\leq\; \frac{1}{\gamma_*} \log\left(\frac{4}{\pi_{\min}}\right),$$

where $\pi_{\min} = \min_{i \in S} \pi_i$.

This theorem connects the algebraic property (spectral gap) to the probabilistic property (mixing time). Techniques from spectral graph theory and linear algebra provide powerful tools for bounding mixing times.

Example (Random walk on the complete graph)

On the complete graph $K_n$ (all pairs of vertices connected), a random walk moves to a uniformly random neighbor at each step. The stationary distribution is uniform: $\pi_i = 1/n$. The nontrivial eigenvalues are all equal to $-1/(n-1)$, so the absolute spectral gap is $\gamma_* = 1 - 1/(n-1)$, and mixing is essentially immediate: after a single step the total variation distance to uniform is already $1/n$.
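A short check of the one-step claim, for an illustrative value $n = 10$:

```python
import numpy as np

# Random walk on the complete graph K_n: from any vertex, move to a
# uniformly random one of the other n - 1 vertices.
n = 10
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)
pi = np.full(n, 1 / n)

# Total variation distance to uniform after a single step.
tv_one_step = 0.5 * np.abs(P[0] - pi).sum()
print(tv_one_step)  # equals 1/n = 0.1
```

The distance is $\tfrac{1}{2}\big(\tfrac{1}{n} + (n-1)\cdot\tfrac{1}{n(n-1)}\big) = \tfrac{1}{n}$, already below $1/4$ for $n \geq 4$.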


Periodic chains and CesΓ ro convergence

Theorem 1.3 (CesΓ ro ergodic theorem)

For any irreducible, positive recurrent Markov chain (possibly periodic), the CesΓ ro averages converge:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} p_{ij}^{(k)} = \pi_j.$$

This weaker form of convergence holds even for periodic chains. However, for practical applications (e.g., MCMC), we usually modify the chain to be aperiodic (e.g., by adding a self-loop with positive probability).

Example (Deterministic cycle)

On the cycle $\{0, 1, 2\}$ with deterministic transitions $0 \to 1 \to 2 \to 0$, the stationary distribution is uniform: $\pi = (1/3, 1/3, 1/3)$. However,

$$p_{00}^{(n)} = \begin{cases} 1 & \text{if } n \equiv 0 \pmod{3}, \\ 0 & \text{otherwise}. \end{cases}$$

The sequence oscillates, but

$$\frac{1}{n} \sum_{k=0}^{n-1} p_{00}^{(k)} \to \frac{1}{3}.$$
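This CesΓ ro behavior is easy to observe numerically for the deterministic 3-cycle:

```python
import numpy as np

# Deterministic cycle 0 -> 1 -> 2 -> 0: p00^(n) oscillates between 0 and 1,
# but its Cesaro average converges to 1/3.
P = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])

n = 3000
p00 = [np.linalg.matrix_power(P, k)[0, 0] for k in range(n)]
cesaro = sum(p00) / n
print(cesaro)  # 1/3 (n is a multiple of 3, so the average is exact)
```

The sequence $p_{00}^{(k)}$ is $1, 0, 0, 1, 0, 0, \dots$, so exactly one third of the terms are $1$ and the running average settles at $1/3$.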


Summary

The convergence theorem describes the asymptotic behavior of Markov chains:

  • Irreducible + aperiodic + positive recurrent: $p_{ij}^{(n)} \to \pi_j$ as $n \to \infty$.
  • Geometric convergence: For finite state spaces, the rate is exponential, governed by the spectral gap.
  • Mixing time: Quantifies how long it takes to reach near-stationarity.
  • CesΓ ro convergence: Weaker result that holds even for periodic chains.

These results are central to applications in MCMC, statistical physics (Gibbs measures), and randomized algorithms (e.g., approximate counting).