
Proof of the Gauss-Markov Theorem

We prove that the ordinary least squares estimator is the best linear unbiased estimator (BLUE) of the regression coefficients.


Proof

Theorem (Gauss-Markov): Under the model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, with $\mathbf{X} \in \mathbb{R}^{n \times p}$ of full column rank, $E[\boldsymbol{\epsilon}] = \mathbf{0}$, and $\operatorname{Cov}(\boldsymbol{\epsilon}) = \sigma^2\mathbf{I}$, the OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ has the property that for any $\mathbf{c} \in \mathbb{R}^p$, $\mathbf{c}^T\hat{\boldsymbol{\beta}}$ has the smallest variance among all linear unbiased estimators of $\mathbf{c}^T\boldsymbol{\beta}$.

Step 1: OLS is linear and unbiased.

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ is clearly a linear function of $\mathbf{Y}$. For unbiasedness:

$$E[\hat{\boldsymbol{\beta}}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{Y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}.$$
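Unbiasedness can also be illustrated by simulation: averaging $\hat{\boldsymbol{\beta}}$ over many datasets generated with mean-zero noise recovers the true $\boldsymbol{\beta}$. The design matrix, coefficients, and replication count below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary full-rank design (n = 5, p = 2) and true coefficients
X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
beta = np.array([1.0, 2.0])
ols_map = np.linalg.inv(X.T @ X) @ X.T  # the linear map Y -> beta_hat

# Average beta_hat over many simulated datasets with E[eps] = 0
M = 20_000
estimates = np.empty((M, 2))
for m in range(M):
    eps = rng.standard_normal(len(X))   # Cov(eps) = I here
    estimates[m] = ols_map @ (X @ beta + eps)

# The Monte Carlo mean of beta_hat is close to the true beta
assert np.allclose(estimates.mean(axis=0), beta, atol=0.1)
```

The loose tolerance accounts for Monte Carlo error; the exact statement $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ is the algebraic identity above.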

Its covariance matrix is:

$$\operatorname{Cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T (\sigma^2\mathbf{I}) \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}.$$
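This sandwich calculation is a pure matrix identity, so it can be checked numerically for any full-rank design; the $\mathbf{X}$ and $\sigma^2$ below are illustrative choices only.

```python
import numpy as np

# Arbitrary full-rank design matrix (n = 5, p = 2) and error variance
X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
sigma2 = 2.0

XtX_inv = np.linalg.inv(X.T @ X)

# Sandwich form: (X^T X)^{-1} X^T (sigma^2 I) X (X^T X)^{-1}
sandwich = XtX_inv @ X.T @ (sigma2 * np.eye(len(X))) @ X @ XtX_inv

# It collapses to sigma^2 (X^T X)^{-1}
assert np.allclose(sandwich, sigma2 * XtX_inv)
```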

Step 2: Consider any other linear unbiased estimator.

Let $\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{Y}$ be another linear estimator of $\boldsymbol{\beta}$. Write $\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}$, where $\mathbf{D} = \mathbf{C} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$.

For unbiasedness: $E[\tilde{\boldsymbol{\beta}}] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$ must hold for all $\boldsymbol{\beta}$, so $\mathbf{C}\mathbf{X} = \mathbf{I}_p$.

Substituting $\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}$:

$$\mathbf{C}\mathbf{X} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X} + \mathbf{D}\mathbf{X} = \mathbf{I} + \mathbf{D}\mathbf{X} = \mathbf{I}.$$

Therefore $\mathbf{D}\mathbf{X} = \mathbf{0}$.
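One concrete family of alternative linear unbiased estimators is weighted least squares, $\mathbf{C} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}$ for a positive-definite $\mathbf{W}$. A quick numeric check (design and weights chosen arbitrarily for illustration) confirms that $\mathbf{C}\mathbf{X} = \mathbf{I}_p$ and hence $\mathbf{D}\mathbf{X} = \mathbf{0}$.

```python
import numpy as np

X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
p = X.shape[1]

# A concrete alternative: weighted least squares with arbitrary positive
# weights (mismatched here, since Cov(eps) = sigma^2 I in this model)
W = np.diag([1.0, 2.0, 0.5, 3.0, 1.5])
C = np.linalg.inv(X.T @ W @ X) @ X.T @ W

# Unbiasedness constraint: C X = I_p
assert np.allclose(C @ X, np.eye(p))

# D = C - (X^T X)^{-1} X^T then satisfies D X = 0
D = C - np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(D @ X, np.zeros((p, p)))
```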

Step 3: Compute the covariance of β~\tilde{\boldsymbol{\beta}}.

Since $\operatorname{Cov}(\mathbf{Y}) = \sigma^2\mathbf{I}$,

$$\operatorname{Cov}(\tilde{\boldsymbol{\beta}}) = \sigma^2 \mathbf{C}\mathbf{C}^T = \sigma^2 \left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\right]\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\right]^T.$$

Expanding:

$$= \sigma^2\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T + \mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\right].$$

Since $\mathbf{D}\mathbf{X} = \mathbf{0}$, the cross terms vanish: $\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \mathbf{0}$ and $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T = \left(\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\right)^T = \mathbf{0}$.

Therefore:

$$\operatorname{Cov}(\tilde{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}^T = \operatorname{Cov}(\hat{\boldsymbol{\beta}}) + \sigma^2\mathbf{D}\mathbf{D}^T.$$

Step 4: Conclude optimality.

The matrix $\mathbf{D}\mathbf{D}^T$ is positive semi-definite, since $\mathbf{v}^T\mathbf{D}\mathbf{D}^T\mathbf{v} = \|\mathbf{D}^T\mathbf{v}\|^2 \geq 0$ for every $\mathbf{v}$. Therefore:

$$\operatorname{Var}(\mathbf{c}^T\tilde{\boldsymbol{\beta}}) = \mathbf{c}^T\operatorname{Cov}(\tilde{\boldsymbol{\beta}})\mathbf{c} = \mathbf{c}^T\operatorname{Cov}(\hat{\boldsymbol{\beta}})\mathbf{c} + \sigma^2\|\mathbf{D}^T\mathbf{c}\|^2 \geq \operatorname{Var}(\mathbf{c}^T\hat{\boldsymbol{\beta}}).$$

Equality holds if and only if $\mathbf{D}^T\mathbf{c} = \mathbf{0}$. If we demand equality for every $\mathbf{c}$, then $\mathbf{D} = \mathbf{0}$ and hence $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}$. $\square$
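The decomposition from Step 3 and the variance inequality from Step 4 can both be checked numerically. The sketch below uses weighted least squares with deliberately mismatched weights as the competing linear unbiased estimator; all matrices are illustrative choices.

```python
import numpy as np

X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
sigma2 = 2.0
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
C_ols = XtX_inv @ X.T

# Competing linear unbiased estimator: WLS with arbitrary (wrong) weights
W = np.diag([1.0, 2.0, 0.5, 3.0, 1.5])
C = np.linalg.inv(X.T @ W @ X) @ X.T @ W
D = C - C_ols

cov_ols = sigma2 * XtX_inv      # Cov(beta_hat)
cov_alt = sigma2 * C @ C.T      # Cov(beta_tilde)

# Step 3: Cov(beta_tilde) = Cov(beta_hat) + sigma^2 D D^T
assert np.allclose(cov_alt, cov_ols + sigma2 * D @ D.T)

# Step 4: Var(c^T beta_tilde) >= Var(c^T beta_hat) for every c
rng = np.random.default_rng(1)
for c in rng.standard_normal((100, p)):
    assert c @ cov_alt @ c >= c @ cov_ols @ c - 1e-12
```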


Remark: The geometric perspective

The fitted values $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix, are the orthogonal projection of $\mathbf{Y}$ onto the column space of $\mathbf{X}$. The Gauss-Markov theorem says this geometric projection minimizes the variance of every linear unbiased estimate of $\mathbf{c}^T\boldsymbol{\beta}$. The residual $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}$ is orthogonal to the column space, which is the Pythagorean identity underlying the ANOVA decomposition $SST = SSR + SSE$ (valid when the model includes an intercept).
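A short numeric sketch (arbitrary design and response, chosen only for illustration) confirms the projection properties and the Pythagorean split:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[1.0, 0.3],        # first column is an intercept
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
Y = X @ np.array([1.0, 2.0]) + rng.standard_normal(len(X))

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix
Y_hat = H @ Y
e = Y - Y_hat

# H is an orthogonal projection: symmetric and idempotent
assert np.allclose(H, H.T) and np.allclose(H @ H, H)

# The residual is orthogonal to the column space of X
assert np.allclose(X.T @ e, 0)

# ANOVA decomposition (the intercept column makes the centered
# sums of squares split exactly): SST = SSR + SSE
Y_bar = Y.mean()
SST = np.sum((Y - Y_bar) ** 2)
SSR = np.sum((Y_hat - Y_bar) ** 2)
SSE = np.sum(e ** 2)
assert np.isclose(SST, SSR + SSE)
```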