
The Gauss-Markov Theorem

The Gauss-Markov theorem establishes that ordinary least squares produces the best estimators among all linear unbiased estimators, justifying the central role of OLS in statistical practice.


The Theorem

Theorem 10.4 (Gauss-Markov Theorem)

In the linear model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with $E[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\operatorname{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}$ (no normality assumed), the OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ is the Best Linear Unbiased Estimator (BLUE) of $\boldsymbol{\beta}$. Specifically, for any linear combination $\mathbf{c}^T\boldsymbol{\beta}$ and any other linear unbiased estimator $\tilde{\theta} = \mathbf{a}^T\mathbf{Y}$ with $E[\tilde{\theta}] = \mathbf{c}^T\boldsymbol{\beta}$ for all $\boldsymbol{\beta}$:

$$\operatorname{Var}(\mathbf{c}^T\hat{\boldsymbol{\beta}}) \leq \operatorname{Var}(\tilde{\theta}).$$

Best means minimum variance, Linear means the estimator is a linear function of $\mathbf{Y}$, and Unbiased means $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$.
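As a numerical sanity check on the theorem's setup, the sketch below (assuming NumPy, a small hypothetical design matrix with an intercept and one standard-normal predictor, and $\sigma = 1$) computes the OLS estimator and compares the theoretical variance $\sigma^2\,\mathbf{c}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{c}$ of a linear combination $\mathbf{c}^T\hat{\boldsymbol{\beta}}$ with a Monte Carlo estimate over repeated noise draws:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Hypothetical design: intercept column plus one standard-normal predictor
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])   # illustrative true coefficients
sigma = 1.0
c = np.array([0.0, 1.0])      # target linear combination: the slope

# Theoretical variance of c^T beta_hat under Cov(eps) = sigma^2 I
XtX_inv = np.linalg.inv(X.T @ X)
var_theory = sigma**2 * c @ XtX_inv @ c

# Monte Carlo: re-draw the noise many times with X held fixed
reps = 20000
ests = np.empty(reps)
for r in range(reps):
    Y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # OLS via normal equations
    ests[r] = c @ beta_hat

print(var_theory, ests.var(), ests.mean())
```

The empirical mean of the estimates should land near the true slope (unbiasedness) and the empirical variance near `var_theory`.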


Implications

Example (Optimality of the sample mean)

For the intercept-only model $Y_i = \mu + \epsilon_i$, the design matrix is $\mathbf{X} = \mathbf{1}_n$ and $\hat{\mu}_{OLS} = \bar{Y}$. The Gauss-Markov theorem says $\bar{Y}$ has the smallest variance among all linear unbiased estimators of $\mu$: no other linear combination $\sum a_i Y_i$ with $\sum a_i = 1$ can have smaller variance.
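This can be seen directly from the variance formulas: any weights summing to one give an unbiased estimator, but $\operatorname{Var}(\sum a_i Y_i) = \sigma^2 \sum a_i^2 \geq \sigma^2/n$ by Cauchy-Schwarz, with equality only at equal weights. A minimal sketch (assuming NumPy and an illustrative $n$ and $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 10, 2.0

# Arbitrary positive weights, normalized to sum to 1 (still unbiased for mu)
w = rng.random(n)
w /= w.sum()

# Var(Ybar) = sigma^2 / n; Var(sum w_i Y_i) = sigma^2 * sum w_i^2
var_mean = sigma**2 / n
var_weighted = sigma**2 * np.sum(w**2)

# Cauchy-Schwarz gives sum w_i^2 >= 1/n, with equality iff every w_i = 1/n,
# so the sample mean attains the minimum variance in this class
print(var_mean, var_weighted)
```

Unequal weights always lose: `var_weighted` exceeds `var_mean` unless the weights are exactly $1/n$ each.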


Extensions

Theorem 10.5 (Generalized Gauss-Markov)

If $\operatorname{Cov}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{V}$ for a known positive definite matrix $\mathbf{V} \neq \mathbf{I}$ (heteroscedasticity or correlation), the BLUE is the generalized least squares (GLS) estimator:

$$\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y}.$$

This reduces to OLS when $\mathbf{V} = \mathbf{I}$.
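Because both estimators are linear, their sampling covariances can be computed exactly without simulation: $\operatorname{Cov}(\hat{\boldsymbol{\beta}}_{OLS}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}$ and $\operatorname{Cov}(\hat{\boldsymbol{\beta}}_{GLS}) = \sigma^2 (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}$. A sketch (assuming NumPy and a hypothetical heteroscedastic variance pattern $v_i = e^{2x_i}$) comparing the two:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
v = np.exp(2 * X[:, 1])        # hypothetical known heteroscedastic variances
V = np.diag(v)
Vinv = np.diag(1.0 / v)
sigma2 = 1.0

# Exact sampling covariances of the two linear estimators
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = sigma2 * XtX_inv @ X.T @ V @ X @ XtX_inv
cov_gls = sigma2 * np.linalg.inv(X.T @ Vinv @ X)

# Sanity check: with V = I the two covariances coincide (GLS reduces to OLS)
assert np.allclose(sigma2 * XtX_inv,
                   sigma2 * XtX_inv @ X.T @ np.eye(n) @ X @ XtX_inv)

print(cov_ols[1, 1], cov_gls[1, 1])  # slope variances: GLS <= OLS
```

The generalized theorem guarantees the GLS variances are no larger than the OLS ones, coordinate by coordinate, under the stated $\mathbf{V}$.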

Remark (Beyond BLUE)

The Gauss-Markov theorem restricts attention to linear unbiased estimators. Biased estimators, such as ridge regression $\hat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}$, can achieve lower MSE by trading a small bias for a large variance reduction, especially when predictors are highly correlated (multicollinearity). This observation motivates the regularization methods that dominate modern statistics and machine learning.
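The bias-variance trade can be made concrete by simulation. The sketch below (assuming NumPy, two nearly collinear hypothetical predictors, and an untuned illustrative penalty $\lambda = 1$) compares the Monte Carlo MSE $E\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}\|^2$ of OLS and ridge:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
# Nearly collinear predictors: both are small perturbations of the same z
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n),
                     z + 0.05 * rng.normal(size=n)])
beta = np.array([1.0, 1.0])    # illustrative true coefficients
lam = 1.0                      # illustrative penalty, not tuned
I = np.eye(p)

reps = 5000
err_ols = np.empty(reps)
err_ridge = np.empty(reps)
for r in range(reps):
    Y = X @ beta + rng.normal(size=n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
    b_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ Y)
    err_ols[r] = np.sum((b_ols - beta) ** 2)
    err_ridge[r] = np.sum((b_ridge - beta) ** 2)

print(err_ols.mean(), err_ridge.mean())
```

Under this collinear design the ridge MSE comes out well below the OLS MSE: the small bias ridge introduces is dwarfed by the variance it removes along the ill-conditioned direction of $\mathbf{X}^T\mathbf{X}$.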