
Proof of the Gauss-Markov Theorem

We prove that the ordinary least squares estimator is the best linear unbiased estimator (BLUE) of the regression coefficients.


Proof

Theorem (Gauss-Markov): Under the model $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, with $\mathbf{X} \in \mathbb{R}^{n \times p}$ of full column rank, $E[\boldsymbol{\epsilon}] = \mathbf{0}$, and $\operatorname{Cov}(\boldsymbol{\epsilon}) = \sigma^2\mathbf{I}$, the OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ has the property that for any $\mathbf{c} \in \mathbb{R}^p$, $\mathbf{c}^T\hat{\boldsymbol{\beta}}$ has the smallest variance among all linear unbiased estimators of $\mathbf{c}^T\boldsymbol{\beta}$.

Step 1: OLS is linear and unbiased.

$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ is clearly a linear function of $\mathbf{Y}$. For unbiasedness:

$$E[\hat{\boldsymbol{\beta}}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{Y}] = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}.$$
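Unbiasedness can also be illustrated by simulation: averaging $\hat{\boldsymbol{\beta}}$ over many datasets generated with mean-zero noise recovers the true $\boldsymbol{\beta}$. The design matrix, coefficients, and replication count below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary full-rank design (n = 5, p = 2) and true coefficients
X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
beta = np.array([1.0, 2.0])
ols_map = np.linalg.inv(X.T @ X) @ X.T  # the linear map Y -> beta_hat

# Average beta_hat over many simulated datasets with E[eps] = 0
M = 20_000
estimates = np.empty((M, 2))
for m in range(M):
    eps = rng.standard_normal(len(X))   # Cov(eps) = I here
    estimates[m] = ols_map @ (X @ beta + eps)

# The Monte Carlo mean of beta_hat is close to the true beta
assert np.allclose(estimates.mean(axis=0), beta, atol=0.1)
```

The loose tolerance accounts for Monte Carlo error; the exact statement $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ is the algebraic identity above.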

Its covariance matrix is:

$$\operatorname{Cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T (\sigma^2\mathbf{I}) \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}.$$
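This sandwich calculation is a pure matrix identity, so it can be checked numerically for any full-rank design; the $\mathbf{X}$ and $\sigma^2$ below are illustrative choices only.

```python
import numpy as np

# Arbitrary full-rank design matrix (n = 5, p = 2) and error variance
X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
sigma2 = 2.0

XtX_inv = np.linalg.inv(X.T @ X)

# Sandwich form: (X^T X)^{-1} X^T (sigma^2 I) X (X^T X)^{-1}
sandwich = XtX_inv @ X.T @ (sigma2 * np.eye(len(X))) @ X @ XtX_inv

# It collapses to sigma^2 (X^T X)^{-1}
assert np.allclose(sandwich, sigma2 * XtX_inv)
```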

Step 2: Consider any other linear unbiased estimator.

Let $\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{Y}$ be another linear estimator of $\boldsymbol{\beta}$. Write $\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}$, where $\mathbf{D} = \mathbf{C} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$.

For unbiasedness: $E[\tilde{\boldsymbol{\beta}}] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$ must hold for all $\boldsymbol{\beta}$, so $\mathbf{C}\mathbf{X} = \mathbf{I}_p$.

Substituting $\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}$:

$$\mathbf{C}\mathbf{X} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X} + \mathbf{D}\mathbf{X} = \mathbf{I} + \mathbf{D}\mathbf{X} = \mathbf{I}.$$

Therefore $\mathbf{D}\mathbf{X} = \mathbf{0}$.
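One concrete family of alternative linear unbiased estimators is weighted least squares, $\mathbf{C} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}$ for a positive-definite $\mathbf{W}$. A quick numeric check (design and weights chosen arbitrarily for illustration) confirms that $\mathbf{C}\mathbf{X} = \mathbf{I}_p$ and hence $\mathbf{D}\mathbf{X} = \mathbf{0}$.

```python
import numpy as np

X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
p = X.shape[1]

# A concrete alternative: weighted least squares with arbitrary positive
# weights (mismatched here, since Cov(eps) = sigma^2 I in this model)
W = np.diag([1.0, 2.0, 0.5, 3.0, 1.5])
C = np.linalg.inv(X.T @ W @ X) @ X.T @ W

# Unbiasedness constraint: C X = I_p
assert np.allclose(C @ X, np.eye(p))

# D = C - (X^T X)^{-1} X^T then satisfies D X = 0
D = C - np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(D @ X, np.zeros((p, p)))
```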

Step 3: Compute the covariance of β~\tilde{\boldsymbol{\beta}}.

Since $\operatorname{Cov}(\mathbf{Y}) = \sigma^2\mathbf{I}$,

$$\operatorname{Cov}(\tilde{\boldsymbol{\beta}}) = \sigma^2 \mathbf{C}\mathbf{C}^T = \sigma^2 \left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\right]\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\right]^T.$$

Expanding:

$$= \sigma^2\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T + \mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\right].$$

Since $\mathbf{D}\mathbf{X} = \mathbf{0}$, the cross terms vanish: $\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \mathbf{0}$ and $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T = \left(\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\right)^T = \mathbf{0}$.

Therefore:

$$\operatorname{Cov}(\tilde{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}^T = \operatorname{Cov}(\hat{\boldsymbol{\beta}}) + \sigma^2\mathbf{D}\mathbf{D}^T.$$

Step 4: Conclude optimality.

The matrix $\mathbf{D}\mathbf{D}^T$ is positive semi-definite, since $\mathbf{v}^T\mathbf{D}\mathbf{D}^T\mathbf{v} = \|\mathbf{D}^T\mathbf{v}\|^2 \geq 0$ for every $\mathbf{v}$. Therefore:

$$\operatorname{Var}(\mathbf{c}^T\tilde{\boldsymbol{\beta}}) = \mathbf{c}^T\operatorname{Cov}(\tilde{\boldsymbol{\beta}})\mathbf{c} = \mathbf{c}^T\operatorname{Cov}(\hat{\boldsymbol{\beta}})\mathbf{c} + \sigma^2\|\mathbf{D}^T\mathbf{c}\|^2 \geq \operatorname{Var}(\mathbf{c}^T\hat{\boldsymbol{\beta}}).$$

Equality holds if and only if $\mathbf{D}^T\mathbf{c} = \mathbf{0}$. If we demand equality for every $\mathbf{c}$, then $\mathbf{D} = \mathbf{0}$ and hence $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}$. $\square$
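The decomposition from Step 3 and the variance inequality from Step 4 can both be checked numerically. The sketch below uses weighted least squares with deliberately mismatched weights as the competing linear unbiased estimator; all matrices are illustrative choices.

```python
import numpy as np

X = np.array([[1.0, 0.3],
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
sigma2 = 2.0
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
C_ols = XtX_inv @ X.T

# Competing linear unbiased estimator: WLS with arbitrary (wrong) weights
W = np.diag([1.0, 2.0, 0.5, 3.0, 1.5])
C = np.linalg.inv(X.T @ W @ X) @ X.T @ W
D = C - C_ols

cov_ols = sigma2 * XtX_inv      # Cov(beta_hat)
cov_alt = sigma2 * C @ C.T      # Cov(beta_tilde)

# Step 3: Cov(beta_tilde) = Cov(beta_hat) + sigma^2 D D^T
assert np.allclose(cov_alt, cov_ols + sigma2 * D @ D.T)

# Step 4: Var(c^T beta_tilde) >= Var(c^T beta_hat) for every c
rng = np.random.default_rng(1)
for c in rng.standard_normal((100, p)):
    assert c @ cov_alt @ c >= c @ cov_ols @ c - 1e-12
```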


Remark: The geometric perspective

The fitted values $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix, are the orthogonal projection of $\mathbf{Y}$ onto the column space of $\mathbf{X}$. The Gauss-Markov theorem says this geometric projection minimizes the variance of every linear unbiased estimate of $\mathbf{c}^T\boldsymbol{\beta}$. The residual $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}$ is orthogonal to the column space, which is the Pythagorean identity underlying the ANOVA decomposition $SST = SSR + SSE$ (valid when the model includes an intercept).
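A short numeric sketch (arbitrary design and response, chosen only for illustration) confirms the projection properties and the Pythagorean split:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[1.0, 0.3],        # first column is an intercept
              [1.0, 1.1],
              [1.0, 2.0],
              [1.0, 2.7],
              [1.0, 3.5]])
Y = X @ np.array([1.0, 2.0]) + rng.standard_normal(len(X))

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix
Y_hat = H @ Y
e = Y - Y_hat

# H is an orthogonal projection: symmetric and idempotent
assert np.allclose(H, H.T) and np.allclose(H @ H, H)

# The residual is orthogonal to the column space of X
assert np.allclose(X.T @ e, 0)

# ANOVA decomposition (the intercept column makes the centered
# sums of squares split exactly): SST = SSR + SSE
Y_bar = Y.mean()
SST = np.sum((Y - Y_bar) ** 2)
SSR = np.sum((Y_hat - Y_bar) ** 2)
SSE = np.sum(e ** 2)
assert np.isclose(SST, SSR + SSE)
```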