Theorem (Gauss–Markov): Under the model $Y = X\beta + \epsilon$, where $X \in \mathbb{R}^{n \times p}$ has full column rank, $E[\epsilon] = 0$, and $\mathrm{Cov}(\epsilon) = \sigma^2 I$, the OLS estimator $\hat{\beta} = (X^T X)^{-1} X^T Y$ has the property that for any $c \in \mathbb{R}^p$, $c^T \hat{\beta}$ has the smallest variance among all linear unbiased estimators of $c^T \beta$.
Step 1: OLS is linear and unbiased.
$\hat{\beta} = (X^T X)^{-1} X^T Y$ is clearly a linear function of $Y$. For unbiasedness:

$$E[\hat{\beta}] = (X^T X)^{-1} X^T E[Y] = (X^T X)^{-1} X^T X \beta = \beta$$

Its covariance matrix is:

$$\mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} X^T (\sigma^2 I)\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$$
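As a quick numerical sanity check (not part of the proof), the two identities above can be verified with numpy on a random design matrix; the dimensions, coefficients, and noise level below are illustrative choices, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 0.5
X = rng.normal(size=(n, p))           # random design; full column rank almost surely
beta = np.array([1.0, -2.0, 0.5])     # illustrative true coefficients

# The linear map Y -> beta_hat is A = (X^T X)^{-1} X^T
A = np.linalg.inv(X.T @ X) @ X.T

# Unbiasedness reduces to the algebraic identity A X = I_p,
# since E[beta_hat] = A E[Y] = A X beta = beta
assert np.allclose(A @ X, np.eye(p))

# Covariance formula: A (sigma^2 I) A^T = sigma^2 (X^T X)^{-1}
cov = sigma**2 * A @ A.T
assert np.allclose(cov, sigma**2 * np.linalg.inv(X.T @ X))
```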
Step 2: Consider any other linear unbiased estimator.
Let $\tilde{\beta} = CY$ be any other linear estimator of $\beta$. Write $C = (X^T X)^{-1} X^T + D$, where $D = C - (X^T X)^{-1} X^T$.

For unbiasedness we need $E[\tilde{\beta}] = C\,E[Y] = CX\beta = \beta$ for all $\beta$, so $CX = I_p$.

Substituting $C = (X^T X)^{-1} X^T + D$:

$$CX = (X^T X)^{-1} X^T X + DX = I + DX = I$$

Therefore $DX = 0$.
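For a concrete instance of such an alternative estimator, weighted least squares with any positive diagonal weight matrix $W$ gives $C = (X^T W X)^{-1} X^T W$, which satisfies $CX = I_p$; the weights below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
W = np.diag(rng.uniform(0.5, 2.0, size=n))    # arbitrary positive weights

# Weighted least squares: another linear unbiased estimator beta_tilde = C Y
C = np.linalg.inv(X.T @ W @ X) @ X.T @ W
assert np.allclose(C @ X, np.eye(p))           # CX = I_p (unbiasedness)

# Its deviation D = C - (X^T X)^{-1} X^T then satisfies D X = 0, as the proof requires
D = C - np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(D @ X, np.zeros((p, p)))
```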
Step 3: Compute the covariance of $\tilde{\beta}$.

$$\mathrm{Cov}(\tilde{\beta}) = \sigma^2 C C^T = \sigma^2 \left[(X^T X)^{-1} X^T + D\right]\left[(X^T X)^{-1} X^T + D\right]^T$$

Expanding:

$$= \sigma^2 \left[(X^T X)^{-1} X^T X (X^T X)^{-1} + (X^T X)^{-1} X^T D^T + DX (X^T X)^{-1} + D D^T\right]$$

Since $DX = 0$, the cross terms vanish: $DX(X^T X)^{-1} = 0$ and $(X^T X)^{-1} X^T D^T = \left(DX(X^T X)^{-1}\right)^T = 0$.

Therefore:

$$\mathrm{Cov}(\tilde{\beta}) = \sigma^2 (X^T X)^{-1} + \sigma^2 D D^T = \mathrm{Cov}(\hat{\beta}) + \sigma^2 D D^T$$
Step 4: Conclude optimality.
The matrix $DD^T$ is positive semi-definite, since $v^T D D^T v = \|D^T v\|^2 \ge 0$ for every $v$. Therefore:

$$\mathrm{Var}(c^T \tilde{\beta}) = c^T \mathrm{Cov}(\tilde{\beta})\, c = c^T \mathrm{Cov}(\hat{\beta})\, c + \sigma^2 \|D^T c\|^2 \ge \mathrm{Var}(c^T \hat{\beta})$$
Equality holds if and only if $D^T c = 0$. If we require equality for every $c$, this forces $D = 0$, i.e. $\tilde{\beta} = \hat{\beta}$. □
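The covariance decomposition and the resulting variance inequality can also be checked numerically; the sketch below compares OLS against a weighted-least-squares alternative (the weights and dimensions are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 100, 3, 1.0
X = rng.normal(size=(n, p))
A = np.linalg.inv(X.T @ X) @ X.T                 # OLS: beta_hat = A Y
W = np.diag(rng.uniform(0.5, 2.0, size=n))       # arbitrary positive weights
C = np.linalg.inv(X.T @ W @ X) @ X.T @ W         # WLS: another linear unbiased estimator
D = C - A

cov_ols = sigma**2 * A @ A.T
cov_wls = sigma**2 * C @ C.T

# Decomposition from Step 3: Cov(beta_tilde) = Cov(beta_hat) + sigma^2 D D^T
assert np.allclose(cov_wls, cov_ols + sigma**2 * D @ D.T)

# Gauss-Markov inequality: Var(c^T beta_tilde) >= Var(c^T beta_hat) for any direction c
for _ in range(5):
    c = rng.normal(size=p)
    assert c @ cov_wls @ c >= c @ cov_ols @ c - 1e-9
```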