
Regression Diagnostics and Assumptions

The validity of regression inference depends on several assumptions about the error structure. Diagnostic tools detect violations and guide model improvements.


The Assumptions

Definition

The classical linear regression assumptions (Gauss-Markov conditions) are:

  1. Linearity: $E[Y_i \mid \mathbf{x}_i] = \mathbf{x}_i^T \boldsymbol{\beta}$ (correct functional form)
  2. Independence: the errors $\epsilon_i$ are independent
  3. Homoscedasticity: $\operatorname{Var}(\epsilon_i) = \sigma^2$ (constant variance)
  4. Normality (for inference): $\epsilon_i \sim N(0, \sigma^2)$

Assumption 4 is needed for exact $t$-tests and $F$-tests; assumptions 1-3 suffice for the Gauss-Markov theorem.
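The Gauss-Markov point can be checked by simulation. The sketch below (a hypothetical setup, not from the text) draws errors from a uniform distribution, which satisfies assumptions 1-3 but deliberately violates assumption 4, and shows that OLS estimates are still unbiased:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 60, 2000
beta_true = np.array([1.0, -2.0])
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])  # intercept plus one predictor

# Monte Carlo: errors are uniform (mean 0, constant variance,
# independent), so assumptions 1-3 hold but normality does not.
estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.uniform(-1, 1, n)
    y = X @ beta_true + eps
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

print(estimates.mean(axis=0))  # averages close to [1.0, -2.0]
```

Exact $t$- and $F$-distributions for test statistics would still fail here in small samples; only unbiasedness (and minimum variance among linear unbiased estimators) survives without normality.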


Diagnostic Plots

Definition

Key residual diagnostics:

  1. Residuals vs. fitted values: plots $e_i$ vs. $\hat{y}_i$. Patterns indicate nonlinearity or heteroscedasticity.
  2. Normal Q-Q plot: plots ordered residuals vs. theoretical normal quantiles. Departures from the line indicate non-normality.
  3. Scale-location plot: plots $\sqrt{|e_i|}$ vs. $\hat{y}_i$. A trend indicates heteroscedasticity.
  4. Leverage plot: the leverage $h_{ii} = [\mathbf{H}]_{ii}$ measures how far $\mathbf{x}_i$ is from the center of the predictor space. Points with $h_{ii} > 2p/n$ are considered high leverage.
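The leverage computation in item 4 can be sketched directly from the hat matrix. This is a minimal illustration with a synthetic design matrix (the extreme point at index 0 is an assumption added for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3  # n observations, p columns including the intercept

# Design matrix: intercept plus two predictors, with one point
# pushed far from the center of predictor space.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X[0, 1:] = [6.0, -6.0]  # artificially extreme predictor values

# Hat matrix H = X (X^T X)^{-1} X^T; leverages are its diagonal.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Rule of thumb from the text: flag h_ii > 2p/n.
flagged = np.where(leverage > 2 * p / n)[0]
print(flagged)         # index 0 should appear among the flagged points
print(leverage.sum())  # the leverages always sum to p
```

Note that $\sum_i h_{ii} = \operatorname{tr}(\mathbf{H}) = p$, which is why $2p/n$ (twice the average leverage) is a natural cutoff.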
Example: Detecting heteroscedasticity

If residuals fan out (variance increases with $\hat{y}$), the homoscedasticity assumption is violated. Common remedies include:

  • Weighted least squares with weights $w_i = 1/\hat{\sigma}_i^2$
  • A variance-stabilizing transformation (e.g., $\log Y$ or $\sqrt{Y}$)
  • Heteroscedasticity-robust standard errors (White/sandwich estimator)
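The weighted least squares remedy can be sketched as follows. This assumes the variance function is known, which is rarely true in practice (it would normally be estimated, e.g. from a regression of squared residuals):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

# Simulated heteroscedastic data: error SD grows with x.
sigma_i = 0.5 * x
y = 2.0 + 1.5 * x + rng.normal(scale=sigma_i)

# Ordinary least squares ignores the unequal variances.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares with w_i = 1/sigma_i^2:
# solve the normal equations (X^T W X) beta = X^T W y.
w = 1.0 / sigma_i**2
XtW = X.T * w  # scales each row of X^T by the weights
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

print(beta_ols)   # both estimators are unbiased...
print(beta_wls)   # ...but WLS has smaller variance here
```

Both fits target the same coefficients; the gain from WLS is efficiency, since observations with small error variance get more say in the fit.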

Influential Points

Remark: Cook's distance and outlier detection

Cook's distance

$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})^T(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})}{p\hat{\sigma}^2}$$

measures the influence of the $i$-th observation on all fitted values, where $\hat{\mathbf{Y}}_{(i)}$ is the fit with observation $i$ removed. Points with $D_i > 4/n$ or $D_i > 1$ are considered influential. Studentized residuals $t_i = e_i / (\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}})$ follow a $t_{n-p-1}$ distribution and flag outliers when $|t_i| > 3$.
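Neither quantity requires actually refitting the model $n$ times: Cook's distance has the closed form $D_i = e_i^2 h_{ii} / \bigl(p\hat{\sigma}^2 (1-h_{ii})^2\bigr)$, and $\hat{\sigma}_{(i)}^2$ has a leave-one-out identity. A sketch on synthetic data with one injected outlier (the outlier at index 5 is an assumption of the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 2
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[5] += 8.0  # inject one gross outlier
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y              # residuals
sigma2 = e @ e / (n - p)   # usual variance estimate

# Cook's distance via the leverage identity (no refitting):
# D_i = e_i^2 h_ii / (p * sigma^2 * (1 - h_ii)^2)
D = e**2 * h / (p * sigma2 * (1 - h)**2)

# Externally studentized residuals use sigma_(i)^2, the variance
# estimate with observation i removed (closed-form update):
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(sigma2_i * (1 - h))

print(np.argmax(D))         # the injected outlier dominates
print(np.abs(t).max() > 3)  # and is flagged by |t_i| > 3
```

Using the external estimate $\hat{\sigma}_{(i)}$ rather than $\hat{\sigma}$ matters: a gross outlier inflates $\hat{\sigma}$ and would otherwise partially mask itself.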