
Regression Diagnostics and Assumptions

The validity of regression inference depends on several assumptions about the error structure. Diagnostic tools detect violations and guide model improvements.


The Assumptions

Definition

The classical linear regression assumptions (Gauss-Markov conditions) are:

  1. Linearity: $E[Y_i \mid \mathbf{x}_i] = \mathbf{x}_i^T \boldsymbol{\beta}$ (correct functional form)
  2. Independence: the errors $\epsilon_i$ are independent
  3. Homoscedasticity: $\operatorname{Var}(\epsilon_i) = \sigma^2$ (constant variance)
  4. Normality (for inference): $\epsilon_i \sim N(0, \sigma^2)$

Assumption 4 is needed for exact $t$-tests and $F$-tests; assumptions 1-3 suffice for the Gauss-Markov theorem.
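The Gauss-Markov point can be checked by simulation. The sketch below (a hypothetical setup, not from the text) draws errors from a uniform distribution, which satisfies assumptions 1-3 but deliberately violates assumption 4, and shows that OLS estimates are still unbiased:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 60, 2000
beta_true = np.array([1.0, -2.0])
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])  # intercept plus one predictor

# Monte Carlo: errors are uniform (mean 0, constant variance,
# independent), so assumptions 1-3 hold but normality does not.
estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.uniform(-1, 1, n)
    y = X @ beta_true + eps
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

print(estimates.mean(axis=0))  # averages close to [1.0, -2.0]
```

Exact $t$- and $F$-distributions for test statistics would still fail here in small samples; only unbiasedness (and minimum variance among linear unbiased estimators) survives without normality.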


Diagnostic Plots

Definition

Key residual diagnostics:

  1. Residuals vs. fitted values: plots $e_i$ vs. $\hat{y}_i$. Patterns indicate nonlinearity or heteroscedasticity.
  2. Normal Q-Q plot: plots ordered residuals vs. theoretical normal quantiles. Departures from the line indicate non-normality.
  3. Scale-location plot: plots $\sqrt{|e_i|}$ vs. $\hat{y}_i$. A trend indicates heteroscedasticity.
  4. Leverage plot: the leverage $h_{ii} = [\mathbf{H}]_{ii}$ measures how far $\mathbf{x}_i$ is from the center of the predictor space. Points with $h_{ii} > 2p/n$ are considered high leverage.
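The leverage computation in item 4 can be sketched directly from the hat matrix. This is a minimal illustration with a synthetic design matrix (the extreme point at index 0 is an assumption added for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3  # n observations, p columns including the intercept

# Design matrix: intercept plus two predictors, with one point
# pushed far from the center of predictor space.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X[0, 1:] = [6.0, -6.0]  # artificially extreme predictor values

# Hat matrix H = X (X^T X)^{-1} X^T; leverages are its diagonal.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# Rule of thumb from the text: flag h_ii > 2p/n.
flagged = np.where(leverage > 2 * p / n)[0]
print(flagged)         # index 0 should appear among the flagged points
print(leverage.sum())  # the leverages always sum to p
```

Note that $\sum_i h_{ii} = \operatorname{tr}(\mathbf{H}) = p$, which is why $2p/n$ (twice the average leverage) is a natural cutoff.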
Example: Detecting heteroscedasticity

If residuals fan out (variance increases with $\hat{y}$), the homoscedasticity assumption is violated. Common remedies include:

  • Weighted least squares with weights $w_i = 1/\hat{\sigma}_i^2$
  • A variance-stabilizing transformation (e.g., $\log Y$ or $\sqrt{Y}$)
  • Heteroscedasticity-robust standard errors (White/sandwich estimator)
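The weighted least squares remedy can be sketched as follows. This assumes the variance function is known, which is rarely true in practice (it would normally be estimated, e.g. from a regression of squared residuals):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

# Simulated heteroscedastic data: error SD grows with x.
sigma_i = 0.5 * x
y = 2.0 + 1.5 * x + rng.normal(scale=sigma_i)

# Ordinary least squares ignores the unequal variances.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares with w_i = 1/sigma_i^2:
# solve the normal equations (X^T W X) beta = X^T W y.
w = 1.0 / sigma_i**2
XtW = X.T * w  # scales each row of X^T by the weights
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

print(beta_ols)   # both estimators are unbiased...
print(beta_wls)   # ...but WLS has smaller variance here
```

Both fits target the same coefficients; the gain from WLS is efficiency, since observations with small error variance get more say in the fit.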

Influential Points

Remark: Cook's distance and outlier detection

Cook's distance

$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})^T(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})}{p\hat{\sigma}^2}$$

measures the influence of the $i$-th observation on all fitted values, where $\hat{\mathbf{Y}}_{(i)}$ is the fit with observation $i$ removed. Points with $D_i > 4/n$ or $D_i > 1$ are considered influential. Studentized residuals $t_i = e_i / (\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}})$ follow a $t_{n-p-1}$ distribution and flag outliers when $|t_i| > 3$.
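Neither quantity requires actually refitting the model $n$ times: Cook's distance has the closed form $D_i = e_i^2 h_{ii} / \bigl(p\hat{\sigma}^2 (1-h_{ii})^2\bigr)$, and $\hat{\sigma}_{(i)}^2$ has a leave-one-out identity. A sketch on synthetic data with one injected outlier (the outlier at index 5 is an assumption of the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 2
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[5] += 8.0  # inject one gross outlier
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y              # residuals
sigma2 = e @ e / (n - p)   # usual variance estimate

# Cook's distance via the leverage identity (no refitting):
# D_i = e_i^2 h_ii / (p * sigma^2 * (1 - h_ii)^2)
D = e**2 * h / (p * sigma2 * (1 - h)**2)

# Externally studentized residuals use sigma_(i)^2, the variance
# estimate with observation i removed (closed-form update):
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(sigma2_i * (1 - h))

print(np.argmax(D))         # the injected outlier dominates
print(np.abs(t).max() > 3)  # and is flagged by |t_i| > 3
```

Using the external estimate $\hat{\sigma}_{(i)}$ rather than $\hat{\sigma}$ matters: a gross outlier inflates $\hat{\sigma}$ and would otherwise partially mask itself.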