Regression Diagnostics and Assumptions
The validity of regression inference depends on several assumptions about the error structure. Diagnostic tools detect violations and guide model improvements.
The Assumptions
Definition
The classical linear regression assumptions (Gauss-Markov conditions) are:
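- Linearity: $E[Y \mid X] = X\beta$ (correct functional form)
- Independence: the errors $\varepsilon_i$ are independent
- Homoscedasticity: $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$ (constant variance)
- Normality (for inference): $\varepsilon_i \sim N(0, \sigma^2)$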
Assumption 4 (normality) is needed for exact $t$-tests and $F$-tests; assumptions 1-3 suffice for the Gauss-Markov theorem.
Diagnostic Plots
Definition
Key residual diagnostics:
- Residuals vs. fitted values: plots the residuals $e_i = y_i - \hat{y}_i$ vs. the fitted values $\hat{y}_i$. Patterns indicate nonlinearity or heteroscedasticity.
- Normal Q-Q plot: plots ordered residuals vs. theoretical normal quantiles. Departures from the line indicate non-normality.
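- Scale-location plot: plots $\sqrt{|r_i|}$ vs. $\hat{y}_i$, where $r_i$ is the standardized residual. A trend indicates heteroscedasticity.
- Leverage plot: the leverage $h_{ii}$, the $i$-th diagonal entry of the hat matrix $H = X(X^\top X)^{-1}X^\top$, measures how far $x_i$ is from the center of the predictor space. By a common rule of thumb, high leverage points have $h_{ii} > 2p/n$, where $p$ is the number of estimated coefficients and $n$ the number of observations.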
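The quantities behind these four plots can be computed directly from the design matrix. The following is a minimal sketch, assuming NumPy is available; the function name `ols_diagnostics`, the toy data, the plotting positions $(i - 0.5)/n$ for the Q-Q plot, and the $2p/n$ leverage cutoff are illustrative choices rather than part of the definitions above.

```python
from statistics import NormalDist

import numpy as np


def ols_diagnostics(X, y):
    """Return fitted values, residuals, leverages, and standardized residuals.

    X is an (n, p) design matrix that already contains an intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    # Leverages are the diagonal of H = X (X^T X)^{-1} X^T.
    # With a thin QR factorization X = QR, H = Q Q^T, so h_ii = sum_k Q[i, k]^2.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q**2, axis=1)
    sigma2 = resid @ resid / (n - p)  # unbiased estimate of the error variance
    std_resid = resid / np.sqrt(sigma2 * (1.0 - leverage))
    return fitted, resid, leverage, std_resid


# Made-up data for illustration: the last x value lies far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 41.0])
X = np.column_stack([np.ones_like(x), x])

fitted, resid, leverage, std_resid = ols_diagnostics(X, y)
n, p = X.shape

# Residuals vs. fitted: plot resid against fitted.
# Scale-location: plot sqrt(|standardized residual|) against fitted.
scale_location = np.sqrt(np.abs(std_resid))

# Normal Q-Q: ordered standardized residuals against normal quantiles.
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(q) for q in probs])
ordered = np.sort(std_resid)

# Leverage: flag points above the 2p/n rule of thumb.
high_leverage = np.flatnonzero(leverage > 2 * p / n)
```

Each pair of arrays (`fitted` and `resid`, `theoretical` and `ordered`, `fitted` and `scale_location`) can be handed to any plotting library; the leverages always sum to $p$, which is a quick sanity check on the computation.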
Example: Detecting heteroscedasticity
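If residuals fan out (variance increases with $\hat{y}$), the homoscedasticity assumption is violated. Common remedies include:
- Weighted least squares with weights $w_i = 1/\hat{\sigma}_i^2$, where $\hat{\sigma}_i^2$ estimates the variance of the $i$-th error
- Variance-stabilizing transformation (e.g., $\log Y$ or $\sqrt{Y}$)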
- Heteroscedasticity-robust standard errors (White/sandwich estimator)
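Two of these remedies are short enough to sketch in code. The example below assumes NumPy; the helper names `wls` and `hc0_standard_errors` are hypothetical, the data are simulated, and the weights $1/x^2$ rest on the assumption that the error standard deviation grows proportionally to $x$. The sandwich estimator shown is the basic HC0 form, $(X^\top X)^{-1} X^\top \operatorname{diag}(e_i^2) X (X^\top X)^{-1}$, without small-sample corrections.

```python
import numpy as np


def wls(X, y, w):
    """Weighted least squares: minimize sum_i w_i * (y_i - x_i^T beta)^2."""
    sw = np.sqrt(w)
    # Scaling each row by sqrt(w_i) turns WLS into an ordinary least squares problem.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


def hc0_standard_errors(X, y):
    """OLS coefficients with White (HC0) sandwich standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * resid[:, None] ** 2)  # X^T diag(e_i^2) X
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))


# Simulated data whose noise standard deviation grows with x (fan-shaped residuals).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
y = 1.0 + 2.0 * x + 0.3 * x * rng.standard_normal(x.size)
X = np.column_stack([np.ones_like(x), x])

w = 1.0 / x**2  # assumed variance model: Var(error_i) proportional to x_i^2
beta_wls = wls(X, y, w)
beta_ols, se_hc0 = hc0_standard_errors(X, y)
```

Weighted least squares changes the coefficient estimates and is efficient when the variance model is right, whereas robust standard errors keep the OLS coefficients and only correct the uncertainty estimates.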
Influential Points
Remark: Cook's distance and outlier detection
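Cook's distance $D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p\,\hat{\sigma}^2}$ measures the influence of the $i$-th observation on all fitted values, where $\hat{y}_{j(i)}$ is the fit with observation $i$ removed, $p$ is the number of estimated coefficients, and $\hat{\sigma}^2$ is the residual variance estimate. Points with $D_i > 1$ or $D_i > 4/n$ are considered influential. Studentized residuals $t_i$ follow a $t_{n-p-1}$ distribution and flag outliers when $|t_i| > 2$ (some authors use a stricter cutoff of 3).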
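Neither quantity requires refitting the model $n$ times, because both have closed forms in terms of the residuals and leverages. The sketch below assumes NumPy; the function name `influence_measures` and the data are made up for illustration, and the cutoffs $4/n$ and $2$ are conventional rules of thumb rather than fixed requirements. It uses the identities $D_i = r_i^2 h_{ii} / (p(1 - h_{ii}))$ and $t_i = r_i \sqrt{(n - p - 1)/(n - p - r_i^2)}$, where $r_i$ is the standardized residual and $h_{ii}$ the leverage.

```python
import numpy as np


def influence_measures(X, y):
    """Return Cook's distances and externally studentized residuals for OLS.

    X is an (n, p) design matrix that already contains an intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                 # leverages h_ii
    s2 = resid @ resid / (n - p)             # residual variance estimate
    r = resid / np.sqrt(s2 * (1.0 - h))      # standardized residuals
    cooks_d = r**2 * h / (p * (1.0 - h))     # equals the leave-one-out definition
    t = r * np.sqrt((n - p - 1) / (n - p - r**2))  # studentized (observation deleted)
    return cooks_d, t


# Made-up data: a near-perfect line with the last response shifted upward by 5.
x = np.arange(1.0, 11.0)
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1])
y = 2.0 * x + noise
y[-1] += 5.0
X = np.column_stack([np.ones_like(x), x])

cooks_d, t = influence_measures(X, y)
n = len(y)
influential = np.flatnonzero(cooks_d > 4 / n)  # rule-of-thumb cutoff (assumed)
outliers = np.flatnonzero(np.abs(t) > 2)       # rule-of-thumb cutoff (assumed)
```

Comparing `cooks_d` against a brute-force loop that drops each observation and refits is a useful check when adapting this to other models.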