
Simple Linear Regression

Linear regression models the relationship between a response variable and one or more explanatory variables, and is one of the most fundamental tools in statistical modeling and prediction.


The Model

Definition

The simple linear regression model posits that the response $Y_i$ is related to the predictor $x_i$ by
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,$$
where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon_1, \ldots, \epsilon_n$ are independent errors with $E[\epsilon_i] = 0$ and $\operatorname{Var}(\epsilon_i) = \sigma^2$. The parameters $\beta_0$, $\beta_1$, and $\sigma^2$ are unknown.
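As a minimal sketch, data can be simulated from this model; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ below are assumptions chosen for illustration, not values from the text:

```python
import numpy as np

# Simulate n observations from Y_i = beta0 + beta1*x_i + eps_i.
# The parameter values here are illustrative assumptions.
rng = np.random.default_rng(0)
n = 100
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, size=n)        # predictors (treated as fixed)
eps = rng.normal(0, sigma, size=n)    # independent errors: mean 0, variance sigma^2
y = beta0 + beta1 * x + eps           # response
```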

Definition

The ordinary least squares (OLS) estimators minimize the sum of squared residuals:
$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{b_0, b_1} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2.$$
The solution is
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
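The closed-form solution translates directly into code. A sketch with a small made-up data set (the numbers are an assumption for illustration):

```python
import numpy as np

# Toy data, assumed for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)            # sum of squared deviations of x
S_xy = np.sum((x - x_bar) * (y - y_bar))   # cross-deviation sum

beta1_hat = S_xy / S_xx                    # slope: S_xy / S_xx
beta0_hat = y_bar - beta1_hat * x_bar      # intercept: y_bar - beta1_hat * x_bar
```

The estimates agree with any standard least-squares routine (e.g. `np.polyfit(x, y, 1)`), since both minimize the same criterion.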


Properties of OLS Estimators

Example: Unbiasedness and variance

Under the model assumptions:

  • $E[\hat{\beta}_1] = \beta_1$ and $E[\hat{\beta}_0] = \beta_0$ (unbiased)
  • $\operatorname{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{xx}} = \dfrac{\sigma^2}{\sum(x_i - \bar{x})^2}$
  • $\operatorname{Var}(\hat{\beta}_0) = \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right)$
  • The unbiased estimator of $\sigma^2$ is $\hat{\sigma}^2 = \dfrac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \dfrac{SSE}{n-2}$
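These properties can be checked by Monte Carlo simulation: fit OLS repeatedly on data drawn from the model and average the estimates. The true values ($\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 0.5$) are assumptions for this sketch:

```python
import numpy as np

# Monte Carlo check of unbiasedness: average estimates over many
# simulated data sets.  True parameter values are assumed for the demo.
rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 0.5
n, reps = 30, 2000

x = np.linspace(0, 1, n)               # fixed design
x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)

b1_hats, s2_hats = [], []
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / S_xx
    b0 = y.mean() - b1 * x_bar
    resid = y - (b0 + b1 * x)
    b1_hats.append(b1)
    s2_hats.append(np.sum(resid ** 2) / (n - 2))   # SSE / (n - 2)
```

The averages of `b1_hats` and `s2_hats` settle near $\beta_1 = 2$ and $\sigma^2 = 0.25$, consistent with unbiasedness; note the $n - 2$ divisor, which accounts for the two estimated coefficients.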

Coefficient of Determination

Remark: $R^2$ and goodness of fit

The coefficient of determination is
$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST},$$
where $SST = \sum(y_i - \bar{y})^2$ (total), $SSR = \sum(\hat{y}_i - \bar{y})^2$ (regression), and $SSE = \sum(y_i - \hat{y}_i)^2$ (error). In simple linear regression it equals the square of the sample correlation: $R^2 = r_{xy}^2$. While $R^2$ measures the proportion of variance explained, it never decreases as predictors are added, making the adjusted $R^2 = 1 - \frac{n-1}{n-p-1}(1 - R^2)$ (with $p$ predictors) more appropriate for model comparison.
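The decomposition $SST = SSR + SSE$ and the identity $R^2 = r_{xy}^2$ can be verified numerically; the toy data below are an assumption for illustration:

```python
import numpy as np

# Verify the variance decomposition and R^2 = r_xy^2 on toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                    # fitted values

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares

R2 = 1 - SSE / SST                     # equals SSR/SST
r_xy = np.corrcoef(x, y)[0, 1]         # sample correlation
```

Here `SST` equals `SSR + SSE` up to floating-point error, and `R2` matches `r_xy ** 2`, illustrating both identities from the remark.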