
Simple Linear Regression

Linear regression models the relationship between a response variable and one or more explanatory variables, and is one of the most fundamental tools in statistical modeling and prediction.


The Model

Definition

The simple linear regression model posits that the response $Y_i$ is related to the predictor $x_i$ by
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,$$
where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon_1, \ldots, \epsilon_n$ are independent errors with $E[\epsilon_i] = 0$ and $\operatorname{Var}(\epsilon_i) = \sigma^2$. The parameters $\beta_0$, $\beta_1$, and $\sigma^2$ are unknown.
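As a minimal sketch, data can be simulated from this model; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ below are assumptions chosen for illustration, not values from the text:

```python
import numpy as np

# Simulate n observations from Y_i = beta0 + beta1*x_i + eps_i.
# The parameter values here are illustrative assumptions.
rng = np.random.default_rng(0)
n = 100
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, size=n)        # predictors (treated as fixed)
eps = rng.normal(0, sigma, size=n)    # independent errors: mean 0, variance sigma^2
y = beta0 + beta1 * x + eps           # response
```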

Definition

The ordinary least squares (OLS) estimators minimize the sum of squared residuals:
$$(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{b_0, b_1} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2.$$
The solution is
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
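The closed-form solution translates directly into code. A sketch with a small made-up data set (the numbers are an assumption for illustration):

```python
import numpy as np

# Toy data, assumed for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)            # sum of squared deviations of x
S_xy = np.sum((x - x_bar) * (y - y_bar))   # cross-deviation sum

beta1_hat = S_xy / S_xx                    # slope: S_xy / S_xx
beta0_hat = y_bar - beta1_hat * x_bar      # intercept: y_bar - beta1_hat * x_bar
```

The estimates agree with any standard least-squares routine (e.g. `np.polyfit(x, y, 1)`), since both minimize the same criterion.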


Properties of OLS Estimators

Example: Unbiasedness and variance

Under the model assumptions:

  • $E[\hat{\beta}_1] = \beta_1$ and $E[\hat{\beta}_0] = \beta_0$ (unbiased)
  • $\operatorname{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{xx}} = \dfrac{\sigma^2}{\sum(x_i - \bar{x})^2}$
  • $\operatorname{Var}(\hat{\beta}_0) = \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right)$
  • The unbiased estimator of $\sigma^2$ is $\hat{\sigma}^2 = \dfrac{1}{n-2}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \dfrac{SSE}{n-2}$
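These properties can be checked by Monte Carlo simulation: fit OLS repeatedly on data drawn from the model and average the estimates. The true values ($\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 0.5$) are assumptions for this sketch:

```python
import numpy as np

# Monte Carlo check of unbiasedness: average estimates over many
# simulated data sets.  True parameter values are assumed for the demo.
rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 0.5
n, reps = 30, 2000

x = np.linspace(0, 1, n)               # fixed design
x_bar = x.mean()
S_xx = np.sum((x - x_bar) ** 2)

b1_hats, s2_hats = [], []
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / S_xx
    b0 = y.mean() - b1 * x_bar
    resid = y - (b0 + b1 * x)
    b1_hats.append(b1)
    s2_hats.append(np.sum(resid ** 2) / (n - 2))   # SSE / (n - 2)
```

The averages of `b1_hats` and `s2_hats` settle near $\beta_1 = 2$ and $\sigma^2 = 0.25$, consistent with unbiasedness; note the $n - 2$ divisor, which accounts for the two estimated coefficients.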

Coefficient of Determination

Remark: $R^2$ and goodness of fit

The coefficient of determination is
$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST},$$
where $SST = \sum(y_i - \bar{y})^2$ (total), $SSR = \sum(\hat{y}_i - \bar{y})^2$ (regression), and $SSE = \sum(y_i - \hat{y}_i)^2$ (error). In simple linear regression it equals the square of the sample correlation: $R^2 = r_{xy}^2$. While $R^2$ measures the proportion of variance explained, it never decreases as predictors are added, making the adjusted $R^2 = 1 - \frac{n-1}{n-p-1}(1 - R^2)$ (with $p$ predictors) more appropriate for model comparison.
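The decomposition $SST = SSR + SSE$ and the identity $R^2 = r_{xy}^2$ can be verified numerically; the toy data below are an assumption for illustration:

```python
import numpy as np

# Verify the variance decomposition and R^2 = r_xy^2 on toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                    # fitted values

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares

R2 = 1 - SSE / SST                     # equals SSR/SST
r_xy = np.corrcoef(x, y)[0, 1]         # sample correlation
```

Here `SST` equals `SSR + SSE` up to floating-point error, and `R2` matches `r_xy ** 2`, illustrating both identities from the remark.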