Regression & Correlation · Topic #31 of 33

Simple Linear Regression

Modeling relationships: least squares method, regression equation, and interpreting slope and intercept.

Overview

Simple linear regression models the relationship between two variables by fitting a straight line to the data. It predicts the value of a dependent variable ($Y$) based on an independent variable ($X$).

The Model

$$\hat{y} = \beta_0 + \beta_1 x$$

Or equivalently:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Where:

  • $\hat{y}$ = predicted value of $Y$
  • $\beta_0$ = y-intercept
  • $\beta_1$ = slope
  • $\varepsilon$ = error term
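To see what the model and its error term mean in practice, here is a minimal NumPy sketch that simulates data from $Y = \beta_0 + \beta_1 X + \varepsilon$ and fits a line back to it. The parameter values ($\beta_0 = 2$, $\beta_1 = 3$) are illustrative, not from the text:

```python
import numpy as np

# Simulate data from Y = beta_0 + beta_1 * X + epsilon
# (beta_0 = 2, beta_1 = 3 are illustrative values, not from the text).
rng = np.random.default_rng(0)
beta_0, beta_1 = 2.0, 3.0
x = np.linspace(0, 10, 100)
epsilon = rng.normal(0, 1, size=x.size)   # error term
y = beta_0 + beta_1 * x + epsilon

# Fit a straight line; the estimates should land close to the true parameters.
b1_hat, b0_hat = np.polyfit(x, y, 1)
print(round(b0_hat, 2), round(b1_hat, 2))
```

With 100 noisy points, the fitted intercept and slope recover the true values to within a few tenths.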

Least Squares Method

Minimizes the sum of squared residuals:

$$SSE = \sum (y_i - \hat{y}_i)^2$$

Formulas

Slope ($\beta_1$)

$$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

Or:

$$\beta_1 = r \times \frac{s_Y}{s_X}$$

Or:

$$\beta_1 = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2}$$

Intercept ($\beta_0$)

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$
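The three slope formulas above are algebraically equivalent, which is easy to check numerically. A minimal NumPy sketch (the data are illustrative, not from the text):

```python
import numpy as np

# Illustrative data (not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

xbar, ybar = x.mean(), y.mean()

# Deviation form of the slope
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)

# Equivalent form: r * sY / sX (sample standard deviations, ddof=1)
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * y.std(ddof=1) / x.std(ddof=1)

# Equivalent computational formula
n = len(x)
b1_comp = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum() ** 2)

b0 = ybar - b1 * xbar
print(b1, b1_alt, b1_comp, b0)  # all three slopes agree
```

All three expressions give the same slope (0.8 for this data), and the intercept follows from $\beta_0 = \bar{y} - \beta_1 \bar{x}$.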

Interpretation

Slope ($\beta_1$)

For each 1-unit increase in $X$, $Y$ is predicted to change by $\beta_1$ units.

Intercept ($\beta_0$)

The predicted value of $Y$ when $X = 0$. (May not be meaningful if $X = 0$ is outside the data range.)

Sum of Squares

$$SST = \sum (y_i - \bar{y})^2 \quad \text{(Total variation)}$$
$$SSR = \sum (\hat{y}_i - \bar{y})^2 \quad \text{(Explained variation)}$$
$$SSE = \sum (y_i - \hat{y}_i)^2 \quad \text{(Unexplained variation)}$$
$$SST = SSR + SSE$$
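The decomposition $SST = SSR + SSE$ holds exactly for any least-squares line with an intercept. A minimal sketch that verifies it on illustrative data (not from the text):

```python
import numpy as np

# Verify SST = SSR + SSE on illustrative data (not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # explained variation
sse = np.sum((y - y_hat) ** 2)           # unexplained variation
print(sst, ssr, sse)                     # sst equals ssr + sse
```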

Standard Error of the Estimate

$$s = \sqrt{\frac{SSE}{n - 2}}$$

Measures typical prediction error.
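A minimal sketch computing the standard error of the estimate for a fitted line, on illustrative data (not from the text); the divisor is $n - 2$ because two parameters (slope and intercept) are estimated:

```python
import numpy as np

# Standard error of the estimate: s = sqrt(SSE / (n - 2)).
# Illustrative data (not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)
s = np.sqrt(sse / (len(x) - 2))
print(s)  # typical size of a prediction error, in units of y
```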

Assumptions

  1. Linearity: Relationship is linear
  2. Independence: Observations are independent
  3. Normality: Errors are normally distributed
  4. Homoscedasticity: Constant variance of errors

Example

Data: Hours studied ($X$) vs Test score ($Y$)

| $x$ | $y$ |
|-----|-----|
| 1   | 50  |
| 2   | 55  |
| 3   | 60  |
| 4   | 65  |
| 5   | 70  |

$$\bar{x} = 3, \quad \bar{y} = 60$$
$$\sum (x_i - \bar{x})(y_i - \bar{y}) = (-2)(-10) + (-1)(-5) + 0 + (1)(5) + (2)(10) = 50$$
$$\sum (x_i - \bar{x})^2 = 4 + 1 + 0 + 1 + 4 = 10$$
$$\beta_1 = \frac{50}{10} = 5$$
$$\beta_0 = 60 - 5(3) = 45$$
$$\text{Regression equation: } \hat{y} = 45 + 5x$$

Interpretation: Each additional hour of study increases the predicted score by 5 points.
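The hand calculation above can be checked in a few lines of NumPy:

```python
import numpy as np

# Worked example from the text: hours studied vs test score.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([50, 55, 60, 65, 70], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)  # 5.0 45.0, matching the hand calculation
```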

Making Predictions

For $x = 4$ hours:

$$\hat{y} = 45 + 5(4) = 65$$

Extrapolation Warning

Don't predict outside the range of observed $X$ values.

Residuals

$$e_i = y_i - \hat{y}_i$$

Residuals should be:

  • Random (no pattern)
  • Centered around 0
  • Constant spread (homoscedasticity)
  • Normally distributed
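The "centered around 0" property is more than a rule of thumb: when a least-squares line includes an intercept, the residuals sum to exactly zero (up to floating-point error). A minimal sketch on illustrative data (not from the text):

```python
import numpy as np

# Least-squares residuals always sum to (numerically) zero
# when an intercept is fitted. Illustrative data, not from the text.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
print(residuals.sum())  # ~0
```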

Residual Plots

Plot residuals vs predicted values or $X$:

  • Look for patterns
  • Check for non-constant variance
  • Identify outliers