Regression & Correlation · Topic #32 of 33

Coefficient of Determination

R-squared: measuring goodness of fit, explained vs unexplained variance.

Overview

The coefficient of determination, $R^2$, measures the proportion of variance in the dependent variable that is predictable from the independent variable(s).

Definition

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

Where:

  • $SSR$ = Sum of Squares Regression (explained)
  • $SSE$ = Sum of Squares Error (unexplained)
  • $SST$ = Sum of Squares Total
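As a quick check on the decomposition above, here is a minimal sketch in Python (the SSR/SSE/SST values are hypothetical, chosen so that $SST = SSR + SSE$):

```python
def r_squared(ss_regression, ss_error, ss_total):
    """Coefficient of determination from sums of squares (R^2 = SSR/SST)."""
    # Sanity check: SST should decompose as SSR + SSE
    assert abs(ss_total - (ss_regression + ss_error)) < 1e-9
    return ss_regression / ss_total

# Hypothetical decomposition: SST = 100, of which SSR = 80 is explained
print(r_squared(80, 20, 100))  # 0.8
```

Equivalently, `1 - 20 / 100` gives the same value via the $1 - SSE/SST$ form.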

Relationship to Correlation

For simple linear regression:

$$R^2 = r^2$$

Where $r$ is the Pearson correlation coefficient.
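The identity $R^2 = r^2$ can be verified numerically. The sketch below fits a least-squares line from scratch and compares $1 - SSE/SST$ with the squared Pearson correlation; the data points are made up for illustration:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r2_from_fit(x, y):
    """R^2 = 1 - SSE/SST for a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x, with noise
print(abs(r2_from_fit(x, y) - pearson_r(x, y) ** 2) < 1e-9)  # True
```

The agreement is exact up to floating-point rounding, as the identity predicts for simple linear regression.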

Interpretation

| $R^2$ Value | Interpretation |
| --- | --- |
| 0% | Model explains none of the variance |
| 25% | Model explains 25% of the variance |
| 50% | Model explains 50% of the variance |
| 75% | Model explains 75% of the variance |
| 100% | Model explains all of the variance |

Example Statement

"$R^2 = 0.64$ means that 64% of the variation in $Y$ is explained by the linear relationship with $X$."

Properties

  • Range: $0 \leq R^2 \leq 1$
  • Non-negative for least-squares regression with an intercept (can be 0 but never negative)
  • Higher $R^2$ indicates a better fit to the sample data
  • Dimensionless (no units)

Adjusted $R^2$

For multiple regression, adjusted $R^2$ accounts for the number of predictors:

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

Where:

  • $n$ = sample size
  • $k$ = number of predictors
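A minimal sketch of the adjusted-$R^2$ formula, using hypothetical values for $R^2$, $n$, and $k$:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical model: R^2 = 0.75 with n = 30 observations and k = 3 predictors
print(round(adjusted_r2(0.75, 30, 3), 4))  # 0.7212
```

Note that the adjustment always pulls the value down (when $k \geq 1$), and the gap shrinks as $n$ grows relative to $k$.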

Why Adjusted?

  • Regular $R^2$ never decreases when more predictors are added
  • Adjusted $R^2$ penalizes unnecessary variables
  • Adjusted $R^2$ can decrease if a new variable doesn't improve the fit enough
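The penalty can be seen numerically. In this hypothetical scenario, adding a noise predictor nudges $R^2$ up only slightly, yet adjusted $R^2$ falls:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 30
# Hypothetical: a noise predictor raises R^2 from 0.750 to only 0.751
before = adjusted_r2(0.750, n, k=3)
after = adjusted_r2(0.751, n, k=4)
print(before > after)  # True: the penalty outweighs the tiny gain
```

This is exactly the behavior the bullets above describe: the extra predictor "improves" raw $R^2$ but worsens the adjusted figure.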

Comparison

| $R^2$ | Adjusted $R^2$ |
| --- | --- |
| Never decreases with more variables | Can decrease |
| Compare models with the same $n$ | Compare models with different $k$ |
| Simple regression | Multiple regression |

Example Calculation

Given:

  • $SST = 500$
  • $SSE = 125$

$$R^2 = 1 - \frac{125}{500} = 1 - 0.25 = 0.75$$

The model explains 75% of the variance in $Y$.
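The same arithmetic in a couple of lines of Python:

```python
# Worked example from above: SST = 500, SSE = 125
sst, sse = 500, 125
r2 = 1 - sse / sst
print(r2)  # 0.75
```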

Limitations

$R^2$ Can Be Misleading

  1. Nonlinear relationships: a high $R^2$ doesn't guarantee a linear model is appropriate
  2. Outliers: can artificially inflate or deflate $R^2$
  3. Not about causation: a high $R^2$ does not imply a causal relationship
  4. Sample-specific: may not generalize well to new data
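Point 1 can be demonstrated concretely: fitting a straight line to perfectly quadratic (made-up) data still yields a high $R^2$, even though the linear model is clearly the wrong form:

```python
def linear_r2(x, y):
    """R^2 of a simple least-squares line, via r^2 = Sxy^2 / (Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

x = list(range(1, 11))
y = [v ** 2 for v in x]           # perfectly quadratic, not linear
print(round(linear_r2(x, y), 3))  # 0.95
```

An $R^2$ near 0.95 looks excellent on paper, yet a residual plot would show systematic curvature, which is why the checks listed under "Beyond $R^2$" matter.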

What $R^2$ Doesn't Tell You

  • If the relationship is appropriate (linear vs nonlinear)
  • If the predictors are correct
  • If the model is good for prediction
  • Whether assumptions are met

Good vs Bad $R^2$

"Good" $R^2$ depends on the field:

| Field | Typical $R^2$ |
| --- | --- |
| Physical sciences | 0.90+ |
| Biological sciences | 0.70-0.90 |
| Social sciences | 0.50-0.70 |
| Psychology/Marketing | 0.30-0.50 |

Beyond $R^2$

Always examine:

  • Residual plots
  • Practical significance
  • Prediction accuracy
  • Model assumptions

A model with moderate $R^2$ may be more useful than one with high $R^2$ if it is simpler and more interpretable.