Overview
The coefficient of determination, $R^2$, measures the proportion of variance in the dependent variable that is predictable from the independent variable(s).
Definition
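$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Where:
- $SSR$ = Sum of Squares Regression (explained)
- $SSE$ = Sum of Squares Error (unexplained)
- $SST$ = Sum of Squares Total, with $SST = SSR + SSE$ for a least-squares fit that includes an intercept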
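As a minimal sketch of the definition, the snippet below fits a least-squares line to a small made-up data set and computes the three sums of squares. The data values and the helper name `fit_line` are illustrative assumptions, not part of the original notes.

```python
# Illustrative data (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
mean_y = sum(y) / len(y)

sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation
ssr = sum((fi - mean_y) ** 2 for fi in y_hat)          # explained by the fit
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # left unexplained

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
print(r2_from_ssr, r2_from_sse)
```

Both expressions should agree up to floating-point error, because the fit includes an intercept and so the explained and unexplained parts add up to the total.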
Relationship to Correlation
For simple linear regression:
$$R^2 = r^2$$
Where $r$ is the Pearson correlation coefficient.
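The identity between the squared Pearson correlation and the coefficient of determination can be checked numerically. The following is a sketch on made-up data; the variable names are illustrative only.

```python
import math

# Illustrative data (made up for this sketch).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 3.7, 4.1, 6.8, 7.2, 9.9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)   # this is also SST
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Coefficient of determination from a least-squares line with intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r2 = 1 - sse / syy

print(r ** 2, r2)
```

The two printed values should match up to rounding. The identity holds for a single predictor with an intercept; with several predictors there is no single $r$ to square.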
Interpretation
| $R^2$ Value | Interpretation |
|---|---|
| 0% | Model explains none of the variance |
| 25% | Model explains 25% of the variance |
| 50% | Model explains 50% of the variance |
| 75% | Model explains 75% of the variance |
| 100% | Model explains all the variance |
Example Statement
"$R^2 = 0.64$ means that 64% of the variation in $y$ is explained by the linear relationship with $x$."
Properties
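- Range: $0 \le R^2 \le 1$ for a least-squares model with an intercept, evaluated on the data used to fit it
- Non-negative in that setting (can be 0); it can be negative for a model fitted without an intercept or evaluated on new data, where predictions may be worse than the mean
- Higher $R^2$ = closer fit to the observed data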
- Dimensionless (no units)
Adjusted $R^2$
For multiple regression, adjusted $R^2$ accounts for the number of predictors:
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
Where:
- $n$ = sample size
- $k$ = number of predictors
Why Adjusted $R^2$?
- Regular $R^2$ never decreases when predictors are added, even irrelevant ones
- Adjusted $R^2$ penalizes unnecessary variables
- Adjusted $R^2$ can decrease if a new variable doesn't improve the fit enough to offset the penalty
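The penalty can be seen with a few lines of arithmetic. This is a sketch of the standard adjusted $R^2$ formula; the function name `adjusted_r2` and the numbers fed into it are hypothetical, chosen only to illustrate a predictor that barely improves the fit.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for sample size n and k predictors (requires n > k + 1)."""
    if n <= k + 1:
        raise ValueError("need n > k + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical scenario: a third predictor nudges R^2 from 0.750 to 0.752.
before = adjusted_r2(0.750, n=20, k=2)
after = adjusted_r2(0.752, n=20, k=3)

print(before, after)
print(after < before)  # the small gain does not offset the extra predictor
```

Here regular $R^2$ goes up slightly, while the adjusted value goes down, which is the behavior the list above describes.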
Comparison
| $R^2$ | Adjusted $R^2$ |
|---|---|
| Never decreases when variables are added | Can decrease |
| Compare models with the same number of predictors | Compare models with different numbers of predictors |
| Simple regression | Multiple regression |
Example Calculation
Given $SSR$ and $SST$ such that $SSR / SST = 0.75$:
$$R^2 = \frac{SSR}{SST} = 0.75$$
The model explains 75% of the variance in $y$.
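For a concrete instance of this calculation, assume hypothetical values $SST = 400$ and $SSE = 100$ (these numbers are illustrative and not taken from the original example):

$$SSR = SST - SSE = 400 - 100 = 300$$

$$R^2 = \frac{SSR}{SST} = \frac{300}{400} = 0.75 \qquad \text{or equivalently} \qquad R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{100}{400} = 0.75$$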
Limitations
Can Be Misleading
- Nonlinear relationships: High $R^2$ doesn't guarantee that a linear model is appropriate
- Outliers: Can artificially inflate or deflate $R^2$
- Not about causation: High $R^2$ $\neq$ causal relationship
- Sample-specific: May not generalize well
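The first point can be illustrated with a deliberately mis-specified model. The sketch below fits a straight line to data that is exactly quadratic (a made-up example): the fit statistic comes out high even though the residuals show an obvious curved pattern.

```python
x = list(range(11))          # 0, 1, ..., 10
y = [xi ** 2 for xi in x]    # exactly quadratic, no noise

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares straight line through the curved data.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
sst = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1 - sse / sst

print(round(r2, 3))                       # 0.928
print([round(e, 1) for e in residuals])   # positive at both ends, negative in the middle
```

A value near 0.93 would look like a good fit in isolation, yet the U-shaped residuals show that the straight line is the wrong functional form.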
What $R^2$ Doesn't Tell You
- If the relationship is appropriate (linear vs nonlinear)
- If the predictors are correct
- If the model is good for prediction
- Whether assumptions are met
Good vs Bad $R^2$
"Good" $R^2$ depends on the field:
| Field | Typical $R^2$ |
|---|---|
| Physical sciences | 0.90+ |
| Biological sciences | 0.70-0.90 |
| Social sciences | 0.50-0.70 |
| Psychology/Marketing | 0.30-0.50 |
Beyond $R^2$
Always examine:
- Residual plots
- Practical significance
- Prediction accuracy
- Model assumptions
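For the prediction-accuracy item, one option is to score predictions on held-out data using both $R^2$ and an error in the units of the response. The sketch below assumes hypothetical held-out values and two hypothetical sets of predictions; the helper names `r_squared` and `rmse` are illustrative.

```python
import math

def r_squared(y_true, y_pred):
    """1 - SSE/SST, with SST taken around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - sse / sst

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical held-out observations and predictions from two models.
y_test = [1.0, 2.0, 3.0]
good_pred = [1.1, 1.9, 3.2]
bad_pred = [3.0, 2.0, 1.0]   # worse than always predicting the mean

print(r_squared(y_test, good_pred), rmse(y_test, good_pred))
print(r_squared(y_test, bad_pred), rmse(y_test, bad_pred))
```

The second model yields a negative value on the held-out data, which signals predictions worse than the mean of the observations. Reporting an error measure such as RMSE alongside helps judge practical significance, since it is expressed in the units of the response.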
A model with moderate $R^2$ may be more useful than one with high $R^2$ if it's simpler and more interpretable.