Regression & Correlation (Topic #30 of 33)

Correlation

Measuring linear relationships: Pearson's r, Spearman's ρ, and interpreting correlation coefficients.

Overview

Correlation measures the strength and direction of the linear relationship between two quantitative variables.

Pearson Correlation Coefficient ($r$)

Formula

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \times \sum (y_i - \bar{y})^2}}$$

Or equivalently:

$$r = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{[n\sum x_i^2 - (\sum x_i)^2][n\sum y_i^2 - (\sum y_i)^2]}}$$
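The computational formula translates directly into code. A minimal sketch in plain Python (no external libraries; the function name `pearson_r` is just for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson's r via the computational (second) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# y = 2x + 1 is exactly linear, so r = 1
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # → 1.0
```

In practice a library routine such as `scipy.stats.pearsonr` would be used, but the hand computation makes the formula concrete.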

Properties

  • Range: $-1 \leq r \leq 1$
  • $r = 1$: Perfect positive linear relationship
  • $r = -1$: Perfect negative linear relationship
  • $r = 0$: No linear relationship

Interpretation

| $\lvert r \rvert$ | Strength |
|---|---|
| 0.00 - 0.19 | Very weak |
| 0.20 - 0.39 | Weak |
| 0.40 - 0.59 | Moderate |
| 0.60 - 0.79 | Strong |
| 0.80 - 1.00 | Very strong |

Direction

| Sign of $r$ | Direction | Interpretation |
|---|---|---|
| $r > 0$ | Positive | As $X$ increases, $Y$ tends to increase |
| $r < 0$ | Negative | As $X$ increases, $Y$ tends to decrease |

Spearman's Rank Correlation ($\rho$)

For ordinal data or monotonic (possibly non-linear) relationships:

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

Where $d_i$ = difference between the ranks of corresponding values (this shortcut formula assumes no tied ranks).
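The rank-based formula can be sketched in plain Python (assuming no tied values; `spearman_rho` is an illustrative name):

```python
def spearman_rho(x, y):
    """Spearman's rho via the no-ties shortcut formula."""
    def ranks(v):
        # rank 1 = smallest value
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A monotone but nonlinear relationship still gives rho = 1
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # → 1.0
```

Library implementations such as `scipy.stats.spearmanr` also handle ties correctly, which this sketch does not.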

Assumptions (Pearson)

  1. Continuous data
  2. Linear relationship
  3. Bivariate normality (for inference)
  4. No significant outliers

Hypothesis Testing

Hypotheses

  • $H_0$: $\rho = 0$ (no linear correlation)
  • $H_1$: $\rho \neq 0$ (or a one-tailed alternative)

Test Statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

With $df = n - 2$.
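A quick sketch of computing this statistic in plain Python (`t_stat` is an illustrative name; the resulting value is compared against a $t$ table with $df = n - 2$):

```python
import math

def t_stat(r, n):
    """Test statistic for H0: rho = 0; compare to t with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The values used in Example 2 below
print(round(t_stat(0.65, 20), 2))  # → 3.63
```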

Examples

Example 1: Calculating $r$

| $x$ | $y$ |
|---|---|
| 2 | 4 |
| 3 | 5 |
| 5 | 7 |
| 7 | 9 |
| 8 | 10 |
$$\sum x = 25, \quad \sum y = 35, \quad \sum xy = 201$$
$$\sum x^2 = 151, \quad \sum y^2 = 271, \quad n = 5$$
$$r = \frac{5(201) - (25)(35)}{\sqrt{[5(151) - 625][5(271) - 1225]}} = \frac{1005 - 875}{\sqrt{(130)(130)}} = \frac{130}{130} = 1.00$$

A perfect positive linear correlation (each $y$ equals $x + 2$).
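The sums and $r$ for this table can be verified directly in plain Python:

```python
x = [2, 3, 5, 7, 8]
y = [4, 5, 7, 9, 10]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
print(sx, sy, sxy, sxx, syy)  # → 25 35 201 151 271
print(r)                      # → 1.0
```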

Example 2: Testing Significance

$r = 0.65$, $n = 20$, $\alpha = 0.05$

$$t = \frac{0.65\sqrt{20-2}}{\sqrt{1 - 0.65^2}} = \frac{0.65 \times 4.243}{\sqrt{0.5775}} = \frac{2.76}{0.76} = 3.63$$
$$df = 18, \quad t_{\text{crit}} = 2.101$$
$$3.63 > 2.101 \Rightarrow \text{Reject } H_0$$

Significant correlation exists.

Important Cautions

Correlation ≠ Causation

Just because $X$ and $Y$ are correlated does NOT mean:

  • $X$ causes $Y$
  • $Y$ causes $X$
  • There's any causal connection

A third variable may explain both (confounding).

Restricted Range

Limiting the range of $X$ or $Y$ artificially reduces $r$.

Outliers

Single outliers can dramatically change $r$.
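A small sketch of this effect in plain Python (`pearson` is an illustrative helper using the computational formula):

```python
def pearson(x, y):
    """Pearson's r via the computational formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = ((n * sum(a * a for a in x) - sx ** 2)
           * (n * sum(b * b for b in y) - sy ** 2)) ** 0.5
    return num / den

x, y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]  # perfectly linear
print(pearson(x, y))                       # → 1.0
# adding a single extreme point flips r from 1 to negative
print(pearson(x + [6], y + [-20]))
```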

Nonlinear Relationships

$r$ only measures linear relationships. A perfect but non-linear relationship (e.g., a symmetric curve) may have $r \approx 0$.
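For example, the perfect parabola $y = x^2$ over a symmetric range gives $r = 0$ exactly, despite the relationship being deterministic (sketch in plain Python; `pearson` is an illustrative helper):

```python
def pearson(x, y):
    """Pearson's r via the computational formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = ((n * sum(a * a for a in x) - sx ** 2)
           * (n * sum(b * b for b in y) - sy ** 2)) ** 0.5
    return num / den

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]   # perfect parabola y = x^2
print(pearson(x, y))     # → 0.0
```

(Spearman's $\rho$ would also be low here, since the relationship is not monotonic over the full range.)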

Coefficient of Determination

$$r^2 = \text{coefficient of determination}$$

Interpretation: the proportion of variance in $Y$ explained by $X$.

Example: $r = 0.8 \Rightarrow r^2 = 0.64$, so 64% of the variance in $Y$ is explained by $X$.