
Chi-Square Tests

Tests for categorical data: goodness-of-fit and test of independence.

Overview

Chi-square tests are used for categorical data to test goodness-of-fit (single variable) and independence (two variables).

Test Statistic

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:

  • $O$ = observed frequency
  • $E$ = expected frequency
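As a minimal sketch, the statistic can be computed directly from paired observed and expected counts. The function name `chi_square_statistic` and the three-category counts are hypothetical, chosen only for illustration.

```python
def chi_square_statistic(observed, expected):
    """Return sum((O - E)^2 / E) over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (illustration only).
stat = chi_square_statistic([18, 22, 20], [20, 20, 20])
print(round(stat, 2))  # (4/20) + (4/20) + 0 = 0.4
```

Each category contributes its squared deviation scaled by the expected count, so a deviation of a given size counts for more where fewer observations were expected.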

Goodness-of-Fit Test

Tests whether observed frequencies match expected frequencies.

Hypotheses

  • $H_0$: The data follow the specified distribution
  • $H_1$: The data do not follow the specified distribution

Expected Frequencies

$$E = n \times p$$

Where $n$ = total observations, $p$ = hypothesized proportion for the category

Degrees of Freedom

$$df = k - 1$$

Where $k$ = number of categories

Test of Independence

Tests whether two categorical variables are associated.

Hypotheses

  • $H_0$: The variables are independent
  • $H_1$: The variables are associated (dependent)

Expected Frequencies

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

Degrees of Freedom

$$df = (r - 1)(c - 1)$$

Where $r$ = number of rows, $c$ = number of columns

Conditions

  1. Random sampling
  2. Independent observations
  3. Expected frequencies $\geq 5$ (all cells)
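The third condition is easy to check mechanically once the expected counts are known. The sketch below assumes the expected counts are held as a list of rows; the helper name `small_expected_cells` and the 2×2 values are hypothetical.

```python
def small_expected_cells(expected, minimum=5):
    """Return (row, column, value) for every expected count below `minimum`."""
    return [
        (i, j, e)
        for i, row in enumerate(expected)
        for j, e in enumerate(row)
        if e < minimum
    ]


# Hypothetical 2x2 table of expected counts (illustration only).
expected = [[12.0, 3.5], [8.0, 6.5]]
print(small_expected_cells(expected))  # [(0, 1, 3.5)]
```

An empty list means every cell meets the threshold. Random sampling and independence of observations depend on how the data were collected and cannot be checked from the counts alone.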

Examples

Example 1: Goodness-of-Fit

Testing if a die is fair (600 rolls):

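| Face | 1 | 2 | 3 | 4 | 5 | 6 |
|------|-----|-----|-----|-----|-----|-----|
| $O$ | 92 | 108 | 97 | 103 | 88 | 112 |
| $E$ | 100 | 100 | 100 | 100 | 100 | 100 |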
$$\chi^2 = \frac{(92-100)^2}{100} + \frac{(108-100)^2}{100} + \frac{(97-100)^2}{100} + \frac{(103-100)^2}{100} + \frac{(88-100)^2}{100} + \frac{(112-100)^2}{100}$$

$$= 0.64 + 0.64 + 0.09 + 0.09 + 1.44 + 1.44 = 4.34$$

$$df = 6 - 1 = 5, \quad \chi^2_{\text{crit}}(\alpha = 0.05) = 11.07$$

$$4.34 < 11.07 \Rightarrow \text{Fail to reject } H_0$$

At the 5% significance level there is no significant evidence that the die is unfair.
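The same calculation can be reproduced in a few lines of plain Python. This is a sketch that hard-codes the critical value quoted in the example instead of computing it from the chi-square distribution; the variable names are illustrative.

```python
observed = [92, 108, 97, 103, 88, 112]   # counts for faces 1-6 from the example
proportions = [1 / 6] * 6                # H0: every face is equally likely

n = sum(observed)                        # 600 rolls
expected = [n * p for p in proportions]  # E = n * p, about 100 per face
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                   # k - 1

CRITICAL_5PCT_DF5 = 11.07                # critical value quoted in the example
reject = chi2 > CRITICAL_5PCT_DF5
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {reject}")
```

Changing `proportions` to any other hypothesized distribution that sums to 1 gives the corresponding goodness-of-fit statistic, although the critical value then has to match the new degrees of freedom.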

Example 2: Test of Independence

Survey: Gender vs Product Preference (300 people)

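|        | Product A | Product B | Product C | Total |
|--------|-----------|-----------|-----------|-------|
| Male   | 50        | 40        | 60        | 150   |
| Female | 30        | 50        | 70        | 150   |
| Total  | 80        | 90        | 130       | 300   |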

Expected values:

$$E(\text{Male, A}) = \frac{150 \times 80}{300} = 40$$

$$E(\text{Male, B}) = \frac{150 \times 90}{300} = 45$$

$$E(\text{Male, C}) = \frac{150 \times 130}{300} = 65$$

The Female row has the same row total (150), so its expected values are also 40, 45, and 65.

$$\chi^2 = \frac{(50-40)^2}{40} + \frac{(40-45)^2}{45} + \frac{(60-65)^2}{65} + \frac{(30-40)^2}{40} + \frac{(50-45)^2}{45} + \frac{(70-65)^2}{65}$$

$$= 2.5 + 0.56 + 0.38 + 2.5 + 0.56 + 0.38 = 6.88$$

$$df = (2-1)(3-1) = 2, \quad \chi^2_{\text{crit}}(\alpha = 0.05) = 5.99$$

$$6.88 > 5.99 \Rightarrow \text{Reject } H_0$$

At the 5% significance level, gender and product preference are associated.
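For a contingency table, the expected counts come from the row and column totals. The sketch below recomputes this example in plain Python; it reuses the critical value quoted above, and the variable names are illustrative.

```python
observed = [
    [50, 40, 60],  # Male:   Product A, B, C
    [30, 50, 70],  # Female: Product A, B, C
]

row_totals = [sum(row) for row in observed]        # [150, 150]
col_totals = [sum(col) for col in zip(*observed)]  # [80, 90, 130]
grand_total = sum(row_totals)                      # 300

# E = (row total * column total) / grand total, for every cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)

CRITICAL_5PCT_DF2 = 5.99                           # critical value quoted in the example
reject = chi2 > CRITICAL_5PCT_DF2
print(expected)
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {reject}")
```

Because both row totals are 150, the two rows of `expected` are identical, which is why the Male and Female terms in the hand calculation come in matching pairs.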

Example 3: Homogeneity Test

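The calculation is the same as for the test of independence (same expected frequencies, test statistic, and degrees of freedom), but the question differs: it tests whether the distribution of one categorical variable is the same across several groups.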

Interpretation

For Goodness-of-Fit

  • Large $\chi^2$ → Poor fit → Reject $H_0$
  • Small $\chi^2$ → Good fit → Fail to reject $H_0$

For Independence

  • Large $\chi^2$ → Variables are associated
  • Small $\chi^2$ → Variables appear independent

Effect Size: Cramér's V

For test of independence:

$$V = \sqrt{\frac{\chi^2}{n \times \min(r-1, c-1)}}$$
| $V$ | Interpretation |
|-----|----------------|
| 0.1 | Small          |
| 0.3 | Medium         |
| 0.5 | Large          |
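As a short sketch, Cramér's V can be computed for Example 2 ($\chi^2 = 6.88$, $n = 300$, a 2×3 table). The function name `cramers_v` is an illustrative choice.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


v = cramers_v(6.88, 300, 2, 3)
print(round(v, 3))  # 0.151
```

On the scale above this falls between small and medium, so the association in Example 2 is statistically significant but modest in size.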

Yates' Correction

For 2×2 tables:

$$\chi^2 = \sum \frac{(\lvert O - E \rvert - 0.5)^2}{E}$$

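Subtracting 0.5 from each absolute deviation lowers the statistic, which reduces the Type I error rate in small samples, where the continuous chi-square distribution only approximates the discrete counts. The correction makes the test more conservative.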
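The sketch below applies the corrected formula exactly as written above and compares it with the uncorrected statistic. The 2×2 counts and the function name `chi_square_2x2` are hypothetical, used only for illustration.

```python
def chi_square_2x2(table, yates=True):
    """Chi-square statistic for a 2x2 table, optionally with Yates' correction."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("expected a 2x2 table of observed counts")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    shift = 0.5 if yates else 0.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n  # expected count for the cell
            stat += (abs(table[i][j] - e) - shift) ** 2 / e
    return stat


# Hypothetical 2x2 counts (illustration only).
table = [[12, 8], [6, 14]]
uncorrected = chi_square_2x2(table, yates=False)
corrected = chi_square_2x2(table, yates=True)
print(f"uncorrected = {uncorrected:.3f}, corrected = {corrected:.3f}")
```

For these counts every expected deviation is 3, so the correction shrinks each squared deviation from 9 to 6.25 and the statistic drops from about 3.64 to about 2.53. The sketch does not handle the case where a deviation is smaller than 0.5.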