Alternate method: the chi-squared distribution

Inference for Categorical Data in R

Andrew Bray

Assistant Professor of Statistics at Reed College

Approximation distributions: normal

  • Statistics: $\hat{p}, \hat{p}_{1} - \hat{p}_{2}$

categorical-inference-ch3v3-normal-curve.png

Approximation distributions: chi-squared

  • Statistics: $\chi^{2}$
  • Shape is determined by degrees of freedom
  • $df = (nrows - 1) \times (ncols - 1)$
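As a quick illustration of the degrees-of-freedom formula, here is a minimal base R sketch. The table `tab` and its counts are hypothetical placeholders; only its 3 x 3 shape matters.

```r
# Hypothetical 3 x 3 contingency table; the counts are placeholders,
# only the number of rows and columns enters the formula
tab <- matrix(1:9, nrow = 3, ncol = 3)

# df = (nrows - 1) * (ncols - 1)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
df
```

The resulting `df` is the value you would pass to `dchisq()` or `pchisq()` when working with a table of that shape.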

categorical-inference-ch3v3-chisq-curve.png

H-test via approximation

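library(dplyr)    # %>%
library(infer)    # specify(), hypothesize(), generate(), calculate()
library(ggplot2)  # ggplot(), geom_density(), stat_function()

# Permutation null distribution of the chi-squared statistic
null_spac <- gss_party %>%
  specify(natspac ~ party) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "Chisq")

# Overlay the chi-squared density (blue) on the permutation null;
# df = (3 - 1) * (3 - 1) = 4 for a 3 x 3 table.
# chi_obs_spac is the observed statistic, computed earlier (red line).
ggplot(null_spac, aes(x = stat)) +
  geom_density() +
  stat_function(
    fun = dchisq,
    args = list(df = 4),
    color = "blue"
  ) +
  geom_vline(xintercept = chi_obs_spac, color = "red")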

ch3v3-htest-approx.png

H-test via approximation

gss_party %>%
  select(natarms, party) %>%
  table()
             party
natarms        D  I  R
  TOO LITTLE  17 20 24
  ABOUT RIGHT 14 28  8
  TOO MUCH    12 24  2
pchisq(chi_obs_spac, df = 4)
X-squared 
0.1430612
1 - pchisq(chi_obs_spac, df = 4)
X-squared 
0.8569388
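The p-value of a chi-squared test is the upper-tail area, which is why the second call subtracts from 1. A minimal sketch of the same calculation using `pchisq()`'s `lower.tail` argument follows; the value of `chi_obs` here is hypothetical and stands in for an observed statistic such as `chi_obs_spac`.

```r
# Hypothetical observed chi-squared statistic (an assumption for illustration)
chi_obs <- 1.5

# Lower tail: P(X <= chi_obs), which is not the p-value
p_lower <- pchisq(chi_obs, df = 4)

# Upper tail: P(X > chi_obs), the p-value of the test
p_upper <- pchisq(chi_obs, df = 4, lower.tail = FALSE)

# The two tails sum to 1, so this matches 1 - pchisq(chi_obs, df = 4)
all.equal(p_upper, 1 - p_lower)
```

Using `lower.tail = FALSE` avoids the subtraction and is slightly more accurate numerically when the p-value is very small.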

ch3v3-htest-approx.png

The chi-squared distribution

Becomes a good approximation when:

  • $\text{expected count} \geq 5$ in each cell
  • $df \geq 2$
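These two conditions can be checked directly from a contingency table. The sketch below is one way to do it in base R, re-entering the `natarms` by `party` counts shown earlier by hand; the object names `counts` and `expected` are assumptions made for this example.

```r
# Counts copied from the natarms-by-party table shown earlier
counts <- matrix(
  c(17, 20, 24,
    14, 28,  8,
    12, 24,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    natarms = c("TOO LITTLE", "ABOUT RIGHT", "TOO MUCH"),
    party   = c("D", "I", "R")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Condition 1: every expected count is at least 5
all(expected >= 5)

# Condition 2: degrees of freedom are at least 2
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
df >= 2
```

The same expected counts are available from `chisq.test(counts)$expected` if you prefer not to compute them by hand.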

categorical-inference-ch3v3-chisq-curve.png

Let's practice!
