Alternate method: the chi-squared distribution

Inference for Categorical Data in R

Andrew Bray

Assistant Professor of Statistics at Reed College

Approximation distributions: normal

Statistics: $\hat{p}, \hat{p}_{1} - \hat{p}_{2}$

Approximation distributions: chi-squared

Statistics: $\hat{\chi}^{2}$
Shape is determined by degrees of freedom
$df = (nrows - 1) \times (ncols - 1)$
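
As a quick illustration of the degrees-of-freedom formula, the sketch below builds a 3 x 3 table of counts in base R and applies it. The object names `counts` and `df_counts` are made up for this example; the counts are copied from the natarms-by-party table shown later in these slides.

```r
# Counts from the natarms-by-party table (rows: natarms, columns: party)
counts <- matrix(
  c(17, 20, 24,
    14, 28,  8,
    12, 24,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    natarms = c("TOO LITTLE", "ABOUT RIGHT", "TOO MUCH"),
    party   = c("D", "I", "R")
  )
)

# df = (nrows - 1) * (ncols - 1)
df_counts <- (nrow(counts) - 1) * (ncol(counts) - 1)
df_counts
```

For a 3 x 3 table this gives 4, which matches the `df = 4` used in the code on these slides.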

Hypothesis test via approximation

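# Assumes library(infer), library(dplyr) and library(ggplot2) are loaded
# Build a permutation null distribution of the chi-squared statistic
null_spac <- gss_party %>%
  specify(natspac ~ party) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "Chisq")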

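# Compare the permutation null (density curve) with the chi-squared
# approximation (blue); the red line marks the observed statistic
ggplot(null_spac, aes(x = stat)) +
  geom_density() +
  stat_function(
    fun = dchisq,
    args = list(df = 4),
    color = "blue"
  ) +
  geom_vline(xintercept = chi_obs_spac, color = "red")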

Hypothesis test via approximation

gss_party %>%
  select(natarms, party) %>%
  table()

             party
natarms        D  I  R
  TOO LITTLE  17 20 24
  ABOUT RIGHT 14 28  8
  TOO MUCH    12 24  2

pchisq(chi_obs_spac, df = 4)

X-squared 
0.1430612

1 - pchisq(chi_obs_spac, df = 4)

X-squared 
0.8569388
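
The complement `1 - pchisq(...)` is the area in the upper tail, which is the p-value for the test. Base R can return that area directly through the `lower.tail = FALSE` argument of `pchisq()`. A minimal sketch, using a made-up statistic value (`chi_obs_example` is hypothetical and is not the course's `chi_obs_spac`):

```r
# Hypothetical observed statistic, for illustration only
chi_obs_example <- 6.5

# Two equivalent ways to get the upper-tail area (the p-value)
p_complement <- 1 - pchisq(chi_obs_example, df = 4)
p_upper      <- pchisq(chi_obs_example, df = 4, lower.tail = FALSE)

# Both approaches should agree up to floating-point error
all.equal(p_complement, p_upper)
```

The `lower.tail = FALSE` form tends to be more accurate when the p-value is very small, because it avoids subtracting a number close to 1 from 1.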

The chi-squared distribution

Becomes a good approximation when:

$\text{expected count} \geq 5$ in each cell
$df \geq 2$
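
One way to check these two conditions in base R is sketched below, using the counts from the natarms-by-party table on the earlier slide. The names `counts`, `expected`, and `df_counts` are made up for this example. Under independence, the expected count in a cell is (row total x column total) / grand total.

```r
# Counts from the natarms-by-party table (rows: natarms, columns: party)
counts <- matrix(
  c(17, 20, 24,
    14, 28,  8,
    12, 24,  2),
  nrow = 3, byrow = TRUE
)

# Expected counts under independence:
# (row total * column total) / grand total, for every cell
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Condition 1: every expected count is at least 5
all(expected >= 5)

# Condition 2: degrees of freedom are at least 2
df_counts <- (nrow(counts) - 1) * (ncol(counts) - 1)
df_counts >= 2
```

The same expected counts are also available from `chisq.test(counts)$expected`, which can serve as a cross-check on the manual calculation.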

Let's practice!
