Inference for Categorical Data in R
Andrew Bray
Assistant Professor of Statistics at Reed College
library(dplyr)
library(ggplot2)
glimpse(gss)
Observations: 3,300
Variables: 25
$ id <dbl> 518, 1092, 2094, 229, 979, 554, 491, 319, 3143, 1...
$ year <dbl> 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1...
$ age <fct> 49, 22, 26, 75, 71, 33, 56, 33, 69, 40, 44, 42, 5...
$ class <fct> WORKING CLASS, WORKING CLASS, WORKING CLASS, LOWE...
$ degree <fct> HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL, LT HIGH SC...
$ sex <fct> MALE, MALE, MALE, MALE, FEMALE, FEMALE, MALE, FEM...
$ happy <fct> HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, ...
# Keep only the 2016 respondents
gss2016 <- filter(gss, year == 2016)
# Bar chart of the happiness responses
ggplot(gss2016, aes(x = happy)) +
  geom_bar()
# Sample proportion of respondents who answered "HAPPY"
p_hat <- gss2016 %>%
  summarize(prop_happy = mean(happy == "HAPPY")) %>%
  pull()
p_hat
0.7733333
$$(\hat{p} - 2 \times SE, \hat{p} + 2 \times SE)$$
Sample proportion plus or minus two standard errors
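The multiplier of two can be motivated with a quick check. This is only a sketch, and it rests on an assumption the slide does not state: that the sampling distribution of the proportion is roughly bell-shaped, so normal probabilities apply.

```r
# Share of a normal distribution lying within two standard deviations
# of its mean (assumes the sampling distribution is approximately normal).
coverage <- pnorm(2) - pnorm(-2)
coverage

# Multiplier that would give exactly 95% coverage under the same assumption.
exact_multiplier <- qnorm(0.975)
exact_multiplier
```

Under that assumption, an interval of plus or minus two standard errors captures about 95% of the sampling distribution, which is why two is a convenient stand-in for the slightly smaller exact multiplier.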
library(infer)
boot <- gss2016 %>%
  specify(response = happy,
          success = "HAPPY") %>%
  generate(reps = 500,
           type = "bootstrap") %>%
  calculate(stat = "prop")
boot
Response: happy (factor)
# A tibble: 500 x 2
replicate stat
<int> <dbl>
1 1 0.827
2 2 0.740
3 3 0.780
4 4 0.773
5 5 0.747
6 6 0.753
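To see what the specify / generate / calculate pipeline is doing, the same idea can be sketched in base R. Everything here is illustrative: the vector `happy` is a made-up stand-in for the survey column (150 responses, 116 of them "HAPPY", chosen only so the sample proportion matches the p_hat printed earlier), the second category label "UNHAPPY" is an assumption, and `boot_prop()` is a hypothetical helper, not an infer function.

```r
set.seed(1)

# Made-up stand-in for the happy column; counts and the "UNHAPPY" label
# are assumptions for illustration, not taken from the GSS data.
happy <- c(rep("HAPPY", 116), rep("UNHAPPY", 34))

# One bootstrap replicate: resample the responses with replacement,
# keeping the original sample size, then compute the proportion of successes.
boot_prop <- function(x, success) {
  resample <- sample(x, size = length(x), replace = TRUE)
  mean(resample == success)
}

# Repeat 500 times, mirroring generate(reps = 500, type = "bootstrap").
boot_stats <- replicate(500, boot_prop(happy, "HAPPY"))

head(boot_stats)
```

Each element of `boot_stats` plays the role of one row of the stat column above: a proportion computed from one resampled data set.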
ggplot(boot, aes(x = stat)) +
  geom_density()
SE <- boot %>%
  summarize(sd(stat)) %>%
  pull()
SE
0.03482251
$$(\hat{p} - 2 \times SE, \hat{p} + 2 \times SE)$$
c(p_hat - 2 * SE, p_hat + 2 * SE)
0.7051883 0.8412784
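The two steps above (take the standard deviation of the bootstrap statistics, then go two standard errors either side of the estimate) can be bundled into one small function. This is a minimal sketch: `se_interval()` is a hypothetical helper written for illustration, not part of infer, and the input is toy data rather than GSS results.

```r
# Hypothetical helper: standard-error interval from bootstrap statistics.
se_interval <- function(stats, estimate, multiplier = 2) {
  se <- sd(stats)  # bootstrap standard error
  c(lower = estimate - multiplier * se,
    upper = estimate + multiplier * se)
}

# Toy bootstrap statistics (not GSS data); their standard deviation is 0.1.
toy_stats <- c(0.4, 0.5, 0.6)
se_interval(toy_stats, estimate = 0.5)
```

With the objects from the slides, the equivalent call would be se_interval(boot$stat, p_hat). Because the bootstrap resamples at random, the endpoints will shift slightly from run to run.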