The General Social Survey

Inference for Categorical Data in R

Andrew Bray

Assistant Professor of Statistics at Reed College

Exploring GSS

library(dplyr)
glimpse(gss)

Observations: 3,300
Variables: 25
$ id       <dbl> 518, 1092, 2094, 229, 979, 554, 491, 319, 3143, 1...
$ year     <dbl> 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1...
$ age      <fct> 49, 22, 26, 75, 71, 33, 56, 33, 69, 40, 44, 42, 5...
$ class    <fct> WORKING CLASS, WORKING CLASS, WORKING CLASS, LOWE...
$ degree   <fct> HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL, LT HIGH SC...
$ sex      <fct> MALE, MALE, MALE, MALE, FEMALE, FEMALE, MALE, FEM...

$ happy    <fct> HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, ...

Exploring GSS

library(ggplot2)
# Keep only the 2016 responses, then plot the count of each answer
gss2016 <- filter(gss, year == 2016)

ggplot(gss2016, aes(x = happy)) +
  geom_bar()

Exploring GSS

p_hat <- gss2016 %>%
  summarize(prop_happy = mean(happy == "HAPPY")) %>%
  pull()

p_hat

0.7733333
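
Why `mean()` gives a proportion here: comparing `happy` to "HAPPY" produces a vector of TRUE/FALSE values, and the mean of a logical vector is the share of TRUEs. A minimal sketch on a made-up four-person sample (the values and the "UNHAPPY" label are illustrative assumptions, not GSS data):

```r
# Hypothetical toy responses, not taken from the GSS
happy <- factor(c("HAPPY", "HAPPY", "UNHAPPY", "HAPPY"))

# Element-wise comparison yields a logical vector
is_happy <- happy == "HAPPY"

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of "HAPPY"
prop_happy <- mean(is_happy)
prop_happy
```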

General 95% confidence interval

$$(\hat{p} - 2 \times SE, \hat{p} + 2 \times SE)$$

Sample proportion plus or minus two standard errors
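
The multiplier 2 is a rounded version of the standard normal quantile `qnorm(0.975)`, which is about 1.96. The formula can be sketched as a small helper; the function name `two_se_interval` and the input numbers are hypothetical, chosen only to keep the arithmetic easy to follow:

```r
# Hypothetical helper: point estimate plus or minus two standard errors
two_se_interval <- function(p_hat, se) {
  c(lower = p_hat - 2 * se, upper = p_hat + 2 * se)
}

# Made-up inputs: a sample proportion of 0.60 with a standard error of 0.05
ci <- two_se_interval(p_hat = 0.60, se = 0.05)
ci

# The exact normal multiplier that "2" approximates
z <- qnorm(0.975)
z
```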

Bootstrap
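
The idea of the bootstrap is to treat the sample as a stand-in for the population: resample the data with replacement, recompute the proportion, and repeat many times. A rough base R sketch of that idea, using simulated responses rather than the GSS (the sample size of 150, the 0.75 probability, the "UNHAPPY" label, and the helper `one_boot_prop` are all assumptions made for illustration):

```r
set.seed(1)

# Simulated stand-in for the survey responses (hypothetical data, not the GSS)
responses <- sample(c("HAPPY", "UNHAPPY"), size = 150, replace = TRUE,
                    prob = c(0.75, 0.25))

# One bootstrap replicate: resample with replacement at the original size
# and recompute the proportion of "HAPPY"
one_boot_prop <- function(x) {
  resampled <- sample(x, size = length(x), replace = TRUE)
  mean(resampled == "HAPPY")
}

# Repeat 500 times to build the bootstrap distribution of the proportion
boot_props <- replicate(500, one_boot_prop(responses))

# The standard deviation of the replicates estimates the standard error
se_boot <- sd(boot_props)
se_boot
```

The spread of `boot_props` approximates how much the sample proportion would vary from sample to sample, which is the quantity the SE in the interval formula is meant to capture.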

Bootstrap Confidence Interval

library(infer)
boot <- gss2016 %>%
  specify(response = happy, 
          success = "HAPPY") %>%
  generate(reps = 500, 
           type = "bootstrap") %>%
  calculate(stat = "prop")

boot

Response: happy (factor)
# A tibble: 500 x 2
   replicate  stat
       <int> <dbl>
 1         1 0.827
 2         2 0.740
 3         3 0.780
 4         4 0.773
 5         5 0.747
 6         6 0.753

Bootstrap Confidence Interval

ggplot(boot, aes(x = stat)) +
  geom_density()

Bootstrap Confidence Interval

SE <- boot %>%
  summarize(sd(stat)) %>%
  pull()

SE

0.03482251

$$(\hat{p} - 2 \times SE, \hat{p} + 2 \times SE)$$

# Bootstrap resampling is random, so SE and these endpoints vary slightly from run to run
c(p_hat - 2 * SE, p_hat + 2 * SE)

0.7051883 0.8412784

Let's practice!
