The General Social Survey

Inference for Categorical Data in R

Andrew Bray

Assistant Professor of Statistics at Reed College

Exploring GSS

library(dplyr)
glimpse(gss)
Observations: 3,300
Variables: 25
$ id       <dbl> 518, 1092, 2094, 229, 979, 554, 491, 319, 3143, 1...
$ year     <dbl> 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1982, 1...
$ age      <fct> 49, 22, 26, 75, 71, 33, 56, 33, 69, 40, 44, 42, 5...
$ class    <fct> WORKING CLASS, WORKING CLASS, WORKING CLASS, LOWE...
$ degree   <fct> HIGH SCHOOL, HIGH SCHOOL, HIGH SCHOOL, LT HIGH SC...
$ sex      <fct> MALE, MALE, MALE, MALE, FEMALE, FEMALE, MALE, FEM...
$ happy    <fct> HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, HAPPY, ...

Exploring GSS

library(ggplot2)
# Keep only the 2016 respondents
gss2016 <- filter(gss, year == 2016)
# Bar chart of counts for each level of happy
ggplot(gss2016, aes(x = happy)) +
  geom_bar()

[Figure: bar chart of happy for gss2016]

Exploring GSS

p_hat <- gss2016 %>%
  summarize(prop_happy = mean(happy == "HAPPY")) %>%
  pull()
p_hat
0.7733333
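The `mean(happy == "HAPPY")` idiom works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch with a made-up vector (`toy_happy` and its "UNHAPPY" level are hypothetical, not GSS data):

```r
# Hypothetical responses, for illustration only (not GSS data)
toy_happy <- factor(c("HAPPY", "UNHAPPY", "HAPPY", "HAPPY"))

# Comparing a factor to a string yields a logical vector
is_happy <- toy_happy == "HAPPY"

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUEs
toy_p_hat <- mean(is_happy)
toy_p_hat
```

Three of the four toy responses are "HAPPY", so `toy_p_hat` is 0.75.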

General 95% confidence interval

$$(\hat{p} - 2 \times SE, \hat{p} + 2 \times SE)$$

Sample proportion plus or minus two standard errors
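The multiplier 2 is commonly read as a rounded version of the 97.5th percentile of the standard normal distribution, which is why the interval is approximately 95% when the sampling distribution is roughly bell-shaped. A quick check in base R (a sketch of that rationale, not part of the slide's own code):

```r
# 97.5th percentile of the standard normal: leaves 2.5% in each tail
z <- qnorm(0.975)
z

# Share of a normal distribution within +/- 2 standard deviations
coverage_2se <- pnorm(2) - pnorm(-2)
coverage_2se
```

The first value is close to 1.96 and the second is a little above 0.95, so rounding up to 2 makes the interval slightly wider than an exact 95% normal interval.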

Bootstrap

[Figure sequence: bootstrap illustration]

Bootstrap Confidence Interval

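library(infer)
boot <- gss2016 %>%
  # Declare the response and which level counts as a success
  specify(response = happy,
          success = "HAPPY") %>%
  # Draw 500 bootstrap resamples of the data
  generate(reps = 500,
           type = "bootstrap") %>%
  # Compute the proportion of successes in each resample
  calculate(stat = "prop")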
boot
Response: happy (factor)
# A tibble: 500 x 2
   replicate  stat
       <int> <dbl>
 1         1 0.827
 2         2 0.740
 3         3 0.780
 4         4 0.773
 5         5 0.747
 6         6 0.753
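To make the resampling step concrete, the same idea can be sketched by hand in base R: draw a sample of the same size with replacement, recompute the proportion, and repeat. The vector `toy_sample` below is hypothetical (150 responses, 116 of them "HAPPY", chosen only so the proportion is close to the `p_hat` printed earlier), and `one_rep` is an illustrative helper, not an infer function.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical sample, for illustration only (not the GSS data)
toy_sample <- c(rep("HAPPY", 116), rep("UNHAPPY", 34))

# One bootstrap replicate: resample n values with replacement,
# then recompute the proportion of "HAPPY"
one_rep <- function(x) {
  resampled <- sample(x, size = length(x), replace = TRUE)
  mean(resampled == "HAPPY")
}

# 500 replicates, matching reps = 500 in the pipeline
boot_stats <- replicate(500, one_rep(toy_sample))

length(boot_stats)
sd(boot_stats)  # bootstrap standard error for this toy sample
```

Each element of `boot_stats` plays the role of one row of `stat` in `boot`; the spread of those values is what the standard error summarizes.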

Bootstrap Confidence Interval

ggplot(boot, aes(x = stat)) +
  geom_density()

[Figure: density plot of the bootstrap proportions in boot]

Bootstrap Confidence Interval

SE <- boot %>%
  summarize(sd(stat)) %>%
  pull()
SE
0.03482251

$$(\hat{p} - 2 \times SE, \hat{p} + 2 \times SE)$$

c(p_hat - 2 * SE, p_hat + 2 * SE)
0.7051883 0.8412784
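As a check on the arithmetic, the interval can be recomputed from the rounded values of `p_hat` and `SE` printed above. Because `generate()` resamples at random, `SE` and the endpoints shift slightly from run to run, so figures taken from different runs need not agree to the last digit. A sketch that treats the printed values as plain numbers (`p_hat_printed` and `se_printed` are names introduced here for illustration):

```r
# Values copied from the printed output above
p_hat_printed <- 0.7733333
se_printed <- 0.03482251

# Margin of error: two standard errors
margin <- 2 * se_printed

# Lower and upper endpoints, roughly 0.704 and 0.843
ci <- c(lower = p_hat_printed - margin, upper = p_hat_printed + margin)
ci
```

Read as percentages, this says the share of respondents who are happy is plausibly between about 70% and 84%.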

Let's practice!
