Comparing sampling and bootstrap distributions

Sampling in R

Richie Cotton

Data Evangelist at DataCamp

Coffee focused subset

set.seed(19790801)
coffee_sample <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column() %>% 
  slice_sample(n = 500)
glimpse(coffee_sample)
Rows: 500
Columns: 4
$ rowid             <int> 10, 278, 458, 622, 131, 385, 1292, 47, 904, 1020, 5...
$ variety           <chr> "Other", "Bourbon", NA, "Caturra", "Caturra", "Yell...
$ country_of_origin <chr> "Ethiopia", "Guatemala", "Colombia", "Thailand", "C...
$ flavor            <dbl> 8.58, 7.75, 7.75, 7.50, 8.00, 7.83, 7.17, 8.08, 7.3...

The bootstrap of mean coffee flavors

mean_flavors_1000 <- replicate(
  n = 1000,
  expr = coffee_sample %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>%
    pull(mean_flavor)
)
bootstrap_distn <- tibble(
  resample_mean = mean_flavors_1000
)
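
Each replicate above resamples the rows of the sample with replacement, keeping the same number of rows. As a rough illustration of that single step, here is a minimal base R sketch on a hypothetical toy vector (`toy_flavors` is invented for this example and is not part of the coffee data). In a resample of this kind, some values usually appear more than once and others are left out.

```r
set.seed(1)

# Hypothetical toy data standing in for flavor scores
toy_flavors <- c(7.1, 7.4, 7.6, 7.9, 8.2)

# Draw row indices with replacement, same size as the original data
idx <- sample(seq_along(toy_flavors), size = length(toy_flavors), replace = TRUE)

# One bootstrap resample and its mean (one point in a bootstrap distribution)
resample <- toy_flavors[idx]
resample
mean(resample)
```

Repeating this many times and collecting the means is what the `replicate()` call does with `slice_sample(prop = 1, replace = TRUE)`.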

Mean flavor bootstrap distribution

ggplot(bootstrap_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.0025)

A histogram of the bootstrap distribution.

Sample, bootstrap distribution, population means

Sample mean

coffee_sample %>% 
  summarize(mean_flavor = mean(flavor)) %>% 
  pull(mean_flavor)
7.5163

Estimated population mean

bootstrap_distn %>% 
  summarize(mean_mean_flavor = mean(resample_mean)) %>% 
  pull(mean_mean_flavor)
7.5167

True population mean

coffee_ratings %>% 
  summarize(mean_flavor = mean(flavor)) %>% 
  pull(mean_flavor)
7.5260

Interpreting the means

  • The bootstrap distribution mean is usually almost identical to the sample mean.
  • The bootstrap distribution mean may not be a good estimate of the population mean.
  • Bootstrapping cannot correct biases due to differences between your sample and the population.
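
The points above can be checked with a small simulation. This is a sketch on simulated data, not the coffee dataset; `population`, `biased_sample`, and `boot_means` are hypothetical names. The sample is deliberately drawn only from the upper half of the population, so it is biased, and the bootstrap distribution centers on the biased sample mean rather than on the population mean.

```r
set.seed(2024)

# Simulated population of flavor-like scores (hypothetical values)
population <- rnorm(10000, mean = 7.5, sd = 0.35)

# Deliberately biased sample: only values above the population median are eligible
biased_sample <- sample(population[population > median(population)], size = 500)

# Bootstrap: resample the sample with replacement, same size, 1000 times
boot_means <- replicate(1000, mean(sample(biased_sample, replace = TRUE)))

sample_mean <- mean(biased_sample)
boot_mean <- mean(boot_means)
population_mean <- mean(population)

# Gap between bootstrap mean and sample mean (expected to be very small)
abs(boot_mean - sample_mean)

# Gap between bootstrap mean and population mean (expected to be much larger)
abs(boot_mean - population_mean)
```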

Sample sd vs bootstrap distribution sd

Sample standard deviation

coffee_sample %>% 
  summarize(sd_flavor = sd(flavor)) %>% 
  pull(sd_flavor)
0.3525

Estimated population standard deviation?

bootstrap_distn %>% 
  summarize(sd_mean_flavor = sd(resample_mean)) %>% 
  pull(sd_mean_flavor)
0.01572

Sample, bootstrap distribution, population standard deviations

Sample standard deviation

coffee_sample %>% 
  summarize(sd_flavor = sd(flavor)) %>% 
  pull(sd_flavor)
0.3525

Estimated population standard deviation

standard_error <- bootstrap_distn %>%
  summarize(sd_mean_flavor = sd(resample_mean)) %>% 
  pull(sd_mean_flavor)
standard_error * sqrt(500)
0.3515

True standard deviation

coffee_ratings %>%
  summarize(sd_flavor = sd(flavor)) %>%
  pull(sd_flavor)
0.3414

Standard error is the standard deviation of the statistic of interest.

Standard error times square root of sample size estimates the population standard deviation.

Interpreting the standard errors

  • Estimated standard error is the standard deviation of the bootstrap distribution for a sample statistic.
  • The bootstrap distribution standard error times the square root of the sample size estimates the standard deviation in the population.
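
The second point follows from the usual relationship for a sample mean: its standard error is roughly the population standard deviation divided by the square root of the sample size, so multiplying back by the square root of the sample size recovers an estimate of the standard deviation. A minimal sketch on simulated data (the vector `flavor_sim` is hypothetical and only stands in for the flavor scores):

```r
set.seed(42)
n <- 500

# Simulated sample standing in for the flavor scores (hypothetical values)
flavor_sim <- rnorm(n, mean = 7.5, sd = 0.35)

# Bootstrap distribution of the sample mean
boot_means <- replicate(2000, mean(sample(flavor_sim, size = n, replace = TRUE)))

# Standard error: standard deviation of the bootstrap distribution
standard_error <- sd(boot_means)

# Scale by sqrt(n) to estimate the standard deviation of the data
estimated_sd <- standard_error * sqrt(n)

# Compare with the sample standard deviation; the two should be close
c(sample_sd = sd(flavor_sim), estimated_sd = estimated_sd)
```

The scaling applies to the mean specifically; other statistics generally do not have the same square-root relationship.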

Let's practice!
