Introduction to bootstrapping

Sampling in R

Richie Cotton

Data Evangelist at DataCamp

With or without

Sampling without replacement

Playing cards on a casino table.

Sampling with replacement ("resampling")

Four rolling dice.
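
The two pictures map onto the replace argument of base R's sample(). The sketch below is a minimal illustration with made-up data (a deck numbered 1 to 52 and a six-sided die); the seed and the object names are assumptions, not part of the course material.

```r
# Hypothetical example data, not taken from the course
set.seed(2024)  # arbitrary seed, only so the draws are reproducible
cards <- 1:52

# Without replacement: a dealt card leaves the deck, so no repeats are possible
hand <- sample(cards, size = 5, replace = FALSE)
anyDuplicated(hand)  # 0 means no duplicated values

# With replacement: every roll sees all six faces, so repeats are allowed
rolls <- sample(1:6, size = 4, replace = TRUE)
rolls
```

With replace = FALSE, the sample can never be larger than the population. With replace = TRUE, it can be any size.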

Simple random sampling without replacement

Population

Coffee beans arranged in rows and columns.

Sample

Coffee beans arranged in rows and columns, most of which are grayed out.

Simple random sampling with replacement

Population

Coffee beans arranged in rows and columns.

Sample

A random sample of coffee beans, some of which are duplicates.

Why sample with replacement?

  • Think of the coffee_ratings data as being a sample of a larger population of all coffees.
  • Think about each coffee in our sample as being representative of many different coffees that we don't have in our sample, but do exist in the population.
  • Sampling with replacement is a proxy for including different members of these groups in our sample.

Coffee data preparation

coffee_focus <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column()
glimpse(coffee_focus)
Rows: 1,338
Columns: 4
$ rowid             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ variety           <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other", N...
$ country_of_origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia", "E...
$ flavor            <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, 8.6...

Resampling with slice_sample()

coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)
coffee_resamp
# A tibble: 1,338 x 4
   rowid variety country_of_origin flavor
   <int> <chr>   <chr>              <dbl>
 1  1253 Bourbon Guatemala           6.92
 2   186 Caturra Colombia            7.58
 3  1185 Bourbon Guatemala           7.42
 4  1273 NA      Philippines         6.5 
 5  1042 Caturra Honduras            7.33
 6   195 Caturra Guatemala           7.75
 7  1219 Typica  Mexico              7   
 8   952 Caturra Honduras            7.5 
 9    41 Caturra Thailand            8.33
10   460 Caturra Honduras            7.67
# ... with 1,328 more rows

Repeated coffees

coffee_resamp %>% 
  count(rowid, sort = TRUE)
# A tibble: 844 x 2
   rowid     n
   <int> <int>
 1   704     5
 2   913     5
 3  1070     5
 4    16     4
 5   180     4
 6   230     4
 7   234     4
 8   342     4
 9   354     4
10   423     4
# ... with 834 more rows

Missing coffees

coffee_resamp %>% 
  summarize(
    coffees_included = n_distinct(rowid),
    coffees_not_included = n() - coffees_included
  )
# A tibble: 1 x 2
  coffees_included coffees_not_included
             <int>                <int>
1              844                  494
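
The share of coffees left out is not a quirk of this particular resample. As a rough check, the sketch below computes the chance that any given row is never picked in n draws with replacement, taking n = 1338 from the glimpse() output; the variable names are illustrative only.

```r
n <- 1338  # number of rows in coffee_focus, per the glimpse() output

# Each draw misses a given row with probability 1 - 1/n,
# so n independent draws all miss it with probability (1 - 1/n)^n
p_missing <- (1 - 1 / n)^n
p_missing      # close to exp(-1), roughly 0.368

# Expected number of coffees absent from a full-size resample
n * p_missing  # roughly 492
```

The 494 missing coffees seen here sit close to that expectation, so roughly a third of the rows are expected to be absent from any full-size resample.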

Bootstrapping

The opposite of sampling from a population.

Sampling: going from a population to a smaller sample.

Bootstrapping: building up a theoretical population from your sample.

Bootstrapping use case

  • Develop understanding of sampling variability using a single sample.

A cowboy boot.

Bootstrapping process

  1. Make a resample of the same size as the original sample.
  2. Calculate the statistic of interest for this bootstrap sample.
  3. Repeat steps 1 and 2 many times.

The resulting statistics are called bootstrap statistics. Taken together, they form a bootstrap distribution, which shows how much the statistic varies from resample to resample.
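
The three steps above can be sketched in base R as a small reusable function. This is only an illustration under stated assumptions: bootstrap_stat() is a hypothetical helper, not a function from any package, and flavor_sample is simulated data standing in for a real sample.

```r
# Hypothetical stand-in for a single observed sample (not the coffee data)
set.seed(42)
flavor_sample <- rnorm(200, mean = 7.5, sd = 0.4)

# Hypothetical helper: bootstrap any statistic of a numeric vector
bootstrap_stat <- function(x, stat = mean, n_reps = 1000) {
  # Step 3. Repeat many times
  replicate(n_reps, {
    # Step 1. Resample, same size as the original sample
    resample <- sample(x, size = length(x), replace = TRUE)
    # Step 2. Calculate the statistic of interest
    stat(resample)
  })
}

boot_means <- bootstrap_stat(flavor_sample)
length(boot_means)  # one bootstrap statistic per repetition
```

Passing a different function as stat, for example median, bootstraps that statistic instead.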

Bootstrapping coffee mean flavor

# Step 3. Repeat many times
mean_flavors_1000 <- replicate(
  n = 1000,
  expr = {
    coffee_focus %>%
      # Step 1. Resample
      slice_sample(prop = 1, replace = TRUE) %>%
      # Step 2. Calculate statistic
      summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>% 
      pull(mean_flavor)
  })

Bootstrap distribution histogram

bootstrap_distn <- tibble(
  resample_mean = mean_flavors_1000
)
ggplot(bootstrap_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.0025)

A histogram of the bootstrap distribution of the sample mean.
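
One way to put a number on the variability shown in such a histogram is the standard deviation of the bootstrap statistics, which is commonly used as an estimate of the standard error. The sketch below uses simulated data as a stand-in for the flavor scores (an assumption, since the coffee data is not bundled here) and compares that estimate with the usual formula for the standard error of a mean.

```r
# Hypothetical data standing in for a single sample of flavor scores
set.seed(123)
flavor_sample <- rnorm(500, mean = 7.5, sd = 0.4)

# Bootstrap distribution of the sample mean
boot_means <- replicate(
  1000,
  mean(sample(flavor_sample, size = length(flavor_sample), replace = TRUE))
)

# Spread of the bootstrap distribution: an estimate of the standard error
boot_se <- sd(boot_means)

# Textbook standard error of the mean, for comparison
formula_se <- sd(flavor_sample) / sqrt(length(flavor_sample))

c(bootstrap = boot_se, formula = formula_se)
```

For the mean the two values should come out close to each other. The bootstrap version also applies to statistics that have no simple formula.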

Let's practice!
