Introduction to bootstrapping

Sampling in R

Richie Cotton

Data Evangelist at DataCamp

With or without

Sampling without replacement

Playing cards on a casino table.

Sampling with replacement ("resampling")

Four rolling dice.

Simple random sampling without replacement

Population

Coffee beans arranged in rows and columns.

Sample

Coffee beans arranged in rows and columns, most of which are grayed out.

Simple random sampling with replacement

Population

Coffee beans arranged in rows and columns.

Sample

A random sample of coffee beans, some of which are duplicates.
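
The difference between the two sampling schemes can be sketched with base R's sample() function. This is only an illustration: the beans vector is a hypothetical toy population, not part of the coffee_ratings data, and the seed is arbitrary.

```r
# Toy population of ten labelled beans (hypothetical, not from coffee_ratings)
beans <- paste0("bean_", 1:10)

set.seed(2024)  # arbitrary seed, only so the run is repeatable

# Without replacement: each bean can be drawn at most once
without_repl <- sample(beans, size = 5, replace = FALSE)

# With replacement: a bean goes back after each draw, so repeats are possible
with_repl <- sample(beans, size = 10, replace = TRUE)

anyDuplicated(without_repl)  # 0 means no bean was drawn twice
table(with_repl)             # draws per bean; counts above 1 are repeats
```

Note that drawing without replacement cannot return more items than the population holds, whereas drawing with replacement can return any number.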

Why sample with replacement?

Think of the coffee_ratings data as being a sample of a larger population of all coffees.
Think about each coffee in our sample as being representative of many different coffees that we don't have in our sample, but do exist in the population.
Sampling with replacement acts as a proxy for including other members of those groups of similar coffees in our sample.

Coffee data preparation

coffee_focus <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column()

glimpse(coffee_focus)

Rows: 1,338
Columns: 4
$ rowid             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ variety           <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other", N...
$ country_of_origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia", "E...
$ flavor            <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, 8.6...

Resampling with slice_sample()

coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)

coffee_resamp

# A tibble: 1,338 x 4
   rowid variety country_of_origin flavor
   <int> <chr>   <chr>              <dbl>
 1  1253 Bourbon Guatemala           6.92
 2   186 Caturra Colombia            7.58
 3  1185 Bourbon Guatemala           7.42
 4  1273 NA      Philippines         6.5 
 5  1042 Caturra Honduras            7.33
 6   195 Caturra Guatemala           7.75
 7  1219 Typica  Mexico              7   
 8   952 Caturra Honduras            7.5 
 9    41 Caturra Thailand            8.33
10   460 Caturra Honduras            7.67
# ... with 1,328 more rows
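
For readers without dplyr at hand, a rough base R analogue of this resampling step might look like the sketch below. It assumes a tiny stand-in data frame holding only the first six flavor values shown in the glimpse() output; toy_coffee, resampled_rows, and toy_resamp are illustrative names, not part of the original slides.

```r
# Small stand-in data frame (first six flavor values from the glimpse() output)
toy_coffee <- data.frame(
  rowid  = 1:6,
  flavor = c(8.83, 8.67, 8.50, 8.58, 8.50, 8.42)
)

set.seed(1)  # arbitrary seed for repeatability

# Draw row positions with replacement, as many as there are rows
resampled_rows <- sample(nrow(toy_coffee), size = nrow(toy_coffee), replace = TRUE)

# Index the data frame with those positions; repeated positions repeat rows
toy_resamp <- toy_coffee[resampled_rows, ]

# The resample has the same number of rows as the original
nrow(toy_resamp) == nrow(toy_coffee)
```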

Repeated coffees

coffee_resamp %>% 
  count(rowid, sort = TRUE)

# A tibble: 844 x 2
   rowid     n
   <int> <int>
 1   704     5
 2   913     5
 3  1070     5
 4    16     4
 5   180     4
 6   230     4
 7   234     4
 8   342     4
 9   354     4
10   423     4
# ... with 834 more rows

Missing coffees

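coffee_resamp %>%
  summarize(
    # Distinct original rows that appear at least once in the resample
    coffees_included = n_distinct(rowid),
    # n() equals the original row count here because prop = 1
    coffees_not_included = n() - coffees_included
  )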

# A tibble: 1 x 2
  coffees_included coffees_not_included
             <int>                <int>
1              844                  494
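
These counts are close to what probability predicts. When n rows are drawn with replacement from n rows, any given row is missed with probability (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows. The sketch below works this out for n = 1338; the variable names are illustrative. It suggests roughly 846 coffees included and 492 not included, in line with the 844 and 494 observed above.

```r
n <- 1338  # number of rows in coffee_focus, per the glimpse() output

# Probability that a given row is never picked in n draws with replacement
p_missed <- (1 - 1 / n)^n

# Expected counts of rows that do and do not appear in a resample
expected_included     <- n * (1 - p_missed)
expected_not_included <- n * p_missed

round(c(included = expected_included, not_included = expected_not_included))
```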

Bootstrapping

The opposite of sampling from a population.

Sampling: going from a population to a smaller sample.

Bootstrapping: building up a theoretical population from your sample.

Bootstrapping use case

Develop understanding of sampling variability using a single sample.

A cowboy boot.

Bootstrapping process

1. Make a resample of the same size as the original sample.
2. Calculate the statistic of interest for this bootstrap sample.
3. Repeat steps 1 and 2 many times.

The resulting statistics are called bootstrap statistics. Taken together, they form a bootstrap distribution, which shows how much the statistic varies from resample to resample.
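
The three steps can be sketched in base R on simulated data. This is a minimal illustration under stated assumptions: flavor_sample is a hypothetical sample of 200 simulated scores (not the real coffee data), and the seed, sample size, and 1000 replicates are arbitrary choices.

```r
set.seed(42)  # arbitrary seed for repeatability

# Hypothetical sample of 200 flavor-like scores (simulated, not real data)
flavor_sample <- rnorm(200, mean = 7.5, sd = 0.4)

# Step 3: repeat steps 1 and 2 many times
boot_means <- replicate(1000, {
  # Step 1: resample with replacement, same size as the original sample
  resample <- sample(flavor_sample, size = length(flavor_sample), replace = TRUE)
  # Step 2: calculate the statistic of interest
  mean(resample)
})

# Spread of the bootstrap statistics, one way to gauge sampling variability
sd(boot_means)
```

The vector boot_means holds the bootstrap statistics; a histogram of it would show the bootstrap distribution.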

Bootstrapping coffee mean flavor

# Step 3. Repeat many times
mean_flavors_1000 <- replicate(
  n = 1000,
  expr = {

    coffee_focus %>%
      # Step 1. Resample
      slice_sample(prop = 1, replace = TRUE) %>%

      # Step 2. Calculate statistic
      summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>% 
      pull(mean_flavor)

})

Bootstrap distribution histogram

bootstrap_distn <- tibble(
  resample_mean = mean_flavors_1000
)

ggplot(bootstrap_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.0025)

A histogram of the bootstrap distribution of the sample mean.

Let's practice!
