Sampling in R
Richie Cotton
Data Evangelist at DataCamp
Sampling without replacement
Sampling with replacement ("resampling")
[Figure: Population and Sample panels, one pair for each sampling method]
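To make the two sampling methods concrete, here is a minimal sketch using base R's sample(). The vector population is a hypothetical toy example, not part of the coffee data, and the seed is arbitrary.

```r
# Hypothetical toy population of ten labelled items (not from the coffee data)
set.seed(2024)
population <- 1:10

# Without replacement: each item can be drawn at most once
without_repl <- sample(population, size = 5, replace = FALSE)

# With replacement: the same item may be drawn more than once
with_repl <- sample(population, size = 10, replace = TRUE)

anyDuplicated(without_repl)  # 0 means no repeated items
table(with_repl)             # counts above 1 mark repeated draws
```

Drawing ten items with replacement from ten will usually repeat some items and miss others, which is the behavior the resampling slides rely on.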
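Think of the coffee_ratings data as a sample of a larger population of all coffees.
library(dplyr)
library(tibble)
# Keep three columns and add a row ID so resampled rows can be traced back
coffee_focus <- coffee_ratings %>%
  select(variety, country_of_origin, flavor) %>%
  rowid_to_column()
glimpse(coffee_focus)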
Rows: 1,338
Columns: 4
$ rowid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
$ variety <chr> NA, "Other", "Bourbon", NA, "Other", NA, "Other", N...
$ country_of_origin <chr> "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia", "E...
$ flavor <dbl> 8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33, 8.6...
coffee_resamp <- coffee_focus %>%
  slice_sample(prop = 1, replace = TRUE)
# A tibble: 1,338 x 4
rowid variety country_of_origin flavor
<int> <chr> <chr> <dbl>
1 1253 Bourbon Guatemala 6.92
2 186 Caturra Colombia 7.58
3 1185 Bourbon Guatemala 7.42
4 1273 NA Philippines 6.5
5 1042 Caturra Honduras 7.33
6 195 Caturra Guatemala 7.75
7 1219 Typica Mexico 7
8 952 Caturra Honduras 7.5
9 41 Caturra Thailand 8.33
10 460 Caturra Honduras 7.67
# ... with 1,328 more rows
coffee_resamp %>%
  count(rowid, sort = TRUE)
# A tibble: 844 x 2
rowid n
<int> <int>
1 704 5
2 913 5
3 1070 5
4 16 4
5 180 4
6 230 4
7 234 4
8 342 4
9 354 4
10 423 4
# ... with 834 more rows
coffee_resamp %>%
  summarize(
    coffees_included = n_distinct(rowid),
    coffees_not_included = n() - coffees_included
  )
# A tibble: 1 x 2
coffees_included coffees_not_included
<int> <int>
1 844 494
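The split between included and missing coffees can be checked against a short calculation. This sketch is not from the slides. It assumes every row is equally likely on each draw, so a given row is missed on all n draws with probability (1 - 1/n)^n. The variable names are illustrative.

```r
n <- 1338  # number of rows in coffee_focus, from the glimpse() output

# Probability that a given row is never picked in n draws with replacement
p_never_picked <- (1 - 1 / n)^n

# Expected share and expected count of distinct rows in one resample
expected_share <- 1 - p_never_picked
expected_distinct <- n * expected_share

round(expected_share, 3)
round(expected_distinct)
```

The expected share is close to 1 - exp(-1), about 0.632, and the expected count comes out near 846, in line with the 844 distinct coffees seen in this particular resample.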
Bootstrapping is the opposite of sampling from a population.
Sampling: going from a population to a smaller sample.
Bootstrapping: building up a theoretical population from your sample.
Bootstrapping use case
The resulting statistics are called bootstrap statistics. Taken together, they form a bootstrap distribution, which shows how much the statistic varies from resample to resample.
# Step 3. Repeat many times
mean_flavors_1000 <- replicate(
  n = 1000,
  expr = {
    coffee_focus %>%
      # Step 1. Resample
      slice_sample(prop = 1, replace = TRUE) %>%
      # Step 2. Calculate statistic
      summarize(mean_flavor = mean(flavor, na.rm = TRUE)) %>%
      pull(mean_flavor)
  })
bootstrap_distn <- tibble(
  resample_mean = mean_flavors_1000
)
ggplot(bootstrap_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.0025)
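The same three steps can be tried without the coffee_ratings data. The sketch below uses only base R and assumes a hypothetical vector, flavor_sample, of simulated scores standing in for the flavor column. The sample size, mean, and standard deviation are arbitrary choices, not values from the course.

```r
set.seed(42)
# Hypothetical stand-in for the flavor column: 200 simulated scores
flavor_sample <- rnorm(200, mean = 7.5, sd = 0.4)

# Step 3. Repeat many times
boot_means <- replicate(
  n = 1000,
  expr = {
    # Step 1. Resample with replacement, same size as the original sample
    resample <- sample(flavor_sample, size = length(flavor_sample), replace = TRUE)
    # Step 2. Calculate the statistic
    mean(resample)
  }
)

length(boot_means)   # one bootstrap statistic per replicate
mean(flavor_sample)  # mean of the original sample
mean(boot_means)     # center of the bootstrap distribution
summary(boot_means)  # spread of the bootstrap distribution
```

The bootstrap means should cluster around the mean of the original sample, because every resample is drawn from that sample and not from the wider population.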