Sampling in R
Richie Cotton
Data Evangelist at DataCamp
library(dplyr)
set.seed(19000113)  # fix the seed so the sample is reproducible
coffee_ratings %>%
  slice_sample(n = 5)  # simple random sample of 5 rows
  total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1            81.00    SL14            Uganda  7.33   6.92       7.17 7.42    7.42
2            85.00 Caturra          Colombia  8.00   7.92       7.75 7.75    7.83
3            85.25 Bourbon         Guatemala  8.00   7.92       7.75 7.92    7.83
4            81.42  Catuai         Guatemala  7.42   7.33       7.08 7.33    7.25
5            82.75 Caturra          Honduras  7.58   7.50       7.42 7.50    7.50
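As a rough illustration of what a simple random sample does under the hood, the sketch below draws row indexes with base R's `sample()` and subsets a data frame with them. The `toy` data frame and its columns are hypothetical stand-ins for `coffee_ratings`, and this is base R rather than the `slice_sample()` call shown above, so the rows selected will not match the slide output.

```r
# Hypothetical stand-in for coffee_ratings: 20 rows with made-up scores.
set.seed(19000113)
toy <- data.frame(id = 1:20, score = round(runif(20, 70, 90), 2))

# Draw 5 distinct row indexes uniformly at random (no replacement),
# then keep only those rows.
sampled_rows <- sample(nrow(toy), size = 5)
simple_sample <- toy[sampled_rows, ]

print(simple_sample)
```

Every row has the same chance of being picked, and no row can appear twice, which matches the default without-replacement behavior of `slice_sample()`.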
library(tibble)
coffee_ratings <- coffee_ratings %>%
  rowid_to_column()
# A tibble: 1,338 x 9
  rowid total_cup_points variety country_of_origin aroma flavor aftertaste  body balance
  <int>            <dbl> <chr>   <chr>             <dbl>  <dbl>      <dbl> <dbl>   <dbl>
1     1             90.6 NA      Ethiopia           8.67   8.83       8.67   8.5    8.42
2     2             89.9 Other   Ethiopia           8.75   8.67        8.5  8.42    8.42
3     3             89.8 Bourbon Guatemala          8.42    8.5       8.42  8.33    8.42
4     4               89 NA      Ethiopia           8.17   8.58       8.42   8.5    8.25
5     5             88.8 Other   Ethiopia           8.25    8.5       8.25  8.42    8.33
...
sample_size <- 5
pop_size <- nrow(coffee_ratings)
# 1338
interval <- pop_size %/% sample_size
# 267
row_indexes <- seq_len(sample_size) * interval
# 267 534 801 1068 1335
coffee_ratings %>%
  slice(row_indexes)
# A tibble: 5 x 9
  rowid total_cup_points variety country_of_origin aroma flavor aftertaste  body balance
  <int>            <dbl> <chr>   <chr>             <dbl>  <dbl>      <dbl> <dbl>   <dbl>
1   267             83.9 NA      Colombia           7.92   7.67        7.5  7.58    7.67
2   534             82.9 Bourbon Brazil             7.67   7.58        7.5  7.58     7.5
3   801               82 Gesha   Malawi              7.5   7.42       7.33  7.33     7.5
4  1068             80.6 NA      Colombia           7.08   7.25          7  7.08    7.33
5  1335             78.1 NA      Ecuador             7.5   7.67       7.75  5.17    5.25
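The interval arithmetic above can be wrapped in a small helper. The following is a minimal base R sketch, not the slides' own code: `systematic_sample()` is a hypothetical function name, and `toy` is a stand-in data frame with the same number of rows as `coffee_ratings`.

```r
# Hypothetical helper: take every `interval`-th row, where the interval
# is the population size integer-divided by the sample size.
systematic_sample <- function(df, sample_size) {
  stopifnot(sample_size >= 1, sample_size <= nrow(df))
  interval <- nrow(df) %/% sample_size
  row_indexes <- seq_len(sample_size) * interval
  df[row_indexes, , drop = FALSE]
}

# Stand-in population with 1338 rows, like coffee_ratings.
toy <- data.frame(rowid = 1:1338)
result <- systematic_sample(toy, 5)

result$rowid
# 267 534 801 1068 1335
```

Because the interval comes from integer division, the last few rows of the population (here 1336 to 1338) can never be selected.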
library(ggplot2)
coffee_ratings %>%
  ggplot(aes(x = rowid, y = aftertaste)) +
  geom_point() +
  geom_smooth()
Systematic sampling is only safe if you don't see a pattern in this scatter plot. A trend or cycle across row IDs means the evenly spaced rows may not represent the population.
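To see how a pattern in row order can mislead a systematic sample, consider the sketch below. The `pop` data frame and its `sales` column are invented for illustration: every seventh row holds a high value, and the sampling interval happens to be seven as well.

```r
# Hypothetical population: 70 rows, with a spike on every 7th row.
pop <- data.frame(rowid = 1:70)
pop$sales <- ifelse(pop$rowid %% 7 == 0, 100, 10)

# Systematic sample of 10 rows: the interval is 70 %/% 10 = 7,
# so every selected row lands exactly on a spike.
sample_size <- 10
interval <- nrow(pop) %/% sample_size
row_indexes <- seq_len(sample_size) * interval
sys_sample <- pop[row_indexes, ]

mean(pop$sales)         # about 22.86
mean(sys_sample$sales)  # 100
```

The sample mean is far from the population mean because the sampling interval lines up with the cycle in the data, which is the kind of pattern the scatter plot is meant to reveal.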
shuffled <- coffee_ratings %>%
  slice_sample(prop = 1) %>%
  select(-rowid) %>%
  rowid_to_column()
shuffled %>%
  ggplot(aes(x = rowid, y = aftertaste)) +
  geom_point() +
  geom_smooth()
Shuffling the rows and then sampling systematically is equivalent to simple random sampling.
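The shuffle-then-sample idea can be sketched in base R as follows. The `toy` data frame, its `value` column, and the `orig_rowid` column are assumptions made for this example; the slides themselves use `slice_sample(prop = 1)` for the shuffle.

```r
set.seed(19000113)

# Hypothetical population whose values drift smoothly with row order.
toy <- data.frame(rowid = 1:1338, value = sin((1:1338) / 50))

# Shuffle: reorder the rows by a random permutation of the row indexes,
# keep the old position for reference, and assign fresh row IDs.
shuffled <- toy[sample(nrow(toy)), ]
shuffled$orig_rowid <- shuffled$rowid
shuffled$rowid <- seq_len(nrow(shuffled))

# Systematic sample of the shuffled rows.
sample_size <- 5
interval <- nrow(shuffled) %/% sample_size
row_indexes <- seq_len(sample_size) * interval
shuffled_sys <- shuffled[row_indexes, ]

shuffled_sys[, c("rowid", "orig_rowid", "value")]
```

The new `rowid` values are still evenly spaced, but the `orig_rowid` values they point to are scattered at random across the original ordering, so any trend in the original row order no longer affects the sample.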