Simple random and systematic sampling

Sampling in R

Richie Cotton

Data Evangelist at DataCamp

Simple random sampling

A hand picking a folded piece of paper out of a raffle jar.

Lottery balls rolling.

Simple random sampling of coffees

Coffee beans arranged in rows and columns.

Coffee beans arranged in rows and columns, some of which are grayed out.

Simple random sampling in R

library(dplyr)
set.seed(19000113)  # fix the seed so the same rows are drawn on every run
coffee_ratings %>% 
  slice_sample(n = 5)
  total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1            81.00    SL14            Uganda  7.33   6.92       7.17 7.42    7.42
2            85.00 Caturra          Colombia  8.00   7.92       7.75 7.75    7.83
3            85.25 Bourbon         Guatemala  8.00   7.92       7.75 7.92    7.83
4            81.42  Catuai         Guatemala  7.42   7.33       7.08 7.33    7.25
5            82.75 Caturra          Honduras  7.58   7.50       7.42 7.50    7.50
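
For comparison, the same idea can be sketched in base R. This is a minimal illustration on a hypothetical toy data frame (`toy_ratings` and its values are made up here, not part of `coffee_ratings`): `sample()` draws distinct row numbers without replacement, which is also the default behaviour of `slice_sample()`.

```r
# Hypothetical toy data standing in for coffee_ratings (values are made up)
set.seed(2024)
toy_ratings <- data.frame(
  rowid = 1:20,
  aroma = round(runif(20, min = 7, max = 9), 2)
)

# Draw 5 distinct row numbers; every row is equally likely to be picked
sampled_rows <- sample(nrow(toy_ratings), size = 5)

# Keep only the sampled rows
toy_sample <- toy_ratings[sampled_rows, ]
toy_sample
```

Changing the seed changes which rows are drawn, but the sample always has 5 rows with no repeats.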

Systematic sampling

Coffee beans arranged in rows and columns.

Coffee beans arranged in rows and columns, most of which are grayed out save for those on a diagonal line.

Adding a row ID column

library(tibble)
coffee_ratings <- coffee_ratings %>%
  rowid_to_column()
# A tibble: 1,338 x 9
   rowid total_cup_points variety country_of_origin aroma flavor aftertaste  body balance
   <int>            <dbl> <chr>   <chr>             <dbl>  <dbl>      <dbl> <dbl>   <dbl>
 1     1             90.6 NA      Ethiopia           8.67   8.83       8.67  8.5     8.42
 2     2             89.9 Other   Ethiopia           8.75   8.67       8.5   8.42    8.42
 3     3             89.8 Bourbon Guatemala          8.42   8.5        8.42  8.33    8.42
 4     4             89   NA      Ethiopia           8.17   8.58       8.42  8.5     8.25
 5     5             88.8 Other   Ethiopia           8.25   8.5        8.25  8.42    8.33
...

Systematic sampling in R

sample_size <- 5
pop_size <- nrow(coffee_ratings)      # 1338
interval <- pop_size %/% sample_size  # 267 (integer division)

Systematic sampling in R (continued)

row_indexes <- seq_len(sample_size) * interval
# 267  534  801 1068 1335
coffee_ratings %>% 
  slice(row_indexes)
 # A tibble: 5 x 9
  rowid total_cup_points variety country_of_origin aroma flavor aftertaste  body balance
  <int>            <dbl> <chr>   <chr>             <dbl>  <dbl>      <dbl> <dbl>   <dbl>
1   267             83.9 NA      Colombia           7.92   7.67       7.5   7.58    7.67
2   534             82.9 Bourbon Brazil             7.67   7.58       7.5   7.58    7.5 
3   801             82   Gesha   Malawi             7.5    7.42       7.33  7.33    7.5 
4  1068             80.6 NA      Colombia           7.08   7.25       7     7.08    7.33
5  1335             78.1 NA      Ecuador            7.5    7.67       7.75  5.17    5.25
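
The interval and index steps can be wrapped into one helper. This is a minimal base R sketch, assuming a hypothetical function name `systematic_sample()` and a toy population with the same number of rows as `coffee_ratings`; it is not a function from dplyr or tibble.

```r
# Hypothetical helper: take every `interval`-th row of a data frame
systematic_sample <- function(df, sample_size) {
  # Integer division gives the gap between sampled rows
  interval <- nrow(df) %/% sample_size
  # Row numbers interval, 2 * interval, ..., sample_size * interval
  row_indexes <- seq_len(sample_size) * interval
  df[row_indexes, , drop = FALSE]
}

# Toy population with 1338 rows, the size reported for coffee_ratings
toy_population <- data.frame(rowid = 1:1338)

toy_systematic <- systematic_sample(toy_population, sample_size = 5)
toy_systematic$rowid  # 267 534 801 1068 1335, the same indexes as above
```

Because the starting point is fixed at the first multiple of the interval, the same rows are returned on every run; no random seed is involved.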

The trouble with systematic sampling

library(ggplot2)
coffee_ratings %>% 
  ggplot(aes(x = rowid, y = aftertaste)) +
  geom_point() +
  geom_smooth()

Systematic sampling is only safe if this scatter plot shows no pattern, that is, if a row's position in the dataset is unrelated to the variable of interest.

Scatterplot of aftertaste scores versus rowids.
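
To see how a pattern in row order can mislead a systematic sample, here is a deliberately contrived base R sketch. The data (`toy_cycle`) are hypothetical and not taken from `coffee_ratings`: the scores repeat in a cycle whose length equals the sampling interval, so every sampled row lands on the same point in the cycle.

```r
# Hypothetical data: scores cycle 1, 2, 3, 4, 5 down the rows
toy_cycle <- data.frame(
  rowid = 1:100,
  score = rep(1:5, times = 20)
)

cycle_sample_size <- 20
cycle_interval <- nrow(toy_cycle) %/% cycle_sample_size  # 5, same as the cycle length
cycle_indexes <- seq_len(cycle_sample_size) * cycle_interval

# Every sampled row sits at the end of a cycle, so every score is 5
systematic_scores <- toy_cycle$score[cycle_indexes]

mean(toy_cycle$score)    # population mean: 3
mean(systematic_scores)  # systematic sample mean: 5
```

The sample mean overstates the population mean because the row order, not chance, decided which scores were seen.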

Making systematic sampling safe

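shuffled <- coffee_ratings %>%
  slice_sample(prop = 1) %>%  # sampling 100% of rows shuffles their order
  select(-rowid) %>%          # drop the old row IDs
  rowid_to_column()           # add new IDs that follow the shuffled order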
shuffled %>% 
  ggplot(aes(x = rowid, y = aftertaste)) +
  geom_point() +
  geom_smooth()

Shuffling the rows and then sampling systematically is equivalent to simple random sampling.

Scatterplot of aftertaste scores versus rowids after shuffling the dataset.
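
The shuffle-then-sample idea can also be sketched in base R. This is an illustrative example on hypothetical data (`toy_scores` is made up and deliberately ordered in a repeating pattern); `sample(nrow(df))` returns a random permutation of the row numbers, which plays the role of `slice_sample(prop = 1)`.

```r
set.seed(42)

# Hypothetical data whose scores follow a repeating pattern down the rows
toy_scores <- data.frame(
  rowid = 1:100,
  score = rep(1:5, times = 20)
)

# Shuffle all rows, then renumber them so rowid follows the new order
shuffled_toy <- toy_scores[sample(nrow(toy_scores)), ]
shuffled_toy$rowid <- seq_len(nrow(shuffled_toy))

# Systematic sample of 20 rows from the shuffled data
shuffled_interval <- nrow(shuffled_toy) %/% 20
shuffled_indexes <- seq_len(20) * shuffled_interval
shuffled_sample <- shuffled_toy[shuffled_indexes, ]

# Compare with the population mean of 3; the result depends on the seed
mean(shuffled_sample$score)
```

After shuffling, which rows fall on the multiples of the interval is down to chance, so the sample no longer inherits the pattern in the original row order.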

Let's practice!
