Simple random and systematic sampling

Sampling in R

Richie Cotton

Data Evangelist at DataCamp

Simple random sampling

A hand picking a folded piece of paper out of a raffle jar.

Lottery balls rolling.

Simple random sampling of coffees

Coffee beans arranged in rows and columns.

Coffee beans arranged in rows and columns, some of which are grayed out.

Simple random sampling in R

library(dplyr)

# Fix the seed so the same 5 rows are drawn on every run
set.seed(19000113)
coffee_ratings %>% 
  slice_sample(n = 5)

  total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1            81.00    SL14            Uganda  7.33   6.92       7.17 7.42    7.42
2            85.00 Caturra          Colombia  8.00   7.92       7.75 7.75    7.83
3            85.25 Bourbon         Guatemala  8.00   7.92       7.75 7.92    7.83
4            81.42  Catuai         Guatemala  7.42   7.33       7.08 7.33    7.25
5            82.75 Caturra          Honduras  7.58   7.50       7.42 7.50    7.50
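As a rough base R sketch of the same idea, the block below draws a simple random sample from a small made-up data frame and checks that re-using a seed reproduces the draw. The names toy_ratings and draw_sample() are hypothetical and not part of the course; sample() stands in for slice_sample() here.

```r
# Hypothetical stand-in for coffee_ratings: 20 rows with made-up scores.
toy_ratings <- data.frame(
  id = 1:20,
  score = seq(70, 89, by = 1)
)

# Hypothetical helper: set the seed, then pick n distinct rows,
# each row being equally likely to be chosen.
draw_sample <- function(df, n, seed) {
  set.seed(seed)
  df[sample(nrow(df), size = n), ]
}

first_draw <- draw_sample(toy_ratings, n = 5, seed = 2024)
second_draw <- draw_sample(toy_ratings, n = 5, seed = 2024)

# Same seed, same rows.
identical(first_draw, second_draw)
```

Without the set.seed() call, each run would generally return a different set of rows.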

Systematic sampling

Coffee beans arranged in rows and columns.

Coffee beans arranged in rows and columns, most of which are grayed out save for those on a diagonal line.

Adding a row ID column

library(tibble)
coffee_ratings <- coffee_ratings %>%
  rowid_to_column()

# A tibble: 1,338 x 9
   rowid total_cup_points variety country_of_origin aroma flavor aftertaste  body balance
   <int>            <dbl> <chr>   <chr>             <dbl>  <dbl>      <dbl> <dbl>   <dbl>
 1     1             90.6 NA      Ethiopia           8.67   8.83       8.67  8.5     8.42
 2     2             89.9 Other   Ethiopia           8.75   8.67       8.5   8.42    8.42
 3     3             89.8 Bourbon Guatemala          8.42   8.5        8.42  8.33    8.42
 4     4             89   NA      Ethiopia           8.17   8.58       8.42  8.5     8.25
 5     5             88.8 Other   Ethiopia           8.25   8.5        8.25  8.42    8.33
...

Systematic sampling in R

sample_size <- 5
pop_size <- nrow(coffee_ratings)

interval <- pop_size %/% sample_size
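To make the arithmetic concrete, here is the interval calculation worked through with the 1,338 rows reported for coffee_ratings. The variable leftover is an illustrative addition, not something the slides define.

```r
pop_size <- 1338      # number of rows reported for coffee_ratings
sample_size <- 5

# Integer division drops the fractional part: 1338 / 5 = 267.6 -> 267
interval <- pop_size %/% sample_size

# Rows beyond sample_size * interval can never be selected
leftover <- pop_size %% sample_size

interval   # 267
leftover   # 3
```

Because 5 * 267 = 1335, the last three rows fall outside the reach of this scheme.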

Systematic sampling in R (continued)

row_indexes <- seq_len(sample_size) * interval
row_indexes

[1]  267  534  801 1068 1335

coffee_ratings %>% 
  slice(row_indexes)

# A tibble: 5 x 9
  rowid total_cup_points variety country_of_origin aroma flavor aftertaste  body balance
  <int>            <dbl> <chr>   <chr>             <dbl>  <dbl>      <dbl> <dbl>   <dbl>
1   267             83.9 NA      Colombia           7.92   7.67       7.5   7.58    7.67
2   534             82.9 Bourbon Brazil             7.67   7.58       7.5   7.58    7.5 
3   801             82   Gesha   Malawi             7.5    7.42       7.33  7.33    7.5 
4  1068             80.6 NA      Colombia           7.08   7.25       7     7.08    7.33
5  1335             78.1 NA      Ecuador            7.5    7.67       7.75  5.17    5.25
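The steps above can be collected into a single helper. This is a minimal base R sketch under the same logic as the slides (interval by integer division, then every interval-th row); the function name systematic_sample() and the toy data frame are hypothetical.

```r
# Hypothetical helper wrapping the interval and row-index steps.
systematic_sample <- function(df, sample_size) {
  stopifnot(sample_size >= 1, sample_size <= nrow(df))
  interval <- nrow(df) %/% sample_size
  row_indexes <- seq_len(sample_size) * interval
  df[row_indexes, , drop = FALSE]
}

# Made-up population of 23 rows.
toy <- data.frame(rowid = 1:23, value = (1:23)^2)

picked <- systematic_sample(toy, sample_size = 4)
picked$rowid   # 5 10 15 20
```

With 23 rows and a sample size of 4, the interval is 5, so rows 21 to 23 are never picked.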

The trouble with systematic sampling

library(ggplot2)

# Plot a variable of interest against row order to check for a pattern
coffee_ratings %>% 
  ggplot(aes(x = rowid, y = aftertaste)) +
  geom_point() +
  geom_smooth()

Systematic sampling is only safe if this scatter plot shows no pattern, that is, if the values are unrelated to row order.

Scatterplot of aftertaste scores versus rowids.
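A small illustration of why a pattern matters, using a made-up population whose scores are sorted from best to worst (an assumption for this sketch, not the coffee data). Since the first selected row is always row number interval, the highest-scoring rows are never sampled and the sample mean comes out too low.

```r
# Made-up population: 100 rows, scores sorted in descending order.
sorted_pop <- data.frame(rowid = 1:100, score = 100:1)

sample_size <- 5
interval <- nrow(sorted_pop) %/% sample_size        # 20
row_indexes <- seq_len(sample_size) * interval      # 20 40 60 80 100

sys_sample <- sorted_pop[row_indexes, ]

# A numeric stand-in for "seeing a pattern" in the scatter plot:
# here score and row order are perfectly (negatively) correlated.
cor(sorted_pop$rowid, sorted_pop$score)   # -1

mean(sorted_pop$score)   # 50.5
mean(sys_sample$score)   # 41
```

The systematic sample underestimates the population mean because row order carries information about the score.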

Making systematic sampling safe

# Shuffle all rows, then replace the old row IDs with new ones
shuffled <- coffee_ratings %>%
  slice_sample(prop = 1) %>% 
  select(-rowid) %>% 
  rowid_to_column()

shuffled %>% 
  ggplot(aes(x = rowid, y = aftertaste)) +
  geom_point() +
  geom_smooth()

Shuffling the rows and then sampling systematically is equivalent to simple random sampling.

Scatterplot of aftertaste scores versus rowids after shuffling the dataset.
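The shuffle-then-sample idea can be sketched in base R on a made-up sorted population. Here sample() is used in place of slice_sample(prop = 1), and the data frame ordered_pop is hypothetical.

```r
set.seed(123)

# Made-up population whose scores are sorted by row order.
ordered_pop <- data.frame(rowid = 1:100, score = 100:1)

# Shuffle every row, then assign fresh row IDs in the new order.
shuffled <- ordered_pop[sample(nrow(ordered_pop)), ]
shuffled$rowid <- seq_len(nrow(shuffled))
rownames(shuffled) <- NULL

# Systematic sampling on the shuffled rows.
sample_size <- 5
interval <- nrow(shuffled) %/% sample_size
sys_after_shuffle <- shuffled[seq_len(sample_size) * interval, ]

sys_after_shuffle
```

The selected row positions are still 20, 40, 60, 80 and 100, but after shuffling those positions hold randomly chosen members of the population, so row order no longer says anything about the score.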

Let's practice!
