Cluster sampling

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

Stratified sampling vs. cluster sampling

Stratified sampling

  • Split the population into subgroups
  • Use simple random sampling on every subgroup

Cluster sampling

  • Use simple random sampling to pick some subgroups
  • Use simple random sampling on only those subgroups
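
The contrast between the two schemes can be sketched with the standard library alone. This is a minimal illustration, not code from the course: the toy `population`, the helper names (`group_by_subgroup`, `stratified_sample`, `cluster_sample`), and the sample sizes are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Hypothetical toy population: (subgroup, item_id) pairs, 4 subgroups of 10 items.
population = [(g, i) for g in ["A", "B", "C", "D"] for i in range(10)]

def group_by_subgroup(rows):
    """Map each subgroup label to the list of its rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)

def stratified_sample(rows, n_per_group, rng):
    """Simple random sample of n_per_group rows from every subgroup."""
    groups = group_by_subgroup(rows)
    return [r for g in groups.values() for r in rng.sample(g, n_per_group)]

def cluster_sample(rows, n_groups, n_per_group, rng):
    """Randomly pick n_groups subgroups, then sample rows only within those."""
    groups = group_by_subgroup(rows)
    chosen = rng.sample(sorted(groups), n_groups)
    return [r for name in chosen for r in rng.sample(groups[name], n_per_group)]

rng = random.Random(2021)
strat = stratified_sample(population, n_per_group=3, rng=rng)
clust = cluster_sample(population, n_groups=2, n_per_group=3, rng=rng)

print(sorted({g for g, _ in strat}))  # every subgroup is represented
print(sorted({g for g, _ in clust}))  # only the two chosen subgroups are represented
```

Stratified sampling touches every subgroup, so it needs data from all of them; cluster sampling only needs data from the subgroups picked in the first step.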

Varieties of coffee

Coffee beans arranged in rows and columns.

varieties_pop = list(coffee_ratings['variety'].unique())
[None, 'Other', 'Bourbon', 'Catimor', 
'Ethiopian Yirgacheffe','Caturra', 
'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona',
'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai',
'Pacamara', 'Typica', 'Sumatra Lintong',
'Mundo Novo', 'Java', 'Peaberry', 'Pacas',
'Mandheling', 'Ruiru 11', 'Arusha',
'Ethiopian Heirlooms', 'Moka Peaberry',
'Sulawesi', 'Blue Mountain', 'Marigojipe', 
'Pache Comun']

Stage 1: sampling for subgroups

Coffee beans arranged in rows and columns, all of which are grayed out save for three.

import random
varieties_samp = random.sample(varieties_pop, k=3)
['Hawaiian Kona', 'Bourbon', 'SL28']

Stage 2: sampling each group

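# Keep only the rows whose variety was sampled in stage 1
variety_condition = coffee_ratings['variety'].isin(varieties_samp)
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
coffee_ratings_cluster = coffee_ratings[variety_condition].copy()
# Drop categories that no longer occur (the .cat accessor needs a categorical column)
coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()
# Sample five rows from each remaining variety
coffee_ratings_cluster.groupby("variety")\
    .sample(n=5, random_state=2021)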

Stage 2 output

                    total_cup_points        variety       country_of_origin  ...
variety                                                                       
Bourbon       575              82.83        Bourbon               Guatemala   
              560              82.83        Bourbon               Guatemala   
              524              83.00        Bourbon               Guatemala   
              1140             79.83        Bourbon               Guatemala   
              318              83.67        Bourbon                  Brazil   
Hawaiian Kona 1291             73.67  Hawaiian Kona  United States (Hawaii)   
              1266             76.25  Hawaiian Kona  United States (Hawaii)   
              488              83.08  Hawaiian Kona  United States (Hawaii)   
              461              83.17  Hawaiian Kona  United States (Hawaii)   
              117              84.83  Hawaiian Kona  United States (Hawaii)   
SL28          137              84.67           SL28                   Kenya   
              452              83.17           SL28                   Kenya   
              224              84.17           SL28                   Kenya   
              66               85.50           SL28                   Kenya   
              559              82.83           SL28                   Kenya   
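
Both stages can be run end to end on a small stand-in dataset. The sketch below assumes a hypothetical `toy` DataFrame in place of `coffee_ratings` (four varieties, six rows each, made-up scores) and picks two clusters of three rows; only the pandas and `random` calls already used above are relied on, plus `observed=True` so that only categories present in the data form groups.

```python
import random
import pandas as pd

# Hypothetical stand-in for coffee_ratings: 4 varieties, 6 rows each.
toy = pd.DataFrame({
    "variety": pd.Categorical(["Bourbon", "SL28", "Gesha", "Java"] * 6),
    "total_cup_points": [80 + i * 0.25 for i in range(24)],
})

# Stage 1: simple random sample of subgroups (clusters).
random.seed(2021)
varieties_pop = list(toy["variety"].unique())
varieties_samp = random.sample(varieties_pop, k=2)

# Stage 2: keep only the sampled clusters, then sample rows within each.
cluster = toy[toy["variety"].isin(varieties_samp)].copy()
cluster["variety"] = cluster["variety"].cat.remove_unused_categories()
result = cluster.groupby("variety", observed=True).sample(n=3, random_state=2021)

# Each sampled variety contributes exactly three rows.
print(result["variety"].value_counts())
```

Seeding `random` makes the stage 1 choice reproducible, just as `random_state` does for stage 2.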

Multistage sampling

  • Cluster sampling is a type of multistage sampling
  • Multistage sampling can have more than two stages
  • E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
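
A three-stage version of the same idea might look like the sketch below. The nested `population` dictionary (states, counties, cities) and all sample sizes are invented for illustration; a real survey would draw these from actual administrative data.

```python
import random

# Hypothetical nested population: state -> county -> list of city names.
population = {
    f"state_{s}": {
        f"county_{s}{c}": [f"city_{s}{c}{k}" for k in range(5)]
        for c in range(4)
    }
    for s in range(6)
}

rng = random.Random(2021)

# Stage 1: sample states.
states = rng.sample(sorted(population), k=2)

# Stage 2: sample counties within each sampled state.
counties = {s: rng.sample(sorted(population[s]), k=2) for s in states}

# Stage 3: sample cities within each sampled county.
cities = [
    city
    for s, county_names in counties.items()
    for c in county_names
    for city in rng.sample(population[s][c], k=3)
]

print(len(cities))  # 2 states * 2 counties * 3 cities = 12
```

Each stage applies simple random sampling only inside the units kept by the previous stage, which is what keeps data collection cheap at large scale.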

Let's practice!
