Sampling in Python
James Chapman
Curriculum Manager, DataCamp
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]
subset_condition = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[subset_condition]
coffee_ratings_top.shape
(880, 8)
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)
coffee_ratings_srs.shape
(293, 8)
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=1/3, random_state=2021)
coffee_ratings_strat.shape
(293, 8)
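The two calls above differ only in the groupby step, so a small self-contained sketch may make the contrast easier to see. The `toy` DataFrame, its country labels and its scores are invented for illustration and are not the coffee data. The sketch assumes pandas 1.1 or later, where `groupby(...).sample()` is available.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for coffee_ratings_top (not the real dataset)
rng = np.random.default_rng(2021)
toy = pd.DataFrame({
    "country_of_origin": ["A"] * 60 + ["B"] * 30 + ["C"] * 9,
    "total_cup_points": rng.normal(82, 2, size=99),
})

# Simple random sample: one third of all rows, ignoring country
toy_srs = toy.sample(frac=1/3, random_state=2021)

# Proportional stratified sample: one third of the rows within each country
toy_strat = toy.groupby("country_of_origin").sample(frac=1/3, random_state=2021)

# Rows per country in the stratified sample
strat_counts = toy_strat["country_of_origin"].value_counts().sort_index()
print(len(toy_srs), len(toy_strat))
print(strat_counts)
```

In the stratified sample each country contributes a third of its own rows, so the group proportions match the population. In the simple random sample the per-country counts vary from draw to draw.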
import random
top_countries_samp = random.sample(top_counted_countries, k=2)
top_condition = coffee_ratings_top['country_of_origin'].isin(top_countries_samp)
coffee_ratings_cluster = coffee_ratings_top[top_condition]
coffee_ratings_cluster['country_of_origin'] = coffee_ratings_cluster['country_of_origin']\
.cat.remove_unused_categories()
coffee_ratings_clust = coffee_ratings_cluster.groupby("country_of_origin")\
.sample(n=len(coffee_ratings_top) // 6)
coffee_ratings_clust.shape
(292, 8)
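The two-stage cluster sample above can be sketched end to end on made-up data. Everything named `toy` is hypothetical, and the seed, the cluster count `k=2` and the per-cluster size `n=5` are arbitrary choices for the example. The `.copy()` call is an addition that is not in the slide code; it is there so the column assignment acts on an independent DataFrame.

```python
import random
import pandas as pd

# Hypothetical toy population with a categorical grouping column
toy = pd.DataFrame({
    "country_of_origin": pd.Categorical(
        ["A"] * 40 + ["B"] * 30 + ["C"] * 20 + ["D"] * 10
    ),
    "total_cup_points": range(100),
})

# Stage 1: randomly pick which clusters (countries) to keep
random.seed(2021)  # seeded only so the sketch is repeatable
clusters = random.sample(list(toy["country_of_origin"].cat.categories), k=2)
in_cluster = toy["country_of_origin"].isin(clusters)
toy_cluster = toy[in_cluster].copy()

# Drop the categories that no longer occur, so groupby sees only sampled clusters
toy_cluster["country_of_origin"] = (
    toy_cluster["country_of_origin"].cat.remove_unused_categories()
)

# Stage 2: draw the same number of rows from each selected cluster
toy_clust = (
    toy_cluster.groupby("country_of_origin", observed=True)
    .sample(n=5, random_state=2021)
)
print(clusters, toy_clust.shape)
```

Only the selected clusters appear in the result, which is why the cluster sample in the slides covers only two countries.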
coffee_ratings_top['total_cup_points'].mean()
81.94700000000002
coffee_ratings_srs['total_cup_points'].mean()
81.95982935153583
coffee_ratings_strat['total_cup_points'].mean()
81.92566552901025
coffee_ratings_clust['total_cup_points'].mean()
82.03246575342466
Population:
coffee_ratings_top.groupby("country_of_origin")\
['total_cup_points'].mean()
country_of_origin
Brazil 82.405909
Colombia 83.106557
Guatemala 81.846575
Mexico 80.890085
Taiwan 82.001333
United States (Hawaii) 81.820411
Name: total_cup_points, dtype: float64
Simple random sample:
coffee_ratings_srs.groupby("country_of_origin")\
['total_cup_points'].mean()
country_of_origin
Brazil 82.414878
Colombia 82.925536
Guatemala 82.045385
Mexico 81.100714
Taiwan 81.744333
United States (Hawaii) 82.008000
Name: total_cup_points, dtype: float64
Stratified sample:
coffee_ratings_strat.groupby("country_of_origin")\
['total_cup_points'].mean()
country_of_origin
Brazil 82.499773
Colombia 83.288197
Guatemala 81.727667
Mexico 80.994684
Taiwan 81.846800
United States (Hawaii) 81.051667
Name: total_cup_points, dtype: float64
Cluster sample:
coffee_ratings_clust.groupby("country_of_origin")\
['total_cup_points'].mean()
country_of_origin
Colombia 83.128904
Mexico 80.936027
Name: total_cup_points, dtype: float64
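The per-country means above are easier to compare when they sit in one table. One possible way to do that, shown as a rough sketch, is `pd.concat` with a dict of Series. The `toy` data and the `country_means` helper are hypothetical and are not part of the course code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the coffee dataset)
rng = np.random.default_rng(2021)
toy = pd.DataFrame({
    "country_of_origin": ["A"] * 60 + ["B"] * 30 + ["C"] * 9,
    "total_cup_points": rng.normal(82, 2, size=99),
})
toy_srs = toy.sample(frac=1/3, random_state=2021)
toy_strat = toy.groupby("country_of_origin").sample(frac=1/3, random_state=2021)

def country_means(df):
    """Mean total_cup_points per country (illustrative helper)."""
    return df.groupby("country_of_origin")["total_cup_points"].mean()

# One column per sampling method, aligned on country
comparison = pd.concat(
    {
        "population": country_means(toy),
        "srs": country_means(toy_srs),
        "stratified": country_means(toy_strat),
    },
    axis=1,
)
print(comparison)
```

A country missing from a sample shows up as `NaN` in that column. With a cluster sample, that is the case for every country that was not selected.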