Sampling in Python
James Chapman
Curriculum Manager, DataCamp
top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)
country_of_origin
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
dtype: int64
top_counted_countries = ["Mexico", "Colombia", "Guatemala", "Brazil", "Taiwan", "United States (Hawaii)"]
top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[top_counted_subset]
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)
coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)
country_of_origin
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
dtype: float64
Population:
Mexico 0.268182
Colombia 0.207955
Guatemala 0.205682
Brazil 0.150000
Taiwan 0.085227
United States (Hawaii) 0.082955
Name: country_of_origin, dtype: float64
10% simple random sample:
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
Name: country_of_origin, dtype: float64
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=0.1, random_state=2021)
coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)
Mexico 0.272727
Guatemala 0.204545
Colombia 0.204545
Brazil 0.147727
Taiwan 0.090909
United States (Hawaii) 0.079545
Name: country_of_origin, dtype: float64
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
.sample(n=15, random_state=2021)
coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)
Taiwan 0.166667
Brazil 0.166667
United States (Hawaii) 0.166667
Guatemala 0.166667
Mexico 0.166667
Colombia 0.166667
Name: country_of_origin, dtype: float64
import numpy as np
coffee_ratings_weight = coffee_ratings_top
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)
coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")
10% weighted sample:
coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)
Brazil 0.261364
Mexico 0.204545
Guatemala 0.204545
Taiwan 0.170455
Colombia 0.090909
United States (Hawaii) 0.068182
Name: country_of_origin, dtype: float64
Sampling in Python