Sampling in Python
James Chapman
Curriculum Manager, DataCamp

top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)
country_of_origin
Mexico                    236
Colombia                  183
Guatemala                 181
Brazil                    132
Taiwan                     75
United States (Hawaii)     73
dtype: int64
top_counted_countries = ["Mexico", "Colombia", "Guatemala", "Brazil", "Taiwan", "United States (Hawaii)"]top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)coffee_ratings_top = coffee_ratings[top_counted_subset]
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)
coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)
country_of_origin
Mexico                    0.250000
Guatemala                 0.204545
Colombia                  0.181818
Brazil                    0.181818
United States (Hawaii)    0.102273
Taiwan                    0.079545
dtype: float64
Population:
Mexico                    0.268182
Colombia                  0.207955
Guatemala                 0.205682
Brazil                    0.150000
Taiwan                    0.085227
United States (Hawaii)    0.082955
Name: country_of_origin, dtype: float64
10% simple random sample:
Mexico                    0.250000
Guatemala                 0.204545
Colombia                  0.181818
Brazil                    0.181818
United States (Hawaii)    0.102273
Taiwan                    0.079545
Name: country_of_origin, dtype: float64
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\.sample(frac=0.1, random_state=2021)
coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)
Mexico                    0.272727
Guatemala                 0.204545
Colombia                  0.204545
Brazil                    0.147727
Taiwan                    0.090909
United States (Hawaii)    0.079545
Name: country_of_origin, dtype: float64
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
    .sample(n=15, random_state=2021)
coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)
Taiwan                    0.166667
Brazil                    0.166667
United States (Hawaii)    0.166667
Guatemala                 0.166667
Mexico                    0.166667
Colombia                  0.166667
Name: country_of_origin, dtype: float64
import numpy as np
coffee_ratings_weight = coffee_ratings_top
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)
coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")
10% weighted sample:
coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)  
Brazil                    0.261364
Mexico                    0.204545
Guatemala                 0.204545
Taiwan                    0.170455
Colombia                  0.090909
United States (Hawaii)    0.068182
Name: country_of_origin, dtype: float64
Sampling in Python