Stratified and weighted random sampling

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

Coffees by country

Coffee beans grouped by country of origin.

top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)

country_of_origin
Mexico                    236
Colombia                  183
Guatemala                 181
Brazil                    132
Taiwan                     75
United States (Hawaii)     73
dtype: int64

¹ The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.

Filtering for 6 countries

top_counted_countries = ["Mexico", "Colombia", "Guatemala",
  "Brazil", "Taiwan", "United States (Hawaii)"]


top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)


coffee_ratings_top = coffee_ratings[top_counted_subset]

Counts of a simple random sample

coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)

coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)

country_of_origin
Mexico                    0.250000
Guatemala                 0.204545
Colombia                  0.181818
Brazil                    0.181818
United States (Hawaii)    0.102273
Taiwan                    0.079545
dtype: float64

Comparing proportions

Population:

Mexico                    0.268182
Colombia                  0.207955
Guatemala                 0.205682
Brazil                    0.150000
Taiwan                    0.085227
United States (Hawaii)    0.082955
Name: country_of_origin, dtype: float64

10% simple random sample:

Mexico                    0.250000
Guatemala                 0.204545
Colombia                  0.181818
Brazil                    0.181818
United States (Hawaii)    0.102273
Taiwan                    0.079545
Name: country_of_origin, dtype: float64

Proportional stratified sampling

coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\

    .sample(frac=0.1, random_state=2021)

coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)

Mexico                    0.272727
Guatemala                 0.204545
Colombia                  0.204545
Brazil                    0.147727
Taiwan                    0.090909
United States (Hawaii)    0.079545
Name: country_of_origin, dtype: float64

Equal counts stratified sampling

coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
    .sample(n=15, random_state=2021)

coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)

Taiwan                    0.166667
Brazil                    0.166667
United States (Hawaii)    0.166667
Guatemala                 0.166667
Mexico                    0.166667
Colombia                  0.166667
Name: country_of_origin, dtype: float64

Weighted random sampling

Specify weights to adjust the relative probability of a row being sampled

import numpy as np
coffee_ratings_weight = coffee_ratings_top
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"

coffee_ratings_weight['weight'] = np.where(condition, 2, 1)

coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")

Weighted random sampling results

10% weighted sample:

coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)

Brazil                    0.261364
Mexico                    0.204545
Guatemala                 0.204545
Taiwan                    0.170455
Colombia                  0.090909
United States (Hawaii)    0.068182
Name: country_of_origin, dtype: float64

Let's practice!

Sampling in Python