Sampling in Python
James Chapman
Curriculum Manager, DataCamp
Sampling without replacement:
Sampling with replacement ("resampling"):
Population:
Sample:
Population:
Resample:
coffee_ratings
: a sample of a larger population of all coffeescoffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()
index variety country_of_origin flavor
0 0 None Ethiopia 8.83
1 1 Other Ethiopia 8.67
2 2 Bourbon Guatemala 8.50
3 3 None Ethiopia 8.58
4 4 Other Ethiopia 8.50
... ... ... ... ...
1333 1333 None Ecuador 7.58
1334 1334 None Ecuador 7.67
1335 1335 None United States 7.33
1336 1336 None India 6.83
1337 1337 None Vietnam 6.67
[1338 rows x 4 columns]
coffee_resamp = coffee_focus.sample(frac=1, replace=True)
index variety country_of_origin flavor
1140 1140 Bourbon Guatemala 7.25
57 57 Bourbon Guatemala 8.00
1152 1152 Bourbon Mexico 7.08
621 621 Caturra Thailand 7.50
44 44 SL28 Kenya 8.08
... ... ... ... ...
996 996 Typica Mexico 7.33
1090 1090 Bourbon Guatemala 7.33
918 918 Other Guatemala 7.42
249 249 Caturra Colombia 7.67
467 467 Caturra Colombia 7.50
[1338 rows x 4 columns]
coffee_resamp["index"].value_counts()
658 5
167 4
363 4
357 4
1047 4
..
771 1
770 1
766 1
764 1
0 1
Name: index, Length: 868, dtype: int64
num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))
868
len(coffee_ratings) - num_unique_coffees
470
The opposite of sampling from a population
Sampling: going from a population to a smaller sample
Bootstrapping: building up a theoretical population from the sample
Bootstrapping use case:
The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
mean_flavors_1000.append(
np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
)
import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()
Sampling in Python