Sampling in Python
James Chapman
Curriculum Manager, DataCamp
A census asks every household how many people live there.
Censuses are really expensive!
Cheaper to ask a small number of households and use statistics to estimate the population
Working with a subset of the whole population is called sampling
The population is the complete dataset
The sample is the subset of data you calculate on
total_cup_points | variety | country_of_origin | aroma | flavor | aftertaste | body | balance |
---|---|---|---|---|---|---|---|
90.58 | NA | Ethiopia | 8.67 | 8.83 | 8.67 | 8.50 | 8.42 |
89.92 | Other | Ethiopia | 8.75 | 8.67 | 8.50 | 8.42 | 8.42 |
... | ... | ... | ... | ... | ... | ... | ... |
73.75 | NA | Vietnam | 6.75 | 6.67 | 6.5 | 6.92 | 6.83 |
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]
total_cup_points flavor
0 90.58 8.83
1 89.92 8.67
2 89.75 8.50
3 89.00 8.58
4 88.83 8.50
... ... ...
1333 78.75 7.58
1334 78.08 7.67
1335 77.17 7.33
1336 75.08 6.83
1337 73.75 6.67
[1338 rows x 2 columns]
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)
total_cup_points flavor
1088 80.33 7.17
1157 79.67 7.42
1267 76.17 7.33
506 83.00 7.67
659 82.50 7.42
817 81.92 7.50
1050 80.67 7.42
685 82.42 7.50
1027 80.92 7.25
62 85.58 8.17
[10 rows x 2 columns]
.sample()
for pandas
DataFrames and Seriescup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)
1088 80.33
1157 79.67
1267 76.17
... ...
685 82.42
1027 80.92
62 85.58
Name: total_cup_points, dtype: float64
A population parameter is a calculation made on the population dataset
import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])
82.15120328849028
A point estimate or sample statistic is a calculation made on the sample dataset
np.mean(cup_points_samp)
81.31800000000001
pts_vs_flavor_pop['flavor'].mean()
7.526046337817639
pts_vs_flavor_samp['flavor'].mean()
7.485000000000001
Sampling in Python