Sampling and point estimates

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

Estimating the population of France

A map of France.

A census asks every household how many people live there.

There are lots of people in France

A map of France with icons of people.

Censuses are really expensive!

Sampling households

A map of France with icons of people, some of which are highlighted.

Cheaper to ask a small number of households and use statistics to estimate the population

Working with a subset of the whole population is called sampling
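
The household idea can be sketched with simulated data. This is a minimal illustration, not real census figures: the household sizes, the 100,000-household population, and the 1,000-household sample size are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: people living in each of 100,000 households
# (sizes 1 through 6, purely illustrative)
household_sizes = rng.integers(low=1, high=7, size=100_000)

# "Census": ask every household
census_total = household_sizes.sum()

# Survey: ask only 1,000 households, chosen at random without replacement
sample = rng.choice(household_sizes, size=1_000, replace=False)

# Scale the sample mean up by the number of households
estimated_total = sample.mean() * len(household_sizes)

print(census_total, round(estimated_total))
```

The estimate will not match the census total exactly, but it should land close to it while asking only 1% of the households.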

Population vs. sample

The population is the complete dataset

Doesn't have to refer to people
Typically, you don't know what the whole population is

The sample is the subset of data you calculate on

Coffee rating dataset

total_cup_points	variety	country_of_origin	aroma	flavor	aftertaste	body	balance
90.58	NA	Ethiopia	8.67	8.83	8.67	8.50	8.42
89.92	Other	Ethiopia	8.75	8.67	8.50	8.42	8.42
...	...	...	...	...	...	...	...
73.75	NA	Vietnam	6.75	6.67	6.5	6.92	6.83

Each row represents 1 coffee
1338 rows
We'll treat this as the population

Points vs. flavor: population

pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]

      total_cup_points  flavor
0                90.58    8.83
1                89.92    8.67
2                89.75    8.50
3                89.00    8.58
4                88.83    8.50
...                ...     ...
1333             78.75    7.58
1334             78.08    7.67
1335             77.17    7.33
1336             75.08    6.83
1337             73.75    6.67

[1338 rows x 2 columns]

Points vs. flavor: 10-row sample

pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)

      total_cup_points  flavor
1088             80.33    7.17
1157             79.67    7.42
1267             76.17    7.33
506              83.00    7.67
659              82.50    7.42
817              81.92    7.50
1050             80.67    7.42
685              82.42    7.50
1027             80.92    7.25
62               85.58    8.17

[10 rows x 2 columns]
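
Because .sample() picks rows at random, each call returns different rows. A sketch of making the sample repeatable with the random_state argument follows. The demo_pop DataFrame is a synthetic stand-in for the coffee data (an assumption for illustration), and the seed values are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for pts_vs_flavor_pop (values are made up)
demo_pop = pd.DataFrame({
    "total_cup_points": rng.normal(82, 3, size=1338).round(2),
    "flavor": rng.normal(7.5, 0.4, size=1338).round(2),
})

# The same random_state gives the same 10 rows on every call
samp_a = demo_pop.sample(n=10, random_state=2024)
samp_b = demo_pop.sample(n=10, random_state=2024)

print(samp_a.equals(samp_b))
```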

Python sampling for Series

Use .sample() for pandas DataFrames and Series

cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)

1088    80.33
1157    79.67
1267    76.17
...     ... 
685     82.42
1027    80.92
62      85.58
Name: total_cup_points, dtype: float64
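
As the output above suggests, a sampled Series keeps the row labels and the name of the original. A small sketch of checking this follows, using a synthetic cup_points Series as a stand-in for coffee_ratings['total_cup_points'] (an assumption for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Synthetic stand-in for coffee_ratings['total_cup_points']
cup_points = pd.Series(
    rng.normal(82, 3, size=1338).round(2), name="total_cup_points"
)

cup_points_samp = cup_points.sample(n=10, random_state=19)

# The sample keeps the original row labels and the Series name
print(cup_points_samp.index.tolist())
print(cup_points_samp.name)

# Rows are drawn without replacement by default, so no label repeats
print(cup_points_samp.index.is_unique)
```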

Population parameters & point estimates

A population parameter is a calculation made on the population dataset

import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])

82.15120328849028

A point estimate, or sample statistic, is a calculation made on the sample dataset

np.mean(cup_points_samp)

81.31800000000001
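
The population parameter is a single fixed number, while the point estimate changes from sample to sample. A minimal sketch of that difference follows, assuming a synthetic population of 1338 scores (made-up values, not the coffee data).

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Synthetic "population" of 1338 scores (illustrative values only)
population = rng.normal(loc=82, scale=3, size=1338)

# Population parameter: calculated once, on every row
pop_mean = np.mean(population)

# Point estimates: one per 10-row sample, each slightly different
point_estimates = [
    np.mean(rng.choice(population, size=10, replace=False))
    for _ in range(5)
]

print(pop_mean)
print(point_estimates)
```

Each estimate lands near the population mean, but none is expected to match it exactly.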

Point estimates with pandas

pts_vs_flavor_pop['flavor'].mean()

7.526046337817639

pts_vs_flavor_samp['flavor'].mean()

7.485000000000001
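
Calling .mean() on a whole DataFrame returns one mean per numeric column, which makes it easy to put parameters and point estimates side by side. The sketch below assumes a synthetic demo_pop DataFrame in place of the coffee data. The names demo_pop, demo_samp, and comparison are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)

# Synthetic stand-in for pts_vs_flavor_pop (values are made up)
demo_pop = pd.DataFrame({
    "total_cup_points": rng.normal(82, 3, size=1338).round(2),
    "flavor": rng.normal(7.5, 0.4, size=1338).round(2),
})
demo_samp = demo_pop.sample(n=10, random_state=3)

# One row per column: population parameter next to its point estimate
comparison = pd.DataFrame({
    "population": demo_pop.mean(),
    "sample": demo_samp.mean(),
})
print(comparison)

# np.mean() and the .mean() method agree for a Series
print(np.isclose(np.mean(demo_samp["flavor"]), demo_samp["flavor"].mean()))
```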

Let's practice!