Sampling and point estimates

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

Estimating the population of France

A map of France.

A census asks every household how many people live there.

There are lots of people in France

A map of France with icons of people.

Censuses are really expensive!

Sampling households

A map of France with icons of people, some of which are highlighted.

It is cheaper to ask a small number of households and use statistics to estimate the population

Working with a subset of the whole population is called sampling

Population vs. sample

The population is the complete dataset

  • Doesn't have to refer to people
  • Typically, you don't know what the whole population is

The sample is the subset of data you calculate on
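
The distinction can be sketched with made-up data. This is a minimal illustration, assuming a synthetic population of household sizes; the names `household_sizes` and `household_sample` and all values are hypothetical and do not come from the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical population: sizes of 10,000 households (synthetic values, 1 to 6 people)
rng = np.random.default_rng(seed=42)
household_sizes = pd.Series(rng.integers(1, 7, size=10_000), name="household_size")

# The sample is the subset of the population that we actually calculate on
household_sample = household_sizes.sample(n=100, random_state=42)

print(len(household_sizes))   # number of households in the population
print(len(household_sample))  # number of households in the sample
```

Every row of the sample comes from the population, so the sample's index labels are a subset of the population's.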

Coffee rating dataset

total_cup_points  variety  country_of_origin  aroma  flavor  aftertaste  body  balance
90.58             NA       Ethiopia           8.67   8.83    8.67        8.50  8.42
89.92             Other    Ethiopia           8.75   8.67    8.50        8.42  8.42
...               ...      ...                ...    ...     ...         ...   ...
73.75             NA       Vietnam            6.75   6.67    6.5         6.92  6.83

  • Each row represents 1 coffee
  • 1338 rows
  • We'll treat this as the population

Points vs. flavor: population

pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]
      total_cup_points  flavor
0                90.58    8.83
1                89.92    8.67
2                89.75    8.50
3                89.00    8.58
4                88.83    8.50
...                ...     ...
1333             78.75    7.58
1334             78.08    7.67
1335             77.17    7.33
1336             75.08    6.83
1337             73.75    6.67

[1338 rows x 2 columns]

Points vs. flavor: 10-row sample

pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)
      total_cup_points  flavor
1088             80.33    7.17
1157             79.67    7.42
1267             76.17    7.33
506              83.00    7.67
659              82.50    7.42
817              81.92    7.50
1050             80.67    7.42
685              82.42    7.50
1027             80.92    7.25
62               85.58    8.17

[10 rows x 2 columns]
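
Because `.sample()` draws rows at random, each call can return different rows. One way to make a sample repeatable is the `random_state` argument. The sketch below assumes a stand-in DataFrame, `pts_vs_flavor_demo`, filled with synthetic values, since `coffee_ratings` is not loaded here; the seed values are arbitrary.

```python
import numpy as np
import pandas as pd

# Stand-in for pts_vs_flavor_pop: synthetic values, NOT the real coffee data
rng = np.random.default_rng(seed=0)
pts_vs_flavor_demo = pd.DataFrame({
    "total_cup_points": rng.normal(82, 3, size=1338).round(2),
    "flavor": rng.normal(7.5, 0.4, size=1338).round(2),
})

# Passing the same random_state returns the same 10 rows on every call
samp_a = pts_vs_flavor_demo.sample(n=10, random_state=2024)
samp_b = pts_vs_flavor_demo.sample(n=10, random_state=2024)

print(samp_a.equals(samp_b))  # True: both draws picked identical rows
```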

Python sampling for Series

  • Use .sample() for pandas DataFrames and Series
cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)
1088    80.33
1157    79.67
1267    76.17
...     ... 
685     82.42
1027    80.92
62      85.58
Name: total_cup_points, dtype: float64
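
As the output above suggests, a sampled Series keeps the index labels and the name of the original. A small sketch, assuming a hypothetical Series `cup_points_demo` built from a handful of the values shown on earlier slides rather than the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for coffee_ratings['total_cup_points'] (only 8 values)
cup_points_demo = pd.Series(
    [90.58, 89.92, 89.75, 89.00, 88.83, 78.75, 78.08, 77.17],
    name="total_cup_points",
)

# By default, sampling is done without replacement, so no row is picked twice
cup_points_demo_samp = cup_points_demo.sample(n=3, random_state=1)

print(type(cup_points_demo_samp).__name__)   # the result is still a Series
print(cup_points_demo_samp.index.is_unique)  # each original row appears at most once
```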

Population parameters & point estimates

A population parameter is a calculation made on the population dataset

import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])
82.15120328849028

A point estimate or sample statistic is a calculation made on the sample dataset

np.mean(cup_points_samp)
81.31800000000001
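
The two calculations can be put side by side. This is a rough sketch on synthetic data: `cup_points_pop_demo` is a hypothetical stand-in for the population column, so the printed numbers will not match the values on this slide.

```python
import numpy as np
import pandas as pd

# Hypothetical population of cup points (synthetic, not the coffee dataset)
rng = np.random.default_rng(seed=7)
cup_points_pop_demo = pd.Series(rng.normal(82, 3, size=1338), name="total_cup_points")

# Population parameter: calculated on every row of the population
pop_mean = np.mean(cup_points_pop_demo)

# Point estimate: the same calculation on a 10-row sample
cup_points_samp_demo = cup_points_pop_demo.sample(n=10, random_state=7)
samp_mean = np.mean(cup_points_samp_demo)

print(pop_mean)
print(samp_mean)
print(abs(pop_mean - samp_mean))  # how far the estimate is from the parameter
```

The point estimate is usually close to the population parameter but rarely equal to it, because it depends on which rows were drawn.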

Point estimates with pandas

pts_vs_flavor_pop['flavor'].mean()
7.526046337817639
pts_vs_flavor_samp['flavor'].mean()
7.485000000000001
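
The pandas `.mean()` method and `np.mean()` should give the same result on a Series, so either can be used for a point estimate. A quick check, assuming a hypothetical synthetic Series `flavor_demo` in place of the real `flavor` column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flavor column (synthetic values)
rng = np.random.default_rng(seed=3)
flavor_demo = pd.Series(rng.normal(7.5, 0.4, size=1338), name="flavor")
flavor_demo_samp = flavor_demo.sample(n=10, random_state=3)

# Compare the pandas method with the NumPy function on population and sample
print(np.isclose(flavor_demo.mean(), np.mean(flavor_demo)))
print(np.isclose(flavor_demo_samp.mean(), np.mean(flavor_demo_samp)))
```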

Let's practice!
