Sampling and point estimates

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

Estimating the population of France

A map of France.

A census asks every household how many people live there.

There are lots of people in France

A map of France with icons of people.

Censuses are really expensive!

Sampling households

A map of France with icons of people, some of which are highlighted.

Cheaper to ask a small number of households and use statistics to estimate the population

Working with a subset of the whole population is called sampling
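
As a rough illustration of the idea above, the sketch below simulates household sizes and estimates the total population from a small sample. Every figure in it (number of households, range of household sizes, sample size, seed) is made up for illustration and is not real census data.

```python
import numpy as np

# Hypothetical setup: every number here is an assumption, not real census data.
rng = np.random.default_rng(42)
n_households = 100_000
household_sizes = rng.integers(1, 7, size=n_households)  # 1 to 6 people each

# "Census": ask every household (expensive in real life).
true_population = household_sizes.sum()

# "Sample": ask only 1,000 households, chosen without replacement.
sampled_sizes = rng.choice(household_sizes, size=1_000, replace=False)

# Scale the sample mean up by the total number of households.
estimated_population = sampled_sizes.mean() * n_households

print(f"Census total:    {true_population}")
print(f"Sample estimate: {estimated_population:.0f}")
```

The estimate will not match the census total exactly, but it should land close to it while asking only 1% of the simulated households.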

Population vs. sample

The population is the complete dataset

  • Doesn't have to refer to people
  • Typically, you don't know what the whole population is

The sample is the subset of data you calculate on

Coffee rating dataset

total_cup_points  variety  country_of_origin  aroma  flavor  aftertaste  body  balance
           90.58  NA       Ethiopia            8.67    8.83        8.67  8.50     8.42
           89.92  Other    Ethiopia            8.75    8.67        8.50  8.42     8.42
             ...  ...      ...                  ...     ...         ...   ...      ...
           73.75  NA       Vietnam             6.75    6.67         6.5  6.92     6.83

  • Each row represents 1 coffee
  • 1338 rows
  • We'll treat this as the population

Points vs. flavor: population

pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]
      total_cup_points  flavor
0                90.58    8.83
1                89.92    8.67
2                89.75    8.50
3                89.00    8.58
4                88.83    8.50
...                ...     ...
1333             78.75    7.58
1334             78.08    7.67
1335             77.17    7.33
1336             75.08    6.83
1337             73.75    6.67

[1338 rows x 2 columns]

Points vs. flavor: 10-row sample

pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)
      total_cup_points  flavor
1088             80.33    7.17
1157             79.67    7.42
1267             76.17    7.33
506              83.00    7.67
659              82.50    7.42
817              81.92    7.50
1050             80.67    7.42
685              82.42    7.50
1027             80.92    7.25
62               85.58    8.17

[10 rows x 2 columns]
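
Because .sample() draws rows at random, the rows shown above will differ from run to run. A minimal sketch of making the draw reproducible with the random_state argument follows; the DataFrame here is a randomly generated stand-in (an assumption, since coffee_ratings itself is not loaded), and the seed values are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the population; values are randomly generated.
rng = np.random.default_rng(0)
pop = pd.DataFrame({
    "total_cup_points": rng.normal(82, 3, size=1338).round(2),
    "flavor": rng.normal(7.5, 0.4, size=1338).round(2),
})

# random_state fixes the seed, so the same 10 rows come back on every run.
samp_a = pop.sample(n=10, random_state=19)
samp_b = pop.sample(n=10, random_state=19)

print(samp_a)
print(samp_a.equals(samp_b))
```

Sampling is without replacement by default, so no row appears twice in the result.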

Python sampling for Series

  • Use .sample() for pandas DataFrames and Series
cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)
1088    80.33
1157    79.67
1267    76.17
...     ... 
685     82.42
1027    80.92
62      85.58
Name: total_cup_points, dtype: float64
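
The output above keeps the original row labels (1088, 1157, ...), which is what lets a sampled value be traced back to the population. The sketch below checks this on a randomly generated Series that stands in for coffee_ratings['total_cup_points']; the data and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series standing in for coffee_ratings['total_cup_points'].
rng = np.random.default_rng(1)
cup_points = pd.Series(
    rng.normal(82, 3, size=1338).round(2), name="total_cup_points"
)

cup_points_samp = cup_points.sample(n=10, random_state=2024)

# The result is still a Series and keeps the original row labels,
# so each sampled value can be looked up in the population by label.
print(type(cup_points_samp).__name__)
print(cup_points_samp.index.tolist())
print((cup_points.loc[cup_points_samp.index] == cup_points_samp).all())
```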

Population parameters & point estimates

A population parameter is a calculation made on the population dataset

import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])
82.15120328849028

A point estimate or sample statistic is a calculation made on the sample dataset

np.mean(cup_points_samp)
81.31800000000001
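
The population parameter is a single fixed number, while a point estimate changes with whichever rows happen to be sampled. The sketch below illustrates this by drawing five different 10-row samples from a randomly generated population; the data and the seeds are assumptions, not the coffee ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1338 scores; values are randomly generated.
rng = np.random.default_rng(7)
population = pd.Series(rng.normal(82, 3, size=1338), name="total_cup_points")

# Population parameter: computed once, on every row.
pop_mean = np.mean(population)

# Point estimates: one per sample, each computed from only 10 rows.
estimates = [
    np.mean(population.sample(n=10, random_state=seed)) for seed in range(5)
]

print(f"Population mean: {pop_mean:.2f}")
for seed, est in enumerate(estimates):
    print(f"Sample {seed}: {est:.2f} (error {est - pop_mean:+.2f})")
```

Each estimate should fall near the population mean, but none is expected to equal it exactly.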

Point estimates with pandas

pts_vs_flavor_pop['flavor'].mean()
7.526046337817639
pts_vs_flavor_samp['flavor'].mean()
7.485000000000001
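
The .mean() method also works on a whole DataFrame, returning one mean per numeric column, which makes it easy to line up parameters and point estimates side by side. A minimal sketch, assuming a randomly generated two-column population in place of pts_vs_flavor_pop:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pts_vs_flavor_pop; values are randomly generated.
rng = np.random.default_rng(3)
pop = pd.DataFrame({
    "total_cup_points": rng.normal(82, 3, size=1338),
    "flavor": rng.normal(7.5, 0.4, size=1338),
})
samp = pop.sample(n=10, random_state=3)

# DataFrame.mean() returns one mean per numeric column.
comparison = pd.DataFrame({
    "population": pop.mean(),
    "sample": samp.mean(),
})
comparison["difference"] = comparison["sample"] - comparison["population"]

print(comparison.round(3))
```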

Let's practice!
