Introduction to bootstrapping

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

With or without

Sampling without replacement:

Playing cards on a casino table.

Sampling with replacement ("resampling"):

Four rolling dice.
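
The two pictures can be mimicked with Python's standard library. This is only an illustrative sketch: the deck encoding, the seed, and the variable names are assumptions made for the example, not part of the course data.

```python
import random

rng = random.Random(42)  # fixed seed (arbitrary) so the sketch is repeatable

# Without replacement: dealing cards. A dealt card cannot be dealt again.
deck = [f"{rank}{suit}" for suit in "SHDC" for rank in "A23456789TJQK"]
hand = rng.sample(deck, k=5)

# With replacement: rolling dice. Every roll can land on any face again.
faces = [1, 2, 3, 4, 5, 6]
rolls = rng.choices(faces, k=4)

print(hand)   # five distinct cards
print(rolls)  # four faces, repeats allowed
```

random.sample() never returns the same deck position twice, while random.choices() draws each value independently, so repeats are possible.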

Simple random sampling without replacement

Population:

Coffee beans arranged in rows and columns.

Sample:

Coffee beans arranged in rows and columns, most of which are grayed out.

Simple random sampling with replacement

Population:

Coffee beans arranged in rows and columns.

Resample:

A random sample of coffee beans, some of which are duplicates.

Why sample with replacement?

  • coffee_ratings: a sample of a larger population of all coffees
  • Each coffee in our sample represents many different hypothetical population coffees
  • Sampling with replacement is a proxy for drawing new samples from that population

Coffee data preparation

coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()
      index  variety country_of_origin  flavor
0         0     None          Ethiopia    8.83
1         1    Other          Ethiopia    8.67
2         2  Bourbon         Guatemala    8.50
3         3     None          Ethiopia    8.58
4         4    Other          Ethiopia    8.50
...     ...      ...               ...     ...
1333   1333     None           Ecuador    7.58
1334   1334     None           Ecuador    7.67
1335   1335     None     United States    7.33
1336   1336     None             India    6.83
1337   1337     None           Vietnam    6.67

[1338 rows x 4 columns]

Resampling with .sample()

coffee_resamp = coffee_focus.sample(frac=1, replace=True)
      index  variety country_of_origin  flavor
1140   1140  Bourbon         Guatemala    7.25
57       57  Bourbon         Guatemala    8.00
1152   1152  Bourbon            Mexico    7.08
621     621  Caturra          Thailand    7.50
44       44     SL28             Kenya    8.08
...     ...      ...               ...     ...
996     996   Typica            Mexico    7.33
1090   1090  Bourbon         Guatemala    7.33
918     918    Other         Guatemala    7.42
249     249  Caturra          Colombia    7.67
467     467  Caturra          Colombia    7.50

[1338 rows x 4 columns]
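
To see what frac=1 and replace=True do on something small enough to read in full, here is a hedged sketch on a five-row toy frame. The name toy, the reuse of the first five flavor scores shown earlier, and the random_state value are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for coffee_focus: five rows with an explicit "index" column
toy = pd.DataFrame({"flavor": [8.83, 8.67, 8.50, 8.58, 8.50]}).reset_index()

# frac=1 keeps the resample the same size as the original;
# replace=True lets the same row be drawn more than once.
# random_state is set only to make the sketch repeatable.
toy_resamp = toy.sample(frac=1, replace=True, random_state=2024)

print(toy_resamp)
print(len(toy_resamp) == len(toy))  # same number of rows as the original
```

Every row of the resample comes from the original frame, but some original rows may appear several times and others not at all.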

Repeated coffees

coffee_resamp["index"].value_counts()
658     5
167     4
363     4
357     4
1047    4
       ..
771     1
770     1
766     1
764     1
0       1
Name: index, Length: 868, dtype: int64

Missing coffees

num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))
868
len(coffee_ratings) - num_unique_coffees
470
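
A count of this size is expected rather than surprising. Assuming each of the n draws is independent and uniform over the n rows, a given row is missed by every draw with probability (1 - 1/n)^n, which is close to 1/e for large n. The short calculation below is a sketch of that arithmetic for n = 1338; the variable names are made up for the example.

```python
n = 1338  # number of rows in the original sample

# Probability that one particular row is never picked in n draws with replacement
p_missing = (1 - 1 / n) ** n

expected_missing = n * p_missing        # roughly 492 rows
expected_unique = n - expected_missing  # roughly 846 rows

print(round(p_missing, 3))
print(round(expected_missing), round(expected_unique))
```

So a little over a third of the coffees are expected to be left out of any single resample, which is in the same range as the 470 seen above.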

Bootstrapping

The opposite of sampling from a population

Sampling: going from a population to a smaller sample

Bootstrapping: building up a theoretical population from the sample

Bootstrapping use case:

  • Develop understanding of sampling variability using a single sample

A cowboy boot.

Bootstrapping process

  1. Make a resample of the same size as the original sample
  2. Calculate the statistic of interest for this bootstrap sample
  3. Repeat steps 1 and 2 many times

The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
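
The three steps can be sketched as one small helper. This is a minimal illustration rather than the course's own code: the function name bootstrap_statistics, its parameters, and the ten toy flavor scores (copied from the rows displayed earlier) are assumptions.

```python
import numpy as np

def bootstrap_statistics(values, statistic=np.mean, n_resamples=1000, seed=None):
    """Return one statistic per bootstrap resample of `values` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    stats = []
    for _ in range(n_resamples):                 # step 3: repeat many times
        # step 1: resample with replacement, same size as the original sample
        resample = rng.choice(values, size=len(values), replace=True)
        # step 2: calculate the statistic of interest on this resample
        stats.append(statistic(resample))
    return np.array(stats)

# Toy sample: ten flavor scores taken from the rows displayed earlier
flavors = np.array([8.83, 8.67, 8.50, 8.58, 8.50, 7.58, 7.67, 7.33, 6.83, 6.67])
boot_means = bootstrap_statistics(flavors, n_resamples=1000, seed=0)

print(len(boot_means))    # one bootstrap statistic per resample
print(boot_means.mean())  # should sit near the original sample mean
```

Passing a different function as statistic (for example np.median) gives the bootstrap distribution of that statistic instead.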

Bootstrapping coffee mean flavor

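import numpy as np

# Collect the mean flavor of 1000 bootstrap resamples
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        # Resample all rows with replacement, then take the mean flavor
        np.mean(coffee_focus.sample(frac=1, replace=True)['flavor'])
    )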

Bootstrap distribution histogram

import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()

Bootstrap distribution of the mean flavor

Let's practice!
