Introduction to bootstrapping

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

With or without

Sampling without replacement:

Playing cards on a casino table.

Sampling with replacement ("resampling"):

Four rolling dice.

Simple random sampling without replacement

Population:

Coffee beans arranged in rows and columns.

Sample:

Coffee beans arranged in rows and columns, most of which are grayed out.

Simple random sampling with replacement

Population:

Coffee beans arranged in rows and columns.

Resample:

A random sample of coffee beans, some of which are duplicates.
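
The difference between the two schemes can be sketched on a tiny table. This is a minimal illustration, assuming a hypothetical six-row DataFrame named `beans` (not part of the course data); it relies only on the `replace` argument of pandas' `.sample()`.

```python
import pandas as pd

# Hypothetical toy population of six beans (not the course data)
beans = pd.DataFrame({"bean_id": range(6)})

# Without replacement: each row can be drawn at most once
without = beans.sample(n=6, replace=False, random_state=0)

# With replacement: the same row may be drawn several times
with_repl = beans.sample(n=6, replace=True, random_state=0)

print(sorted(without["bean_id"]))    # every bean exactly once
print(sorted(with_repl["bean_id"]))  # may contain repeats and omissions
```

Drawing all six rows without replacement only shuffles the table, while drawing six rows with replacement typically repeats some beans and leaves others out.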

Why sample with replacement?

coffee_ratings: a sample of a larger population of all coffees
Each coffee in our sample represents many different hypothetical population coffees
Sampling with replacement is a proxy for drawing different coffees from that population

Coffee data preparation

coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()

      index  variety country_of_origin  flavor
0         0     None          Ethiopia    8.83
1         1    Other          Ethiopia    8.67
2         2  Bourbon         Guatemala    8.50
3         3     None          Ethiopia    8.58
4         4    Other          Ethiopia    8.50
...     ...      ...               ...     ...
1333   1333     None           Ecuador    7.58
1334   1334     None           Ecuador    7.67
1335   1335     None     United States    7.33
1336   1336     None             India    6.83
1337   1337     None           Vietnam    6.67

[1338 rows x 4 columns]
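
The `index` column in the output comes from `.reset_index()`, which copies the row labels into a regular column so that each coffee can still be identified after resampling. A minimal sketch, assuming a hypothetical three-row table named `toy` in place of `coffee_ratings`:

```python
import pandas as pd

# Hypothetical stand-in for coffee_ratings (values are illustrative only)
toy = pd.DataFrame({
    "variety": ["Other", "Bourbon", "Other"],
    "flavor": [8.67, 8.50, 8.50],
})

# Select the columns of interest, then move the row labels into an "index" column
toy_focus = toy[["variety", "flavor"]].reset_index()

print(toy_focus.columns.tolist())
```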

Resampling with .sample()

coffee_resamp = coffee_focus.sample(frac=1, replace=True)

      index  variety country_of_origin  flavor
1140   1140  Bourbon         Guatemala    7.25
57       57  Bourbon         Guatemala    8.00
1152   1152  Bourbon            Mexico    7.08
621     621  Caturra          Thailand    7.50
44       44     SL28             Kenya    8.08
...     ...      ...               ...     ...
996     996   Typica            Mexico    7.33
1090   1090  Bourbon         Guatemala    7.33
918     918    Other         Guatemala    7.42
249     249  Caturra          Colombia    7.67
467     467  Caturra          Colombia    7.50

[1338 rows x 4 columns]

Repeated coffees

coffee_resamp["index"].value_counts()

658     5
167     4
363     4
357     4
1047    4
       ..
771     1
770     1
766     1
764     1
0       1
Name: index, Length: 868, dtype: int64

Missing coffees

num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))

len(coffee_ratings) - num_unique_coffees
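
The number of missing coffees can be checked by hand. The `value_counts()` output above reports `Length: 868`, so 868 distinct coffees were drawn out of 1338. The sketch below also compares this with what probability suggests: a given row is skipped in all n draws with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) for large n. The variable names are illustrative assumptions, not part of the course code.

```python
n = 1338             # rows in coffee_focus, from the output above
distinct_drawn = 868  # "Length: 868" in the value_counts() output

# Coffees that never appear in this particular resample
observed_missing = n - distinct_drawn

# Chance that one given row is never picked in n draws with replacement
p_missing = (1 - 1 / n) ** n
expected_missing = n * p_missing

print(observed_missing)
print(round(p_missing, 3), round(expected_missing))
```

The observed count varies from one resample to the next, so it will generally be near the expected value rather than equal to it.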

Bootstrapping

The opposite of sampling from a population

Sampling: going from a population to a smaller sample

Bootstrapping: building up a theoretical population from the sample

Bootstrapping use case:

Develop understanding of sampling variability using a single sample

A cowboy boot.

Bootstrapping process

1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample
3. Repeat steps 1 and 2 many times

The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
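
The three steps above can be sketched as a small helper. This is one possible arrangement under stated assumptions, not the course's own implementation: the function name `bootstrap_statistic`, its parameters, and the synthetic `toy_sample` data are all hypothetical.

```python
import numpy as np
import pandas as pd

def bootstrap_statistic(sample, column, statistic=np.mean, n_resamples=1000, seed=None):
    """Return an array of bootstrap statistics for one column of a DataFrame."""
    rng = np.random.RandomState(seed)
    stats = []
    for _ in range(n_resamples):  # step 3: repeat many times
        # Step 1: resample with replacement, same size as the original sample
        resample = sample.sample(frac=1, replace=True, random_state=rng)
        # Step 2: calculate the statistic of interest on this resample
        stats.append(statistic(resample[column]))
    return np.array(stats)

# Hypothetical sample: 200 synthetic flavor scores (not the coffee data)
toy_sample = pd.DataFrame(
    {"flavor": np.random.RandomState(0).normal(7.5, 0.4, size=200)}
)
boot_means = bootstrap_statistic(toy_sample, "flavor", n_resamples=500, seed=1)
print(len(boot_means))
```

Passing a different `statistic`, such as `np.median`, would give the bootstrap distribution of that statistic instead.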

Bootstrapping coffee mean flavor

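import numpy as np

# Collect the mean flavor of 1000 bootstrap resamples
mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        # Resample the prepared data (coffee_focus) with replacement, keeping the original size
        np.mean(coffee_focus.sample(frac=1, replace=True)['flavor'])
    )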

Bootstrap distribution histogram

import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()

Bootstrap distribution of the mean flavor
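
Beyond plotting it, a bootstrap distribution can be summarized numerically: its center should sit close to the sample mean, and its spread is a common estimate of how much the mean varies from sample to sample. The sketch below assumes synthetic flavor scores (`toy_flavors`, hypothetical) and uses NumPy index arrays instead of `.sample()` to draw all resamples at once.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of 200 flavor scores (not the coffee data)
toy_flavors = rng.normal(7.5, 0.4, size=200)

# Each row of idx holds the positions drawn for one resample (with replacement)
idx = rng.integers(0, len(toy_flavors), size=(1000, len(toy_flavors)))
boot_means = toy_flavors[idx].mean(axis=1)

# Center of the bootstrap distribution next to the original sample mean
print(boot_means.mean(), toy_flavors.mean())
# Spread of the bootstrap distribution
print(boot_means.std(ddof=1))
```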

Let's practice!
