Comparing sampling and bootstrap distributions

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

Coffee focused subset

coffee_sample = coffee_ratings[["variety", "country_of_origin", "flavor"]]\
    .reset_index().sample(n=500)
     index         variety       country_of_origin  flavor
132    132           Other              Costa Rica    7.58
51      51            None  United States (Hawaii)    8.17
42      42  Yellow Bourbon                  Brazil    7.92
569    569         Bourbon               Guatemala    7.67
..     ...             ...                     ...     ...
643    643          Catuai              Costa Rica    7.42
356    356         Caturra                Colombia    7.58
494    494            None               Indonesia    7.58
169    169            None                  Brazil    7.81

[500 rows x 4 columns]

The bootstrap of mean coffee flavors

import numpy as np
mean_flavors_5000 = []
for i in range(5000):
    mean_flavors_5000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
bootstrap_distn = mean_flavors_5000
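
The resampling loop above can also be written with NumPy alone. This is a minimal sketch rather than the course's code: `flavor_sample` is a synthetic stand-in for `coffee_sample['flavor']`, and `bootstrap_means` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Assumption: synthetic scores standing in for coffee_sample['flavor']
flavor_sample = rng.normal(loc=7.5, scale=0.35, size=500)


def bootstrap_means(values, n_replicates, rng):
    """Return the mean of each of n_replicates resamples drawn with replacement."""
    n = len(values)
    # Each row of idx holds the positions for one resample of size n
    idx = rng.integers(0, n, size=(n_replicates, n))
    return values[idx].mean(axis=1)


bootstrap_distn_demo = bootstrap_means(flavor_sample, 5000, rng)
print(len(bootstrap_distn_demo))
print(bootstrap_distn_demo.mean(), flavor_sample.mean())
```

Each resample has the same size as the original sample, which matches `sample(frac=1, replace=True)` in the loop.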

Mean flavor bootstrap distribution

import matplotlib.pyplot as plt
plt.hist(bootstrap_distn, bins=15)
plt.show()

A histogram of the bootstrap distribution.

Sample, bootstrap distribution, population means

Sample mean:

coffee_sample['flavor'].mean()
7.5132200000000005

Estimated population mean:

np.mean(bootstrap_distn)
7.513357731999999

True population mean:

coffee_ratings['flavor'].mean()
7.526046337817639

Interpreting the means

Bootstrap distribution mean:

  • Usually close to the sample mean
  • May not be a good estimate of the population mean

  Bootstrapping cannot correct biases from sampling
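
A small simulation can make this point concrete. The sketch below uses made-up data, not the coffee dataset: `population` is synthetic, and `biased_sample` is deliberately built from only the highest scores. Under those assumptions, the bootstrap mean should track the biased sample mean rather than the population mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumption: a synthetic population of flavor-like scores
population = rng.normal(loc=7.5, scale=0.35, size=1338)

# Deliberately biased sample: keep only the 500 highest scores
biased_sample = np.sort(population)[-500:]

# Bootstrap the mean of the biased sample
boot_means = np.array([
    rng.choice(biased_sample, size=biased_sample.size, replace=True).mean()
    for _ in range(1000)
])

print("Population mean:", population.mean())
print("Biased sample mean:", biased_sample.mean())
print("Bootstrap distribution mean:", boot_means.mean())
```

Resampling only reuses values that are already in the sample, so it has no way to recover the low scores that the biased sample left out.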

Sample sd vs. bootstrap distribution sd

Sample standard deviation:

coffee_sample['flavor'].std()
0.3540883911928703

Estimated population standard deviation?

np.std(bootstrap_distn, ddof=1)
0.015768474367958217

Sample, bootstrap dist'n, pop'n standard deviations

Sample standard deviation:

coffee_sample['flavor'].std()
0.3540883911928703

Estimated population standard deviation:

standard_error = np.std(bootstrap_distn, ddof=1)
standard_error * np.sqrt(500)
0.3525938058821761

True standard deviation:

coffee_ratings['flavor'].std(ddof=0)
0.34125481224622645

Standard error is the standard deviation of the statistic of interest

Standard error times square root of sample size estimates the population standard deviation

Interpreting the standard errors

  • Estimated standard error → standard deviation of the bootstrap distribution for a sample statistic
  • $\text{Population std. dev} \approx \text{Std. Error} \times \sqrt{\text{Sample size}}$
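
The relationship in the second bullet can be checked on simulated data. This is a rough sketch under stated assumptions: `sample` is synthetic (not the coffee data), the sample size `n` and the 5000 replicates are chosen to mirror the slides, and the result varies with the random seed.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n = 500
# Assumption: synthetic sample standing in for coffee_sample['flavor']
sample = rng.normal(loc=7.5, scale=0.35, size=n)

# 5000 bootstrap resamples of size n, drawn with replacement
idx = rng.integers(0, n, size=(5000, n))
boot_means = sample[idx].mean(axis=1)

# Standard error: standard deviation of the bootstrap distribution of the mean
standard_error = np.std(boot_means, ddof=1)

# Scale by sqrt(n) to estimate the standard deviation of individual values
estimated_pop_sd = standard_error * np.sqrt(n)
sample_sd = np.std(sample, ddof=1)

print("Standard error:", standard_error)
print("Standard error * sqrt(n):", estimated_pop_sd)
print("Sample standard deviation:", sample_sd)
```

The standard error itself is much smaller than the spread of individual values because it describes how the sample mean varies, not how single observations vary.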

Let's practice!
