Bootstrapping

Foundations of Inference in Python

Paul Savala

Assistant Professor of Mathematics

Bootstrapping

Bootstrapping = Sampling with replacement
1. Randomly choose a data point
2. Write it down
3. Put it back in the data (replacement)
4. Repeat
Bootstrapped sample = Sample generated from bootstrapping
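
The four steps above can be sketched with NumPy. The array values and the seed below are made up for illustration; `replace=True` is what makes this sampling with replacement, so the same data point can be drawn more than once.

```python
import numpy as np

# Hypothetical data, invented for this sketch
data = np.array([4, 8, 15, 16, 23, 42])

# Seed chosen arbitrarily so the sketch is repeatable
rng = np.random.default_rng(seed=42)

# Draw len(data) points one at a time, putting each one back after it is recorded
bootstrapped_sample = rng.choice(data, size=len(data), replace=True)

print(bootstrapped_sample)
```

Because every draw comes from the full data set, a bootstrapped sample usually repeats some values and omits others.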

Non-parametric confidence interval

Non-parametric analogue of stats.norm.interval
- Sample with replacement
- Compute the statistic of interest
- Record it
- Repeat
Creates an empirical distribution
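
A minimal sketch of this loop, assuming the statistic of interest is the mean and using randomly generated stand-in data (the names `values` and `boot_means` are hypothetical, not from the course):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in data: 200 skewed observations
values = rng.exponential(scale=5.0, size=200)

n_resamples = 1000
boot_means = np.empty(n_resamples)

for i in range(n_resamples):
    # Sample with replacement, same size as the original data
    resample = rng.choice(values, size=len(values), replace=True)
    # Compute the statistic and record it
    boot_means[i] = resample.mean()

# boot_means now holds the empirical distribution of the statistic
print(boot_means.mean(), boot_means.std())
```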

salaries_df['Years of Employment']

[6, 11, 14, 3, 2, ...]

sample_1 = salaries_df['Years of Employment'].sample(n=10, replace=True)


print(max(sample_1) - min(sample_1))

Repeat this process many times
Middle 95% of outcomes = 95% bootstrapped confidence interval
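
One way to sketch "repeat many times, then take the middle 95%" by hand is shown below. The `salaries_df` built here is a randomly generated stand-in, since the real data is not shown in these slides, and the 2.5th and 97.5th percentiles mark the ends of the middle 95%.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for salaries_df (the real data is not shown here)
rng = np.random.default_rng(seed=1)
salaries_df = pd.DataFrame(
    {'Years of Employment': rng.integers(0, 40, size=500)}
)
years = salaries_df['Years of Employment']

ranges = []
for i in range(1000):
    # Sample 10 rows with replacement; random_state=i only makes the run repeatable
    sample = years.sample(n=10, replace=True, random_state=i)
    ranges.append(sample.max() - sample.min())

# Middle 95% of the recorded outcomes
low, high = np.percentile(ranges, [2.5, 97.5])
print(low, high)
```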

from scipy import stats

# Statistic function: range (largest value minus smallest value)
def max_min(x):
    return max(x) - min(x)


# stats.bootstrap expects a sequence of samples,
# so the single column is wrapped in a one-element tuple
data = (salaries_df['Years of Employment'], )


bootstrap_ci = stats.bootstrap(data, max_min, 
                               vectorized=False,
                               n_resamples=1000)

print(bootstrap_ci)

BootstrapResult(confidence_interval=ConfidenceInterval(low=33.0, high=38.0),
standard_error=1.3843971812870597)
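
The printed result is an object whose fields can be read individually. The sketch below uses randomly generated stand-in data and, as an assumption that differs from the call above, passes `method='percentile'` (the "middle 95%" approach) instead of SciPy's default `'BCa'` method.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data (not the course's salaries data)
rng = np.random.default_rng(seed=2)
years = rng.exponential(scale=8.0, size=300)

def max_min(x):
    return max(x) - min(x)

result = stats.bootstrap((years,), max_min,
                         vectorized=False,
                         n_resamples=1000,
                         method='percentile',  # assumption: not the default 'BCa'
                         random_state=rng)

# Pull the interval endpoints and the standard error out of the result
ci = result.confidence_interval
print(ci.low, ci.high)
print(result.standard_error)
```

The confidence level is 95% unless a different `confidence_level` is passed.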

Normal confidence intervals

Assumes the data are (approximately) normally distributed
Computed based only on mean and standard error
Inference valid only for normal data
Very fast to compute

Bootstrap confidence intervals

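Makes no assumption about the shape of the distribution
Computed directly from the data by resampling
Inference does not rely on the data being normal
Much slower to compute (the statistic is recomputed for every resample)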
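
To make the comparison concrete, here is a sketch that computes both kinds of interval for the mean of the same skewed sample. The data is randomly generated for illustration, and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical skewed data, generated for this sketch
rng = np.random.default_rng(seed=3)
skewed = rng.exponential(scale=5.0, size=200)

# Normal interval: built only from the mean and the standard error
normal_ci = stats.norm.interval(0.95,
                                loc=np.mean(skewed),
                                scale=stats.sem(skewed))

# Bootstrap interval: built by resampling the data itself
boot = stats.bootstrap((skewed,), np.mean,
                       n_resamples=1000,
                       random_state=rng)

print(normal_ci)
print(boot.confidence_interval)
```

The normal interval is always symmetric around the sample mean, while the bootstrap interval is free to be asymmetric when the data is skewed.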

Use cases for bootstrapping

When working with non-normal data
- Ranked data
- Skewed data
When normal confidence intervals return questionable values
When we want a confidence interval for any statistic we like, not just the mean
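
As an illustration of using a statistic other than the mean, the sketch below bootstraps the interquartile range of a skewed sample. The `iqr` helper and the generated data are hypothetical. Giving the function an `axis` parameter lets `stats.bootstrap` evaluate it on many resamples at once (`vectorized=True`), which offsets some of the speed cost.

```python
import numpy as np
from scipy import stats

# Hypothetical skewed data, generated for this sketch
rng = np.random.default_rng(seed=4)
incomes = rng.lognormal(mean=10.0, sigma=0.8, size=150)

# Hypothetical helper: interquartile range along the given axis
def iqr(x, axis=-1):
    q75, q25 = np.percentile(x, [75, 25], axis=axis)
    return q75 - q25

result = stats.bootstrap((incomes,), iqr,
                         vectorized=True,
                         n_resamples=1000,
                         random_state=rng)

print(result.confidence_interval)
```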

Let's practice!
