Confidence intervals

Sampling in Python

James Chapman

Curriculum Manager, DataCamp

Confidence intervals

The range "within one standard deviation of the mean" covers a large share of the values in each of these distributions
We'll define a related concept called a confidence interval

Predicting the weather

Rapid City, South Dakota has the least predictable weather in the United States
Our job is to predict the high temperature there tomorrow

A map of the weather, with colors indicating how predictable regions are.

Our weather prediction

Point estimate = 47°F (8.3°C)
Range of plausible high temperature values = 40 to 54°F (4.4 to 12.8°C)

We just reported a confidence interval!

40 to 54°F is a confidence interval
Sometimes written as 47°F (40°F, 54°F) or 47°F [40°F, 54°F]
... or, 47 ± 7°F
7°F is the margin of error
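
The relationship between the point estimate, the margin of error, and the interval endpoints is simple arithmetic. A minimal sketch using the forecast numbers above (the variable names are illustrative, not from the course):

```python
# Forecast values from the slide, in degrees Fahrenheit.
point_estimate = 47.0
margin_of_error = 7.0

# Point estimate plus or minus the margin of error gives the interval.
lower = point_estimate - margin_of_error
upper = point_estimate + margin_of_error
print((lower, upper))  # (40.0, 54.0)

# Going the other way: recover the estimate and margin from the endpoints.
midpoint = (lower + upper) / 2
half_width = (upper - lower) / 2
```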

Bootstrap distribution of mean flavor

import matplotlib.pyplot as plt
plt.hist(coffee_boot_distn, bins=15)
plt.show()

A histogram of mean coffee flavor.
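
The slide assumes coffee_boot_distn already exists. As a rough sketch of how such a bootstrap distribution of the mean could be built, the block below resamples with replacement and records each resample's mean. The flavor scores here are simulated stand-ins (an assumption for illustration), not the course's coffee dataset, so the numbers will not match the slides.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical stand-in for a sample of coffee flavor scores (not the course data).
flavor_sample = rng.normal(loc=7.5, scale=0.35, size=500)

# Resample with replacement, same size as the sample, and keep each mean.
coffee_boot_distn = [
    np.mean(rng.choice(flavor_sample, size=len(flavor_sample), replace=True))
    for _ in range(1000)
]

# The bootstrap means cluster tightly around the sample mean.
print(np.mean(flavor_sample), np.mean(coffee_boot_distn))
```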

Mean of the resamples

import numpy as np
np.mean(coffee_boot_distn)

7.513452892

A histogram of mean coffee flavor with the mean indicated by a vertical black bar.

Mean plus or minus one standard deviation

np.mean(coffee_boot_distn)

7.513452892

np.mean(coffee_boot_distn) - np.std(coffee_boot_distn, ddof=1)

7.497385709174466

np.mean(coffee_boot_distn) + np.std(coffee_boot_distn, ddof=1)

7.529520074825534

A histogram of coffee flavor means with mean and standard deviations indicated by vertical bars.

Quantile method for confidence intervals

np.quantile(coffee_boot_distn, 0.025)

7.4817195

np.quantile(coffee_boot_distn, 0.975)

7.5448805

A 95 percent confidence interval line.
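
The two quantile calls above can be wrapped in a small helper that works for any confidence level. This is a sketch: quantile_ci is a hypothetical helper name, and the input is simulated data standing in for a bootstrap distribution.

```python
import numpy as np

def quantile_ci(boot_distn, confidence=0.95):
    """Return (lower, upper) bounds using the quantile method."""
    alpha = 1 - confidence
    lower = np.quantile(boot_distn, alpha / 2)       # 0.025 for a 95% interval
    upper = np.quantile(boot_distn, 1 - alpha / 2)   # 0.975 for a 95% interval
    return lower, upper

# Simulated values for illustration only, not the course's bootstrap distribution.
rng = np.random.default_rng(seed=0)
simulated_boot_distn = rng.normal(loc=7.51, scale=0.016, size=5000)

lower, upper = quantile_ci(simulated_boot_distn)

# By construction, about 95% of the values fall between the two bounds.
coverage = np.mean((simulated_boot_distn >= lower) & (simulated_boot_distn <= upper))
print((lower, upper), coverage)
```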

Inverse cumulative distribution function

PDF: The bell curve
CDF: integrate the PDF to get the area under the bell curve
Inverse CDF: flip the x and y axes of the CDF

Implemented in Python with

from scipy.stats import norm
norm.ppf(quantile, loc=0, scale=1)

Inverse cumulative distribution function.
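
A quick way to see that norm.ppf is the inverse of norm.cdf is to round-trip a probability through both. The sketch below uses the standard normal (loc=0, scale=1, the defaults); the variable names are illustrative.

```python
from scipy.stats import norm

# ppf ("percent point function") is SciPy's name for the inverse CDF.
z_upper = norm.ppf(0.975)   # roughly 1.96 for the standard normal

# Feeding the result back into the CDF recovers the original probability.
p = norm.cdf(z_upper)

# The standard normal is symmetric, so the 0.025 quantile mirrors the 0.975 one.
z_lower = norm.ppf(0.025)

print(z_lower, z_upper, p)
```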

Standard error method for confidence interval

point_estimate = np.mean(coffee_boot_distn)

7.513452892

std_error = np.std(coffee_boot_distn, ddof=1)

0.016067182825533724

from scipy.stats import norm
lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print((lower, upper))

(7.481961792328933, 7.544943991671067)
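
The same interval can be read as "point estimate ± margin of error", which ties the standard error method back to the weather example. The sketch below reuses the point estimate and standard error printed above and computes the margin as a standard normal quantile times the standard error; the names z and margin_of_error are illustrative.

```python
from scipy.stats import norm

# Values printed on the slide above.
point_estimate = 7.513452892
std_error = 0.016067182825533724

# For a 95% interval, the multiplier is the 0.975 quantile of the standard normal.
z = norm.ppf(0.975)
margin_of_error = z * std_error

lower = point_estimate - margin_of_error
upper = point_estimate + margin_of_error
print((lower, upper))

# Equivalent to calling norm.ppf with loc and scale, as on the slide.
lower_direct = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper_direct = norm.ppf(0.975, loc=point_estimate, scale=std_error)
```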

Let's practice!