Confidence intervals

Sampling in R

Richie Cotton

Data Evangelist at DataCamp

Confidence intervals

  • "Values within one standard deviation of the mean" includes a large number of values from each of these distributions.
  • We'll define a related concept called a confidence interval.

Predicting the weather

  • Rapid City, South Dakota in the United States has the least predictable weather.
  • Your job is to predict the high temperature there tomorrow.

A map of the weather, with colors indicating how predictable regions are.

Your weather prediction

  • point estimate = 47 °F (8.3 °C)
  • range of plausible high temperature values = 40 to 54 °F (4.4 to 12.8 °C)

You just reported a confidence interval

  • 40 to 54 °F is a confidence interval
  • Sometimes written as 47 °F (40 °F, 54 °F) or 47 °F [40 °F, 54 °F]
  • ... or, 47 ± 7 °F
  • 7 °F is the margin of error
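
The notation above can be checked with a little arithmetic. The following is a minimal base R sketch using the temperatures from the slide; the variable names are illustrative and not part of the course code.

```r
# Endpoints of the reported interval, in degrees Fahrenheit (from the slide)
lower <- 40
upper <- 54

# For a symmetric interval, the point estimate is the midpoint
point_estimate <- (lower + upper) / 2

# The margin of error is half the interval width
margin_of_error <- (upper - lower) / 2

cat(point_estimate, "+/-", margin_of_error, "degrees F\n")
```

This only works as written when the interval is symmetric around the point estimate, as it is in the weather example.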

Bootstrap distribution of mean flavor

ggplot(coffee_boot_distn, aes(resample_mean)) +
  geom_histogram(binwidth = 0.002)

A histogram of mean coffee flavor.

Mean of the resamples

coffee_boot_distn %>% 
  summarize(
    mean_resample_mean = mean(resample_mean)
  )
# A tibble: 1 x 1
  mean_resample_mean
               <dbl>
1             7.5263

A histogram of mean coffee flavor with the mean indicated by a vertical blue bar.

Mean plus or minus one standard deviation

coffee_boot_distn %>% 
  summarize(
    mean_resample_mean = mean(resample_mean),
    mean_minus_1sd = mean_resample_mean - sd(resample_mean),
    mean_plus_1sd = mean_resample_mean + sd(resample_mean)
  )
# A tibble: 1 x 3
  mean_resample_mean mean_minus_1sd mean_plus_1sd
               <dbl>          <dbl>         <dbl>
1             7.5263         7.5171        7.5355

A histogram of coffee flavor means with mean and standard deviations indicated by vertical bars.

Quantile method for confidence intervals

coffee_boot_distn %>% 
  summarize(
    lower = quantile(resample_mean, 0.025),
    upper = quantile(resample_mean, 0.975)
  )
# A tibble: 1 x 2
   lower  upper
   <dbl>  <dbl>
1 7.5087 7.5447

A 95 percent confidence interval line.
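
For readers without the course data, the quantile method can be tried end to end on synthetic data. This is a rough base R sketch, not the course's code: `flavor_sample` is a made-up stand-in for the coffee flavor scores, and the sample size, seed, and number of resamples are arbitrary assumptions.

```r
set.seed(2024)

# Synthetic stand-in for a sample of flavor scores (NOT the course's coffee data)
flavor_sample <- rnorm(500, mean = 7.5, sd = 0.35)

# Bootstrap distribution: resample with replacement, take the mean each time
resample_means <- replicate(
  2000,
  mean(sample(flavor_sample, replace = TRUE))
)

# Quantile method: the middle 95% of the bootstrap distribution
ci_quantile <- quantile(resample_means, c(0.025, 0.975))
ci_quantile
```

Changing the two probabilities passed to `quantile()` gives other confidence levels, for example `c(0.05, 0.95)` for a 90 percent interval.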

Inverse cumulative distribution function

  • PDF: The bell curve
  • CDF: integrate to get area under bell curve
  • Inv. CDF: flip x and y axes
# Probabilities must lie strictly between 0 and 1 for qnorm() to return finite values
normal_inv_cdf <- tibble(
  p = seq(0.001, 0.999, 0.001),
  inv_cdf = qnorm(p)
)
ggplot(normal_inv_cdf, aes(p, inv_cdf)) +
  geom_line()

Inverse cumulative distribution function.

1 See "Introduction to Statistics in R", Ch3, "The Normal Distribution"
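
One way to see that `qnorm()` is the inverse of the normal CDF is a round trip through `pnorm()`. The sketch below uses only base R; the three probabilities are chosen for illustration.

```r
# Probabilities strictly between 0 and 1
p <- c(0.025, 0.5, 0.975)

# qnorm() is the inverse CDF: it maps probabilities to quantiles
q <- qnorm(p)

# pnorm() is the CDF: applying it to the quantiles recovers p
p_roundtrip <- pnorm(q)

all.equal(p, p_roundtrip)
round(q, 2)
```

The quantiles for 0.025 and 0.975 are roughly -1.96 and 1.96, which is where the familiar "about two standard errors" rule for 95 percent intervals comes from.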

Standard error method for confidence interval

coffee_boot_distn %>% 
  summarize(
    point_estimate = mean(resample_mean),
    std_error = sd(resample_mean),
    lower = qnorm(0.025, point_estimate, std_error),
    upper = qnorm(0.975, point_estimate, std_error)
  )
# A tibble: 1 x 4
  point_estimate std_error  lower  upper
           <dbl>     <dbl>  <dbl>  <dbl>
1         7.5263 0.0091815 7.5083 7.5443
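
The standard error method is equivalent to "point estimate plus or minus about 1.96 standard errors". The base R sketch below plugs in the point estimate and standard error printed in the output above; the variable names `z` and `margin_of_error` are illustrative.

```r
# Values reported in the slide's output
point_estimate <- 7.5263
std_error <- 0.0091815

# Standard error method via the normal inverse CDF
lower <- qnorm(0.025, mean = point_estimate, sd = std_error)
upper <- qnorm(0.975, mean = point_estimate, sd = std_error)

# Equivalent form: point estimate plus or minus z standard errors
z <- qnorm(0.975)
margin_of_error <- z * std_error

round(c(lower = lower, upper = upper), 4)
round(margin_of_error, 4)
```

Rounded to four decimal places, the bounds match the 7.5083 and 7.5443 shown in the tibble. Unlike the quantile method, this approach assumes the bootstrap distribution is approximately normal.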

Let's practice!
