Hypothesis testing for comparing two means via simulation

Inference for Numerical Data in R

Mine Cetinkaya-Rundel

Associate Professor of the Practice, Duke University

Motivation

  • Motivating question: Does a treatment using embryonic stem cells help improve heart function following a heart attack more so than traditional therapy?

  • Data: stem.cell data from the openintro package

library(openintro)
data(stem.cell)
   trmt   before   after 
1  ctrl    35.25   29.50
2  ctrl    36.50   29.50
3  ctrl    39.75   36.25
   ...      ...     ... 
n  esc     53.75   51.00

Analysis outline

Step 1. Calculate the change for each sheep: the difference between its heart pumping capacity before and after treatment.

   trmt   before   after   change 
1  ctrl    35.25   29.50   ?
2  ctrl    36.50   29.50   ?
3  ctrl    39.75   36.25   ?
   ...      ...     ...   
n  esc     53.75   51.00   ?
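
As a minimal sketch of Step 1, the code below rebuilds only the four rows displayed above (the full data set has more sheep) and fills in the change column. The slide does not state the direction of the subtraction; taking change as after minus before, so that positive values mean improvement, is an assumption here, and the data frame name `sheep` is only illustrative.

```r
# The four rows displayed on the slide; the remaining rows are elided there.
sheep <- data.frame(
  trmt   = c("ctrl", "ctrl", "ctrl", "esc"),
  before = c(35.25, 36.50, 39.75, 53.75),
  after  = c(29.50, 29.50, 36.25, 51.00)
)

# Assumption: change = after - before, so positive values mean improvement.
sheep$change <- sheep$after - sheep$before

print(sheep)
```

The same one-line subtraction would apply to the full stem.cell data frame once it is loaded.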

Analysis outline

Step 2. Set the hypotheses:

$H_0: \mu_{esc} = \mu_{ctrl}$; There is no difference between the average change in the treatment and control groups.

$H_A: \mu_{esc} > \mu_{ctrl}$; The average change is greater in the treatment group than in the control group.


Analysis outline

Step 3. Conduct the hypothesis test.

  • Write the values of change on 18 index cards.
  • (1) Shuffle the cards and randomly split them into two equal-sized decks of 9 cards each: treatment and control.
  • (2) Calculate and record the test statistic: the difference in average change between treatment and control.
  • Repeat (1) and (2) many times to generate the distribution of the test statistic under $H_0$ (the null distribution).
  • Calculate the p-value as the proportion of simulations where the test statistic is at least as extreme as the observed difference in sample means.
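
The card-shuffling steps (1) and (2) and their repetition can be sketched in base R as follows. The helper `shuffle_once()`, the number of repetitions (1000), and the values in `toy_change` are all hypothetical and chosen only for illustration; they are not the stem.cell measurements.

```r
# One shuffle of the "cards": returns a simulated difference in means.
# `change` holds one value per sheep; `n_trmt` is the size of the treatment deck.
shuffle_once <- function(change, n_trmt) {
  shuffled  <- sample(change)               # (1) shuffle the cards
  trmt_deck <- shuffled[seq_len(n_trmt)]    # first deck: "treatment"
  ctrl_deck <- shuffled[-seq_len(n_trmt)]   # remaining cards: "control"
  mean(trmt_deck) - mean(ctrl_deck)         # (2) test statistic
}

# Hypothetical change values for 18 sheep (NOT the real data).
toy_change <- c(-3, -1, 0, -2, 1, -4, 2, -1, 0,
                 2,  4, 1,  3, 5,  0, 2,  6, 3)

set.seed(1)  # makes the shuffles reproducible

# Repeat (1) and (2) many times to build up the simulated distribution.
sim_diffs <- replicate(1000, shuffle_once(toy_change, n_trmt = 9))

hist(sim_diffs, main = "Simulated differences under H0")
```

Because the group labels are ignored when shuffling, the simulated differences should be centered near zero, which is what the null hypothesis describes.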

Hypothesis test: generate resamples

Use the infer package to conduct the test:

library(infer)

Hypothesis test: generate resamples

Start with the data frame and specify the model:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  ...

Hypothesis test: generate resamples

Declare null hypothesis, i.e. no difference between means:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  hypothesize(null = __) %>%         # "independence" or "point"
  ...

Hypothesis test: generate resamples

Generate resamples assuming $H_0$ is true:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  hypothesize(null = __) %>%         # "independence" or "point"
  generate(reps = __, type = __) %>% # "bootstrap", "permute", or "simulate"
  ...

Hypothesis test: generate resamples

Calculate test statistic:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  hypothesize(null = __) %>%         # "independence" or "point"
  generate(reps = __, type = __) %>% # "bootstrap", "permute", or "simulate"
  calculate(stat = "diff in means")  # type of statistic to calculate
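
One plausible way to fill in the blanks for a two-group comparison of means is sketched below. It is run on a small hypothetical data frame (`toy`, not the real stem.cell data) that reuses the slide's column names `trmt` and `change`. The choice of 1000 repetitions and the `order` argument, which fixes the subtraction as esc minus ctrl, are assumptions added here and do not appear on the slide.

```r
library(infer)

# Hypothetical data with the same column names as on the slides (NOT the real data).
toy <- data.frame(
  trmt   = factor(rep(c("ctrl", "esc"), each = 9)),
  change = c(-3, -1, 0, -2, 1, -4, 2, -1, 0,
              2,  4, 1,  3, 5,  0, 2,  6, 3)
)

set.seed(1)  # makes the permutations reproducible

toy_null <- toy %>%
  specify(change ~ trmt) %>%                  # response ~ explanatory
  hypothesize(null = "independence") %>%      # change does not depend on group
  generate(reps = 1000, type = "permute") %>% # shuffle group labels
  calculate(stat = "diff in means",
            order = c("esc", "ctrl"))         # esc mean minus ctrl mean

head(toy_null)
```

The result has one row per resample, with the simulated difference in means stored in the `stat` column.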

Hypothesis test: calculate p-value

Calculate the p-value as the proportion of simulations where the simulated difference between the sample means is at least as extreme as the observed difference:

$$P ((\bar{x}_{esc,sim} - \bar{x}_{ctrl,sim}) \ge (\bar{x}_{esc,obs} - \bar{x}_{ctrl,obs}))$$
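
In code, this probability is estimated by the share of simulated differences that are at least as large as the observed one. The sketch below uses ten made-up simulated differences and a made-up observed difference purely to show the arithmetic; a real analysis would use the many differences produced by the resampling step.

```r
# Hypothetical simulated differences (esc minus ctrl) and observed difference.
sim_diffs <- c(-2.1, 0.4, 1.3, -0.7, 3.2, 0.0, -1.5, 2.8, 0.9, -3.0)
obs_diff  <- 2.5

# sim_diffs >= obs_diff is TRUE/FALSE per simulation;
# the mean of a logical vector is the proportion of TRUE values.
p_value <- mean(sim_diffs >= obs_diff)

p_value  # 2 of the 10 values (3.2 and 2.8) reach 2.5, giving 0.2
```

Only the upper tail is counted because the alternative hypothesis is one-sided ($\mu_{esc} > \mu_{ctrl}$).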


Let's practice!
