Hypothesis testing for comparing two means via simulation

Inference for Numerical Data in R

Mine Cetinkaya-Rundel

Associate Professor of the Practice, Duke University

Motivation

Motivating question: Does a treatment using embryonic stem cells help improve heart function following a heart attack more so than traditional therapy?
Data: stem.cell data from the openintro package

library(openintro)
data(stem.cell)

   trmt   before   after 
1  ctrl    35.25   29.50
2  ctrl    36.50   29.50
3  ctrl    39.75   36.25
   ...      ...     ... 
n  esc     53.75   51.00

Analysis outline

Step 1. Calculate change for each sheep: the difference in heart pumping capacity between the after and before measurements (after minus before, so a positive change means improvement).

   trmt   before   after   change 
1  ctrl    35.25   29.50   ?
2  ctrl    36.50   29.50   ?
3  ctrl    39.75   36.25   ?
   ...      ...     ...   
n  esc     53.75   51.00   ?
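
As a rough sketch of Step 1, the change column could be added with dplyr's mutate(). This uses only the four rows printed on the slide, stored in a data frame called sheep (a hypothetical name), and assumes change is defined as after minus before.

```r
library(dplyr)

# Only the four rows shown on the slide; the full stem.cell data has more rows.
# `sheep` is a hypothetical name used for this illustration.
sheep <- data.frame(
  trmt   = c("ctrl", "ctrl", "ctrl", "esc"),
  before = c(35.25, 36.50, 39.75, 53.75),
  after  = c(29.50, 29.50, 36.25, 51.00)
)

# Assumption: change = after - before, so positive values mean improvement
sheep <- sheep %>%
  mutate(change = after - before)

sheep
```

The same mutate() call should apply unchanged to the full data frame.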

Analysis outline

Step 2. Set the hypotheses:

$H_0: \mu_{esc} = \mu_{ctrl}$; There is no difference in average change between the treatment and control groups.

$H_A: \mu_{esc} > \mu_{ctrl}$; The average change is greater in the treatment group than in the control group.

Analysis outline

Step 3. Conduct the hypothesis test.

Write the values of change on 18 index cards.
(1) Shuffle the cards and randomly split them into two equal-sized decks of 9 cards each: treatment and control.
(2) Calculate and record the test statistic: difference in average change between treatment and control.
Repeat (1) and (2) many times to build the distribution of the test statistic under the null hypothesis.
Calculate the p-value as the proportion of simulations where the test statistic is at least as extreme as the observed difference in sample means.
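
The index-card procedure above can be sketched in base R. The change values below are simulated stand-ins rather than the stem.cell measurements, and the names diff_means, sim_diffs, and n_sims are hypothetical. Shuffling the group labels is equivalent to shuffling the cards and dealing them into two decks.

```r
set.seed(1)

# Simulated stand-in values, NOT the real stem.cell data: 9 sheep per group
change <- c(rnorm(9, mean = -4, sd = 3), rnorm(9, mean = 3, sd = 3))
trmt   <- rep(c("ctrl", "esc"), each = 9)

# Test statistic: mean change in esc minus mean change in ctrl
diff_means <- function(values, groups) {
  mean(values[groups == "esc"]) - mean(values[groups == "ctrl"])
}

obs_diff <- diff_means(change, trmt)

# Steps (1) and (2), repeated: shuffle the labels, recompute the statistic
n_sims    <- 1000
sim_diffs <- replicate(n_sims, diff_means(change, sample(trmt)))

# One-sided p-value: share of shuffles at least as large as the observed value
p_value <- mean(sim_diffs >= obs_diff)
p_value
```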

Hypothesis test: generate resamples

Use the infer package to conduct the test:

library(infer)

Hypothesis test: generate resamples

Start with the data frame and specify the model:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  ...

Hypothesis test: generate resamples

Declare null hypothesis, i.e. no difference between means:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  hypothesize(null = __) %>%         # "independence" or "point"
  ...

Hypothesis test: generate resamples

Generate resamples assuming $H_0$ is true:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  hypothesize(null = __) %>%         # "independence" or "point"
  generate(reps = __, type = __) %>% # "bootstrap", "permute", or "simulate"
  ...

Hypothesis test: generate resamples

Calculate test statistic:

library(infer)

diff_ht_mean <- stem.cell %>%
  specify(__) %>%                    # y ~ x
  hypothesize(null = __) %>%         # "independence" or "point"
  generate(reps = __, type = __) %>% # "bootstrap", "permute", or "simulate"
  calculate(stat = "diff in means")  # type of statistic to calculate
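
For illustration, here is one plausible way the blanks might be filled in for a permutation test of two means. It runs on a made-up data frame, toy_sheep (a hypothetical name, with simulated values), not on stem.cell. The choice of 1000 reps and the order argument (which sets the subtraction as esc minus ctrl) are assumptions made for this sketch.

```r
library(infer)

set.seed(1)

# Made-up data, NOT the real stem.cell measurements
toy_sheep <- data.frame(
  trmt   = factor(rep(c("ctrl", "esc"), each = 9)),
  change = c(rnorm(9, mean = -4, sd = 3), rnorm(9, mean = 3, sd = 3))
)

null_dist <- toy_sheep %>%
  specify(change ~ trmt) %>%                 # response ~ explanatory
  hypothesize(null = "independence") %>%     # change does not depend on trmt
  generate(reps = 1000, type = "permute") %>% # shuffle the group labels
  calculate(stat = "diff in means", order = c("esc", "ctrl"))

head(null_dist)
```

The result has one row per resample, with the simulated difference in the stat column.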

Hypothesis test: calculate p-value

Calculate the p-value as the proportion of simulations where the simulated difference between the sample means is at least as extreme as the observed difference:

$$P ((\bar{x}_{esc,sim} - \bar{x}_{ctrl,sim}) \ge (\bar{x}_{esc,obs} - \bar{x}_{ctrl,obs}))$$
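
A minimal arithmetic sketch of this proportion, using five made-up simulated differences and a made-up observed difference (both hypothetical, chosen only to keep the numbers easy to check):

```r
# Hypothetical simulated differences (esc minus ctrl) and observed difference
sim_diffs <- c(-1.2, 0.4, 2.5, 3.1, -0.7)
obs_diff  <- 2.5

# TRUE where a simulated difference is at least as large as the observed one;
# the mean of a logical vector is the proportion of TRUE values
p_value <- mean(sim_diffs >= obs_diff)
p_value
# → 0.4 (two of the five simulated differences are >= 2.5)
```

In practice the vector would hold many simulated differences rather than five.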

Let's practice!
