Inference for Numerical Data in R
Mine Cetinkaya-Rundel
Associate Professor of the Practice, Duke University
Motivating question: Does a treatment using embryonic stem cells help improve heart function following a heart attack more so than traditional therapy?
Data: stem.cell data from the openintro package
library(openintro)
data(stem.cell)
trmt before after
1 ctrl 35.25 29.50
2 ctrl 36.50 29.50
3 ctrl 39.75 36.25
... ... ...
n esc 53.75 51.00
Step 1. Calculate change for each sheep: the difference between the before and after heart pumping capacities.
trmt before after change
1 ctrl 35.25 29.50 ?
2 ctrl 36.50 29.50 ?
3 ctrl 39.75 36.25 ?
... ... ...
n esc 53.75 51.00 ?
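One way to fill in the change column is sketched below. It uses only the four rows printed on the slide rather than the full data set, and it assumes that change is measured as after minus before, so that an improvement in pumping capacity is positive. The data frame name sheep is a hypothetical choice for this example.

```r
library(dplyr)

# The four rows shown on the slide (not the full data set)
sheep <- data.frame(
  trmt   = c("ctrl", "ctrl", "ctrl", "esc"),
  before = c(35.25, 36.50, 39.75, 53.75),
  after  = c(29.50, 29.50, 36.25, 51.00)
)

# Assumption: change = after - before, so positive values mean improvement
sheep <- sheep %>%
  mutate(change = after - before)

print(sheep)
```

The same mutate() call should apply unchanged to the full stem.cell data frame.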
Step 2. Set the hypotheses:
$H_0: \mu_{esc} = \mu_{ctrl}$; There is no difference between average change in treatment and control groups.
$H_A: \mu_{esc} > \mu_{ctrl}$; The average change is greater in the treatment group than in the control group.
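Step 3. Conduct the hypothesis test.
Write the values of change on 18 index cards.
Shuffle the cards, split them into treatment and control, and calculate the difference in average change between treatment and control.
Use the infer package to conduct the test: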
library(infer)
Start with the data frame and specify the model:
library(infer)
diff_ht_mean <- stem.cell %>%
  specify(__) %>% # y ~ x
  ...
Declare null hypothesis, i.e. no difference between means:
library(infer)
diff_ht_mean <- stem.cell %>%
  specify(__) %>% # y ~ x
  hypothesize(null = __) %>% # "independence" or "point"
  ...
Generate resamples assuming $H_0$ is true:
library(infer)
diff_ht_mean <- stem.cell %>%
  specify(__) %>% # y ~ x
  hypothesize(null = __) %>% # "independence" or "point"
  generate(reps = __, type = __) %>% # "bootstrap", "permute", or "simulate"
  ...
Calculate test statistic:
library(infer)
diff_ht_mean <- stem.cell %>%
  specify(__) %>% # y ~ x
  hypothesize(null = __) %>% # "independence" or "point"
  generate(reps = _N_, type = __) %>% # "bootstrap", "permute", or "simulate"
  calculate(stat = "diff in means") # type of statistic to calculate
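For reference, here is one possible way to fill in the blanks, offered as a sketch rather than the official solution. It assumes change = after - before, a formula of change ~ trmt, an "independence" null with "permute" resampling, 1000 replicates, and a difference taken as esc minus ctrl. The names sheep and null_dist and the seed are hypothetical choices. Newer releases of openintro appear to name the data set stem_cell instead of stem.cell, so the sketch looks for either name.

```r
library(openintro)
library(dplyr)
library(infer)

# The slides use stem.cell; fall back to stem_cell if that name is not found
# (assumption about newer openintro releases)
sheep <- if (exists("stem.cell")) stem.cell else stem_cell

# Assumption: change = after - before
sheep <- sheep %>%
  mutate(change = after - before)

set.seed(2024) # arbitrary seed, only so the resamples are reproducible

null_dist <- sheep %>%
  specify(change ~ trmt) %>%                 # response ~ explanatory
  hypothesize(null = "independence") %>%     # no association between change and trmt
  generate(reps = 1000, type = "permute") %>% # shuffle group labels 1000 times
  calculate(stat = "diff in means", order = c("esc", "ctrl")) # esc minus ctrl

head(null_dist)
```

Passing order = c("esc", "ctrl") makes the direction of the subtraction explicit, which matters for a one-sided alternative.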
Calculate the p-value as the proportion of simulations in which the simulated difference between the sample means is at least as extreme as the observed difference:
$$P ((\bar{x}_{esc,sim} - \bar{x}_{ctrl,sim}) \ge (\bar{x}_{esc,obs} - \bar{x}_{ctrl,obs}))$$
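The proportion in the formula above can be computed directly as the mean of a logical vector. The sketch below uses made-up numbers purely to show the arithmetic; they are not results from the stem cell data, and the names sim_diffs, obs_diff and p_value are hypothetical.

```r
# Made-up simulated differences (esc minus ctrl), for illustration only
sim_diffs <- c(-4.5, -2.0, -0.5, 1.0, 3.5, 6.0, 8.5, -6.0)

# Made-up observed difference, for illustration only
obs_diff <- 6.0

# Proportion of simulated differences at least as large as the observed one
p_value <- mean(sim_diffs >= obs_diff)
print(p_value)
```

With a null distribution produced by infer, get_p_value(obs_stat = ..., direction = "greater") should report the same one-sided proportion.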