Hypothesis Testing in R
Richie Cotton
Data Evangelist at DataCamp
$p$: population proportion (unknown population parameter)
$\hat{p}$: sample proportion (sample statistic)
$p_{0}$: hypothesized population proportion
$$ z = \frac{\hat{p} - \text{mean}(\hat{p})}{\text{standard error}(\hat{p})} = \frac{\hat{p} - p}{\text{standard error}(\hat{p})} $$
Assuming $H_{0}$ is true, $p = p_{0}$, so
$$ z = \dfrac{\hat{p} - p_{0}}{\text{standard error}(\hat{p})} $$
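For a difference in means, the standard error is estimated from the sample standard deviations:
$SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) \approx \sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}$
For a proportion, the standard error under $H_{0}$ depends only on $p_{0}$ and $n$:
$SE_{\hat{p}} = \sqrt{\dfrac{p_{0}(1-p_{0})}{n}}$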
Assuming $H_{0}$ is true,
$z = \dfrac{\hat{p} - p_{0}}{\sqrt{\dfrac{p_{0}(1-p_{0})}{n}}}$
This only uses sample information ($\hat{p}$ and $n$) and the hypothesized parameter ($p_{0}$).
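The standard error formula can be checked by simulation. The sketch below is illustrative only: the seed, sample size, and number of simulations are arbitrary assumptions, not values from the course data. It draws many samples under $H_{0}$ and compares the standard deviation of the simulated sample proportions with $\sqrt{p_{0}(1-p_{0})/n}$.

```r
# Simulation sketch: check SE(p_hat) = sqrt(p_0 * (1 - p_0) / n) under H0.
# All values below are illustrative assumptions, not taken from the dataset.
set.seed(42)          # arbitrary seed, for reproducibility only
p_0 <- 0.5            # hypothesized proportion
n <- 2000             # size of each simulated sample
n_sims <- 10000       # number of simulated samples

# Each draw is the number of "successes" in a sample of size n when p = p_0
successes <- rbinom(n_sims, size = n, prob = p_0)
p_hats <- successes / n

# Spread of the simulated sample proportions vs. the closed-form value
se_simulated <- sd(p_hats)
se_formula <- sqrt(p_0 * (1 - p_0) / n)

c(simulated = se_simulated, formula = se_formula)
```

The two numbers should agree closely, and the agreement tightens as `n_sims` grows.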
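By contrast, the t statistic for a difference in means estimates its standard error from the sample standard deviations, which adds uncertainty and is why it follows a t-distribution:
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$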
$H_{0}$: The proportion of Stack Overflow users under thirty is equal to 0.5.
$H_{A}$: The proportion of Stack Overflow users under thirty is not equal to 0.5.
alpha <- 0.01
stack_overflow %>%
  count(age_cat)
# A tibble: 2 x 2
  age_cat         n
  <chr>       <int>
1 At least 30  1050
2 Under 30     1216
p_hat <- stack_overflow %>%
  summarize(prop_under_30 = mean(age_cat == "Under 30")) %>%
  pull(prop_under_30)
0.5366
p_0 <- 0.50
n <- nrow(stack_overflow)
2266
$z = \dfrac{\hat{p} - p_{0}}{\sqrt{\dfrac{p_{0}(1-p_{0})}{n}}}$
numerator <- p_hat - p_0
denominator <- sqrt(p_0 * (1 - p_0) / n)
z_score <- numerator / denominator
3.487
Left-tailed ("less than")
p_value <- pnorm(z_score)
Right-tailed ("greater than")
p_value <- pnorm(z_score, lower.tail = FALSE)
Two-tailed ("not equal")
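# Sum of both tails, measured from |z|
p_value <- pnorm(-abs(z_score)) +
  pnorm(abs(z_score), lower.tail = FALSE)
# Equivalent, because the normal distribution is symmetric
p_value <- 2 * pnorm(abs(z_score), lower.tail = FALSE)
0.000488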
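A two-tailed p-value is the probability of a z-score at least as extreme as the observed one in either direction, so it is computed from the absolute value of z. The three cases can be wrapped in one function. This is a minimal sketch: `z_to_p_value()` is a hypothetical helper written for illustration, not a function from base R or from the course.

```r
# Hypothetical helper: p-value from a z-score for each alternative hypothesis
z_to_p_value <- function(z, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  switch(
    alternative,
    less = pnorm(z),                                   # left tail
    greater = pnorm(z, lower.tail = FALSE),            # right tail
    two.sided = 2 * pnorm(abs(z), lower.tail = FALSE)  # both tails
  )
}

z_to_p_value(3.487, "two.sided")
z_to_p_value(-3.487, "two.sided")  # same value: the sign of z does not matter
z_to_p_value(3.487, "greater")     # half of the two-sided value
```

Taking `abs(z)` before doubling the upper tail keeps the two-sided result between 0 and 1 whether the z-score is positive or negative.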
p_value <= alpha
TRUE
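As a cross-check, the manual calculation can be compared with base R's `prop.test()`. The sketch below is self-contained and rebuilds the inputs from the counts shown in the `count(age_cat)` output (1216 under thirty out of 2266); the variable names `n_under_30`, `n_total`, and `result` are introduced here for illustration. With `correct = FALSE` the continuity correction is switched off, so the reported X-squared statistic should equal the squared z-score.

```r
# Counts taken from the count(age_cat) output
n_under_30 <- 1216
n_total <- 1216 + 1050

# Manual z-score, following the same steps as the slides
p_hat <- n_under_30 / n_total
p_0 <- 0.5
z_score <- (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n_total)

# Base R one-sample proportion test without continuity correction
result <- prop.test(n_under_30, n_total, p = p_0,
                    alternative = "two.sided", correct = FALSE)

z_score^2                  # squared z-score
unname(result$statistic)   # X-squared reported by prop.test()
result$p.value             # two-sided p-value
result$p.value <= 0.01     # same decision rule as p_value <= alpha
```

`prop.test()` reports a chi-squared statistic rather than a z-score, which is why the comparison is made against `z_score^2`.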