Hypothesis Testing in R
Richie Cotton
Data Evangelist at DataCamp
$H_{0}$: The proportion of Stack Overflow users who are hobbyists is the same for those under thirty as for those at least thirty.
$H_{0}$: $p_{\geq30} - p_{<30} = 0$
$H_{A}$: The proportion of Stack Overflow users who are hobbyists differs between those under thirty and those at least thirty.
$H_{A}$: $p_{\geq30} - p_{<30} \neq 0$
alpha <- 0.05
$$ z = \frac{(\hat{p}_{\geq30} - \hat{p}_{<30}) - 0}{\text{SE}(\hat{p}_{\geq30} - \hat{p}_{<30})} $$
$$ \text{SE}(\hat{p}_{\geq30} - \hat{p}_{<30}) = \sqrt{\dfrac{\hat{p} \times (1 - \hat{p})}{n_{\geq30}} + \dfrac{\hat{p} \times (1 - \hat{p})}{n_{<30}}} $$
$\hat{p}$ is a pooled estimate for $p$ (common unknown proportion of successes).
$$ \hat{p} = \frac{n_{\geq30} \times \hat{p}_{\geq30} + n_{<30} \times \hat{p}_{<30}}{n_{\geq30} + n_{<30} } $$
We only need to calculate 4 numbers: $\hat{p}_{\geq30}$, $\hat{p}_{<30}$, $n_{\geq30}$, $n_{<30}$.
library(dplyr)
stack_overflow %>%
  group_by(age_cat) %>%
  summarize(
    p_hat = mean(hobbyist == "Yes"),  # sample proportion of hobbyists per age group
    n = n()                           # group size
  )
# A tibble: 2 x 3
  age_cat     p_hat     n
  <chr>       <dbl> <int>
1 At least 30 0.773  1050
2 Under 30    0.843  1216
z_score
-4.217
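The z-score can be approximately reproduced from the four numbers in the summary table. The following is a minimal sketch, not necessarily the course's own code: the variable names are illustrative, and the inputs are the rounded proportions shown in the table. Because of that rounding, the result comes out near -4.24 rather than exactly -4.217.

```r
# Four numbers taken from the (rounded) summary table
p_hat_at_least_30 <- 0.773
p_hat_under_30    <- 0.843
n_at_least_30     <- 1050
n_under_30        <- 1216

# Pooled estimate of the common proportion: a weighted mean of the two p-hats
p_hat <- (n_at_least_30 * p_hat_at_least_30 + n_under_30 * p_hat_under_30) /
  (n_at_least_30 + n_under_30)

# Standard error of the difference in proportions under H0
std_error <- sqrt(
  p_hat * (1 - p_hat) / n_at_least_30 +
    p_hat * (1 - p_hat) / n_under_30
)

# z-score: observed difference minus the hypothesized difference (0), over SE
z_score <- (p_hat_at_least_30 - p_hat_under_30 - 0) / std_error

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_score))

z_score
p_value
```

The p-value is far below `alpha`, so either version of the z-score leads to the same decision: reject $H_{0}$.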
library(infer)
stack_overflow %>%
  prop_test(
    hobbyist ~ age_cat,                    # proportions ~ categories
    order = c("At least 30", "Under 30"),  # which p-hat to subtract
    success = "Yes",                       # which response value to count proportions of
    alternative = "two-sided",             # type of alternative hypothesis
    correct = FALSE                        # should Yates' continuity correction be applied?
  )
# A tibble: 1 x 6
  statistic chisq_df   p_value alternative lower_ci upper_ci
      <dbl>    <dbl>     <dbl> <chr>          <dbl>    <dbl>
1      17.8        1 0.0000248 two.sided     0.0605    0.165
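The `statistic` column reports a chi-square statistic with one degree of freedom, not a z-score. For a two-sided test of two proportions without continuity correction, that statistic is the square of the z-score, so the two approaches agree. A short sketch of the check, using the z-score reported earlier (variable names are illustrative):

```r
alpha <- 0.05
z_score <- -4.217  # value reported earlier in the slides

# Squaring the z-score gives the chi-square statistic (about 17.8)
chisq_stat <- z_score^2

# The p-value is the same whether computed from the chi-square
# distribution with 1 degree of freedom or from the normal distribution
p_value_chisq <- pchisq(chisq_stat, df = 1, lower.tail = FALSE)
p_value_z     <- 2 * pnorm(-abs(z_score))

# Decision rule: reject H0 when the p-value is at most alpha
reject_null <- p_value_chisq <= alpha

chisq_stat
p_value_chisq
reject_null
```

Since the p-value is well below `alpha`, there is evidence that the hobbyist proportion differs between the two age groups.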