Hypothesis Testing in R
Richie Cotton
Data Evangelist at DataCamp
stack_overflow_imbalanced %>%
  count(hobbyist, age_cat, .drop = FALSE)
hobbyist age_cat n
1 No At least 30 0
2 No Under 30 191
3 Yes At least 30 15
4 Yes Under 30 1025
A sample is imbalanced if some groups are much bigger than others.
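One way to quantify the imbalance is to total the counts per age group. This is a minimal base-R sketch that re-enters the four counts shown above by hand; the `counts` data frame and the `group_sizes` and `size_ratio` names are illustrative stand-ins, not part of the original dataset.

```r
# Counts copied from the count() output above
counts <- data.frame(
  hobbyist = c("No", "No", "Yes", "Yes"),
  age_cat  = c("At least 30", "Under 30", "At least 30", "Under 30"),
  n        = c(0, 191, 15, 1025)
)

# Total respondents in each age group
group_sizes <- sapply(split(counts$n, counts$age_cat), sum)
group_sizes

# How many times larger the biggest group is than the smallest
size_ratio <- max(group_sizes) / min(group_sizes)
size_ratio
```

Here the "Under 30" group has 1216 respondents against 15 in "At least 30", so one group is roughly 80 times the size of the other.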
$H_{0}$: The proportion of hobbyists under 30 is the same as the proportion of hobbyists at least 30.
$H_{A}$: The proportion of hobbyists under 30 is different from the proportion of hobbyists at least 30.
alpha <- 0.1
stack_overflow_imbalanced %>%
  prop_test(
    hobbyist ~ age_cat,
    order = c("At least 30", "Under 30"),
    success = "Yes",
    alternative = "two.sided",
    correct = FALSE
  )
# A tibble: 1 x 6
statistic chisq_df p_value alternative lower_ci upper_ci
<dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 2.79 1 0.0949 two.sided 0.00718 0.0217
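The statistic and p-value above can be checked by hand from the four counts. The sketch below assumes the test is the standard pooled two-proportion z-test without continuity correction (consistent with `correct = FALSE`, though the source does not show the internals of `prop_test()`); all variable names are illustrative.

```r
# Counts copied from the count() output earlier
n_yes <- c(at_least_30 = 15, under_30 = 1025)  # hobbyists per age group
n_tot <- c(at_least_30 = 15, under_30 = 1216)  # all respondents per age group

# Sample proportions and the pooled proportion under H0
p_hat    <- n_yes / n_tot
p_pooled <- sum(n_yes) / sum(n_tot)

# Pooled standard error of the difference in proportions
std_error <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n_tot))

# z-score; its square is the chi-square statistic with 1 degree of freedom
z_score    <- (p_hat[["at_least_30"]] - p_hat[["under_30"]]) / std_error
chisq_stat <- z_score^2

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(abs(z_score), lower.tail = FALSE)

alpha <- 0.1
round(c(statistic = chisq_stat, p_value = p_value), 4)
p_value <= alpha
```

Under these assumptions the hand calculation agrees with the reported statistic (2.79) and p-value (0.0949) to the precision shown, and the p-value falls just below alpha = 0.1.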
| Plot type | base-R | ggplot2 |
|---|---|---|
| Scatter plot | plot(, type = "p") | ggplot() + geom_point() |
| Line plot | plot(, type = "l") | ggplot() + geom_line() |
| Histogram | hist() | ggplot() + geom_histogram() |
| Box plot | boxplot() | ggplot() + geom_boxplot() |
| Bar plot | barplot() | ggplot() + geom_bar() |
| Pie plot | pie() | ggplot() + geom_bar() + coord_polar() |
infer package.
generate() makes simulated data.
null_distn <- dataset %>%
  specify() %>%
  hypothesize() %>%
  generate() %>%
  calculate()
obs_stat <- dataset %>%
  specify() %>%
  calculate()
get_p_value(null_distn, obs_stat)
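To make the simulation idea concrete, here is a base-R sketch of a permutation test on the same counts. It is not infer's implementation: the helper `diff_in_props()`, the 2000 replicates, the seed, and the "at least as extreme in absolute value" p-value rule are all assumptions chosen for illustration, and the data vectors are rebuilt by hand from the count() output.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Rebuild one row per respondent from the counts shown earlier
hobbyist <- rep(c("No", "Yes", "Yes"), times = c(191, 15, 1025))
age_cat  <- rep(c("Under 30", "At least 30", "Under 30"),
                times = c(191, 15, 1025))

# Test statistic: difference in hobbyist proportions between age groups
diff_in_props <- function(hobbyist, age_cat) {
  mean(hobbyist[age_cat == "At least 30"] == "Yes") -
    mean(hobbyist[age_cat == "Under 30"] == "Yes")
}

# Observed statistic on the real labels
obs_stat <- diff_in_props(hobbyist, age_cat)

# Null distribution: shuffle the age labels so any association is broken
null_distn <- replicate(2000, diff_in_props(hobbyist, sample(age_cat)))

# Share of simulated statistics at least as extreme as the observed one
p_value <- mean(abs(null_distn) >= abs(obs_stat))

obs_stat
p_value
```

Each shuffle plays the role of one simulated dataset under independence, and the p-value is read off the resulting null distribution instead of a theoretical one.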

specify() selects the variable(s) you want to test.
For 2 sample tests, use response ~ explanatory.
For 1 sample tests, use response ~ NULL.
stack_overflow_imbalanced %>%
  specify(hobbyist ~ age_cat, success = "Yes")
Response: hobbyist (factor)
Explanatory: age_cat (factor)
# A tibble: 1,231 x 2
hobbyist age_cat
<fct> <fct>
1 Yes At least 30
2 Yes At least 30
3 Yes At least 30
4 Yes Under 30
5 Yes At least 30
6 Yes At least 30
7 No Under 30
# ... with 1,224 more rows
hypothesize() declares the type of null hypothesis.
For 2 sample tests, use "independence" or "point".
For 1 sample tests, use "point".
stack_overflow_imbalanced %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  hypothesize(null = "independence")
Response: hobbyist (factor)
Explanatory: age_cat (factor)
Null Hypothesis: independence
# A tibble: 1,231 x 2
hobbyist age_cat
<fct> <fct>
1 Yes At least 30
2 Yes At least 30
3 Yes At least 30
4 Yes Under 30
5 Yes At least 30
6 Yes At least 30
7 No Under 30
# ... with 1,224 more rows