Hypothesis Testing in R
Richie Cotton
Data Evangelist at DataCamp
stack_overflow_imbalanced %>%
  count(hobbyist, age_cat, .drop = FALSE)
  hobbyist age_cat         n
1 No       At least 30     0
2 No       Under 30      191
3 Yes      At least 30    15
4 Yes      Under 30     1025
A sample is imbalanced if some groups are much bigger than others.
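To make the imbalance concrete, the sketch below rebuilds a sample from the four counts printed above and computes each age group's share of the rows. The rebuilt tibble is an assumption standing in for the course dataset (only the counts are taken from the slide), and the name group_sizes is introduced here for illustration.

```r
library(dplyr)

# Hypothetical reconstruction from the printed counts, not the original data
stack_overflow_imbalanced <- tibble(
  hobbyist = factor(rep(c("No", "Yes", "Yes"), times = c(191, 15, 1025))),
  age_cat  = factor(rep(c("Under 30", "At least 30", "Under 30"),
                        times = c(191, 15, 1025)))
)

# Size of each age group and its share of the whole sample
group_sizes <- stack_overflow_imbalanced %>%
  count(age_cat) %>%
  mutate(share = n / sum(n))

group_sizes
```

With these counts, the "At least 30" group holds 15 of 1,231 rows, so almost all of the sample sits in one group.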
$H_{0}$: The proportion of hobbyists is the same among those under 30 as among those at least 30.
$H_{A}$: The proportion of hobbyists among those under 30 is different from the proportion among those at least 30.
alpha <- 0.1
stack_overflow_imbalanced %>%
  prop_test(
    hobbyist ~ age_cat,
    order = c("At least 30", "Under 30"),
    success = "Yes",
    alternative = "two.sided",
    correct = FALSE
  )
# A tibble: 1 x 6
  statistic chisq_df p_value alternative lower_ci upper_ci
      <dbl>    <dbl>   <dbl> <chr>          <dbl>    <dbl>
1      2.79        1  0.0949 two.sided    0.00718   0.0217
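As a rough cross-check, the same test statistic and p-value can be recovered with base R's prop.test() using only the counts from the count() output, and the p-value can then be compared with alpha. The names check and reject are introduced here for illustration. R may warn that the chi-squared approximation may be incorrect, because one expected cell count is small.

```r
alpha <- 0.1

# Hobbyists ("Yes") and group sizes, in the order "At least 30", "Under 30"
hobbyists   <- c(15, 1025)
group_sizes <- c(15, 15 + 0) |> replace(2, 1025 + 191)

# Two-sample test of equal proportions without continuity correction
check <- prop.test(x = hobbyists, n = group_sizes,
                   alternative = "two.sided", correct = FALSE)

check$statistic  # chi-squared statistic, about 2.79
check$p.value    # about 0.0949

# Decision rule: reject the null hypothesis when the p-value is at most alpha
reject <- check$p.value <= alpha
reject
```

At a significance level of 0.1 the null hypothesis is rejected, although the p-value is close to the threshold and one group contains only 15 observations.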
Plot type | base-R | ggplot2 |
---|---|---|
Scatter plot | plot(, type = "p") | ggplot() + geom_point() |
Line plot | plot(, type = "l") | ggplot() + geom_line() |
Histogram | hist() | ggplot() + geom_histogram() |
Box plot | boxplot() | ggplot() + geom_boxplot() |
Bar plot | barplot() | ggplot() + geom_bar() |
Pie plot | pie() | ggplot() + geom_bar() + coord_polar() |
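As an illustration of the first row of the table, the sketch below draws the same scatter plot both ways. It uses R's built-in mtcars data, which is an assumption made here for the example and not part of the course material.

```r
library(ggplot2)

# base-R: one function per plot type
plot(mtcars$wt, mtcars$mpg, type = "p")

# ggplot2: one starting function, with the plot type added as a layer
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

scatter
```

Swapping geom_point() for geom_line() changes the plot type without changing the rest of the call.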
In the infer package, generate() makes simulated data.
null_distn <- dataset %>%
  specify() %>%
  hypothesize() %>%
  generate() %>%
  calculate()
obs_stat <- dataset %>%
  specify() %>%
  calculate()
get_p_value(null_distn, obs_stat)
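The skeleton above could be filled in as follows for the hobbyist example. This is a sketch under assumptions: the sample is rebuilt from the printed counts and is not the original data, and the choices of 1,000 permutation replicates, the "diff in props" statistic, and the seed are illustrative.

```r
library(dplyr)
library(infer)

# Hypothetical reconstruction from the printed counts, not the original data
stack_overflow_imbalanced <- tibble(
  hobbyist = factor(rep(c("No", "Yes", "Yes"), times = c(191, 15, 1025))),
  age_cat  = factor(rep(c("Under 30", "At least 30", "Under 30"),
                        times = c(191, 15, 1025)))
)

set.seed(2024)  # arbitrary seed, only so the simulation is repeatable

# Null distribution: shuffle age_cat relative to hobbyist, 1,000 times
null_distn <- stack_overflow_imbalanced %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("At least 30", "Under 30"))

# Observed statistic: same specify() and calculate(), without simulation
obs_stat <- stack_overflow_imbalanced %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  calculate(stat = "diff in props", order = c("At least 30", "Under 30"))

p_val <- get_p_value(null_distn, obs_stat, direction = "two-sided")
p_val
```

Because the null distribution is simulated, the p-value varies with the seed and the number of replicates.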
specify() selects the variable(s) you want to test.
Use response ~ explanatory for a test with two variables.
Use response ~ NULL for a test with one variable.
stack_overflow_imbalanced %>%
  specify(hobbyist ~ age_cat, success = "Yes")
Response: hobbyist (factor)
Explanatory: age_cat (factor)
# A tibble: 1,231 x 2
  hobbyist age_cat
  <fct>    <fct>
1 Yes      At least 30
2 Yes      At least 30
3 Yes      At least 30
4 Yes      Under 30
5 Yes      At least 30
6 Yes      At least 30
7 No       Under 30
# ... with 1,224 more rows
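For comparison, a one-variable specification might look like the sketch below, which uses the response ~ NULL form. The sample is rebuilt from the counts printed earlier (an assumption, not the original data), and the name one_var is introduced here for illustration.

```r
library(dplyr)
library(infer)

# Hypothetical reconstruction from the printed counts, not the original data
stack_overflow_imbalanced <- tibble(
  hobbyist = factor(rep(c("No", "Yes", "Yes"), times = c(191, 15, 1025))),
  age_cat  = factor(rep(c("Under 30", "At least 30", "Under 30"),
                        times = c(191, 15, 1025)))
)

# Only the response is declared; there is no explanatory variable
one_var <- stack_overflow_imbalanced %>%
  specify(hobbyist ~ NULL, success = "Yes")

one_var
```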
hypothesize() declares the type of null hypothesis.
The null type is "independence" or "point".
Tests on a single variable use "point".
stack_overflow_imbalanced %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  hypothesize(null = "independence")
Response: hobbyist (factor)
Explanatory: age_cat (factor)
Null Hypothesis: independence
# A tibble: 1,231 x 2
  hobbyist age_cat
  <fct>    <fct>
1 Yes      At least 30
2 Yes      At least 30
3 Yes      At least 30
4 Yes      Under 30
5 Yes      At least 30
6 Yes      At least 30
7 No       Under 30
# ... with 1,224 more rows
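A "point" null hypothesis could be declared along the lines of the sketch below. The hypothesized proportion of 0.8 is a made-up value chosen only for illustration, the sample is rebuilt from the counts printed earlier (not the original data), and the name point_null is introduced here.

```r
library(dplyr)
library(infer)

# Hypothetical reconstruction from the printed counts, not the original data
stack_overflow_imbalanced <- tibble(
  hobbyist = factor(rep(c("No", "Yes", "Yes"), times = c(191, 15, 1025))),
  age_cat  = factor(rep(c("Under 30", "At least 30", "Under 30"),
                        times = c(191, 15, 1025)))
)

# One-variable test: the null states a specific proportion of "Yes"
point_null <- stack_overflow_imbalanced %>%
  specify(response = hobbyist, success = "Yes") %>%
  hypothesize(null = "point", p = 0.8)  # 0.8 is a hypothetical value

point_null
```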