Hypothesis Testing in R
Richie Cotton
Data Evangelist at DataCamp
converted_comp is a numerical variable.
age_first_code_cut is a categorical variable with levels ("child" and "adult").
$H_{0}$: The mean compensation (in USD) is the same for those that coded first as a child and those that coded first as an adult.
$H_{0}$: $\mu_{child} = \mu_{adult}$
$H_{0}$: $\mu_{child} - \mu_{adult} = 0$
$H_{A}$: The mean compensation (in USD) is greater for those that coded first as a child compared to those that coded first as an adult.
$H_{A}$: $\mu_{child} > \mu_{adult}$
$H_{A}$: $\mu_{child} - \mu_{adult} > 0$
stack_overflow %>%
  group_by(age_first_code_cut) %>%
  summarize(mean_compensation = mean(converted_comp))
# A tibble: 2 x 2
  age_first_code_cut mean_compensation
  <chr>                          <dbl>
1 adult                        111544.
2 child                        138275.
$z = \dfrac{\text{sample stat} - \text{population parameter}}{\text{standard error}}$
$t = \dfrac{\text{difference in sample stats} - \text{difference in population parameters}}{\text{standard error}}$
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) - (\mu_{\text{child}} - \mu_{\text{adult}})}{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$
$SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) \approx \sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}$
$s$ is the standard deviation of the variable.
$n$ is the sample size (number of observations/rows in sample).
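As a minimal sketch of the standard error approximation above, the helper below takes each group's standard deviation and sample size. The name se_diff_means is hypothetical (it is not from the course), and the inputs are made-up numbers chosen only so the result is easy to check by hand.

```r
# Hypothetical helper: approximate standard error of a difference in
# sample means, using the unpooled formula sqrt(s1^2 / n1 + s2^2 / n2).
se_diff_means <- function(s1, n1, s2, n2) {
  sqrt(s1 ^ 2 / n1 + s2 ^ 2 / n2)
}

# Illustrative (made-up) inputs: 3^2 / 9 + 4^2 / 16 = 1 + 1 = 2
se_example <- se_diff_means(s1 = 3, n1 = 9, s2 = 4, n2 = 16)
se_example  # equals sqrt(2)
```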
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) - (\mu_{\text{child}} - \mu_{\text{adult}})}{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$
$H_{0}$: $\mu_{\text{child}} - \mu_{\text{adult}} = 0$
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) }{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$
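One way to sanity-check the simplified formula is to compare it with the statistic reported by base R's t.test(), whose default (var.equal = FALSE) uses the same unpooled standard error. The sketch below uses simulated data as a stand-in for the two groups; the vectors child and adult, the seed, and the distribution parameters are all assumptions made for illustration and are not the Stack Overflow data.

```r
# Simulated (made-up) samples standing in for the two groups
set.seed(42)
child <- rnorm(50, mean = 12, sd = 3)
adult <- rnorm(80, mean = 10, sd = 4)

# Manual test statistic, assuming H0: mu_child - mu_adult = 0
t_manual <- (mean(child) - mean(adult)) /
  sqrt(sd(child) ^ 2 / length(child) + sd(adult) ^ 2 / length(adult))

# Built-in two-sample t-test; var.equal = FALSE is the default,
# so the standard error is not pooled across groups
t_builtin <- unname(t.test(child, adult, alternative = "greater")$statistic)

# The two statistics should agree up to floating-point error
all.equal(t_manual, t_builtin)
```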
stack_overflow %>%
  group_by(age_first_code_cut) %>%
  summarize(
    xbar = mean(converted_comp),
    s = sd(converted_comp),
    n = n()
  )
# A tibble: 2 x 4
  age_first_code_cut    xbar       s     n
  <chr>                <dbl>   <dbl> <int>
1 adult              111544. 270381.  1579
2 child              138275. 278130.  1001
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$
# xbar_*, s_*, and n_* are the per-group values from the summary table
numerator <- xbar_child - xbar_adult
denominator <- sqrt(
  s_child ^ 2 / n_child + s_adult ^ 2 / n_adult
)
t_stat <- numerator / denominator
2.4046
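The calculation can be reproduced without the stack_overflow dataset by typing in the summary statistics from the printed tibble. This is a sketch under that assumption: the inputs are the rounded values as displayed, so the result should match the test statistic above only up to rounding.

```r
# Summary statistics copied from the printed tibble (rounded as displayed)
xbar_child <- 138275
s_child <- 278130
n_child <- 1001

xbar_adult <- 111544
s_adult <- 270381
n_adult <- 1579

# Difference in sample means; the hypothesized difference under H0 is 0
numerator <- xbar_child - xbar_adult

# Unpooled standard error of the difference in means
denominator <- sqrt(s_child ^ 2 / n_child + s_adult ^ 2 / n_adult)

t_stat <- numerator / denominator
round(t_stat, 4)
```

Because the alternative hypothesis is that the child group has the greater mean, a positive t_stat points in the direction of $H_{A}$.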