Hypothesis Testing in Python
James Chapman
Curriculum Manager, DataCamp
converted_comp is a numerical variable
age_first_code_cut is a categorical variable with levels ("child" and "adult")
$H_{0}$: The mean compensation (in USD) is the same for those that coded first as a child and those that coded first as an adult.
$H_{0}$: $\mu_{child} = \mu_{adult}$
$H_{0}$: $\mu_{child} - \mu_{adult} = 0$
$H_{A}$: The mean compensation (in USD) is greater for those that coded first as a child compared to those that coded first as an adult.
$H_{A}$: $\mu_{child} > \mu_{adult}$
$H_{A}$: $\mu_{child} - \mu_{adult} > 0$
stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()
age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64
$z = \dfrac{\text{sample stat} - \text{population parameter}}{\text{standard error}}$
$t = \dfrac{\text{difference in sample stats} - \text{difference in population parameters}}{\text{standard error}}$
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) - (\mu_{\text{child}} - \mu_{\text{adult}})}{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$
$SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) \approx \sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}$
$s$ is the standard deviation of the variable
$n$ is the sample size (number of observations/rows in sample)
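The standard error approximation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's own code: the two lists are hypothetical toy samples (not Stack Overflow survey data), and the helper name `standard_error_of_difference` is made up for this example. `statistics.variance` uses the n - 1 denominator, which matches the default of pandas `.std()`.

```python
import math
import statistics

# Hypothetical toy samples (NOT from the Stack Overflow survey), in USD.
child = [52000.0, 61000.0, 58000.0, 70000.0]
adult = [48000.0, 55000.0, 50000.0, 53000.0, 49000.0]

def standard_error_of_difference(sample_a, sample_b):
    """Approximate SE of (mean_a - mean_b) from sample variances and sizes."""
    var_a = statistics.variance(sample_a)  # s_a ** 2, with n - 1 denominator
    var_b = statistics.variance(sample_b)  # s_b ** 2, with n - 1 denominator
    return math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))

se = standard_error_of_difference(child, adult)
print(se)
```

Each group contributes its own variance divided by its own sample size, so a small or highly variable group inflates the standard error more than a large, tightly clustered one.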
$H_{0}$: $\mu_{\text{child}} - \mu_{\text{adult}} = 0$ $\rightarrow$ $t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) }{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()
age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64
s = stack_overflow.groupby('age_first_code_cut')['converted_comp'].std()
age_first_code_cut
adult 271546.521729
child 255585.240115
Name: converted_comp, dtype: float64
n = stack_overflow.groupby('age_first_code_cut')['converted_comp'].count()
age_first_code_cut
adult 1376
child 885
Name: converted_comp, dtype: int64
$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$
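import numpy as np
# Select each group's value from the grouped Series by its index label
numerator = xbar['child'] - xbar['adult']
denominator = np.sqrt(s['child'] ** 2 / n['child'] + s['adult'] ** 2 / n['adult'])
t_stat = numerator / denominator
print(t_stat)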
1.8699313316221844
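The calculation above needs the stack_overflow DataFrame. As a sanity check that runs without the data file, the group summaries printed earlier can be plugged straight into the same formula. This is a sketch under stated assumptions: the function name `two_sample_t` is hypothetical, and because the inputs are the rounded values shown in the output, the result should agree with the reported statistic only to several decimal places.

```python
import math

def two_sample_t(xbar_1, s_1, n_1, xbar_2, s_2, n_2):
    """t statistic for H0: mu_1 - mu_2 = 0, using unpooled variances."""
    standard_error = math.sqrt(s_1 ** 2 / n_1 + s_2 ** 2 / n_2)
    return (xbar_1 - xbar_2) / standard_error

# Group summaries copied from the printed output (child first, adult second).
t_stat_check = two_sample_t(
    132419.570621, 255585.240115, 885,
    111313.311047, 271546.521729, 1376,
)
print(round(t_stat_check, 4))  # 1.8699
```

Passing child as the first group keeps the sign consistent with $H_{A}$: $\mu_{child} - \mu_{adult} > 0$, so a positive statistic points in the direction of the alternative.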