Hypothesis Testing in Python
James Chapman
Curriculum Manager, DataCamp
$H_{0}$: Proportion of hobbyist users is the same for those under thirty as for those at least thirty
$H_{0}$: $p_{\geq30} - p_{<30} = 0$
$H_{A}$: Proportion of hobbyist users is different for those under thirty than for those at least thirty
$H_{A}$: $p_{\geq30} - p_{<30} \neq 0$
alpha = 0.05
$$ z = \frac{(\hat{p}_{\geq30} - \hat{p}_{<30}) - 0}{\text{SE}(\hat{p}_{\geq30} - \hat{p}_{<30})} $$
$$ \hat{p} = \frac{n_{\geq30} \times \hat{p}_{\geq30} + n_{<30} \times \hat{p}_{<30}}{n_{\geq30} + n_{<30} } $$
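The standard error in the denominator of the z-score is not written out as a formula in this section. The form below is reconstructed from the `std_error` calculation in the code later on, with $\hat{p}$ being the pooled estimate defined above:

$$ \text{SE}(\hat{p}_{\geq30} - \hat{p}_{<30}) = \sqrt{\frac{\hat{p} \times (1 - \hat{p})}{n_{\geq30}} + \frac{\hat{p} \times (1 - \hat{p})}{n_{<30}}} $$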
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)
age_cat      hobbyist
At least 30  Yes         0.773333
             No          0.226667
Under 30     Yes         0.843105
             No          0.156895
Name: hobbyist, dtype: float64
n = stack_overflow.groupby("age_cat")['hobbyist'].count()
age_cat
At least 30    1050
Under 30       1211
Name: hobbyist, dtype: int64
p_hat_at_least_30 = p_hats[("At least 30", "Yes")]
p_hat_under_30 = p_hats[("Under 30", "Yes")]
print(p_hat_at_least_30, p_hat_under_30)
0.773333 0.843105
n_at_least_30 = n["At least 30"]
n_under_30 = n["Under 30"]
print(n_at_least_30, n_under_30)
1050 1211
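The grouping and indexing pattern above can be tried without the stack_overflow data. This is a small sketch on a made-up eight-row DataFrame; the `toy` name and its values are illustrative assumptions, not survey data:

```python
import pandas as pd

# Hypothetical data: 4 respondents per age group (not from the survey)
toy = pd.DataFrame({
    "age_cat": ["Under 30"] * 4 + ["At least 30"] * 4,
    "hobbyist": ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No"],
})

# Proportion of each hobbyist answer within each age group
toy_p_hats = toy.groupby("age_cat")["hobbyist"].value_counts(normalize=True)

# Number of rows in each age group
toy_n = toy.groupby("age_cat")["hobbyist"].count()

# The result has a two-level index, so a (group, answer) tuple selects one value
toy_p_under_30 = toy_p_hats[("Under 30", "Yes")]
toy_p_at_least_30 = toy_p_hats[("At least 30", "Yes")]
print(toy_p_under_30, toy_p_at_least_30)
print(toy_n["Under 30"], toy_n["At least 30"])
```

Because `value_counts()` is applied per group, the proportions within each age category sum to one.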
import numpy as np
p_hat = (n_at_least_30 * p_hat_at_least_30 + n_under_30 * p_hat_under_30) / (n_at_least_30 + n_under_30)
std_error = np.sqrt(p_hat * (1-p_hat) / n_at_least_30 + p_hat * (1-p_hat) / n_under_30)
z_score = (p_hat_at_least_30 - p_hat_under_30) / std_error
print(z_score)
-4.223718652693034
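The z-score can be turned into a p-value and compared with alpha. This is a minimal sketch, assuming the two-sided alternative stated in $H_{A}$ and using `scipy.stats.norm`, which this section does not itself import. The z-score is copied from the printed output above:

```python
from scipy.stats import norm

alpha = 0.05
z_score = -4.223718652693034  # value printed above

# Two-sided test: take the area in both tails beyond |z|
p_value = 2 * norm.cdf(-abs(z_score))

print(p_value)
print(p_value <= alpha)  # True means reject the null hypothesis at this alpha
```

Using `-abs(z_score)` keeps the calculation correct whichever group is subtracted from the other.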
stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat      hobbyist
At least 30  Yes          812
             No           238
Under 30     Yes         1021
             No           190
Name: hobbyist, dtype: int64
n_hobbyists = np.array([812, 1021])
n_rows = np.array([812 + 238, 1021 + 190])
from statsmodels.stats.proportion import proportions_ztest
z_score, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
                                     alternative="two-sided")
(-4.223691463320559, 2.403330142685068e-05)
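As a cross-check, the same result can be approximated directly from the raw counts using only the standard library. This is a sketch of the pooled two-proportion z-test as described in this section, not the statsmodels implementation; the helper name `two_proportion_z` is hypothetical:

```python
import math

def two_proportion_z(count_a, n_a, count_b, n_b):
    """Pooled two-sample z-test for proportions (two-sided p-value)."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Pooled proportion under the null hypothesis of equal proportions
    p_pooled = (count_a + count_b) / (n_a + n_b)
    std_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Two-sided tail area of the standard normal, written via erfc
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts from the value_counts() output: "At least 30" first, then "Under 30"
z_check, p_check = two_proportion_z(812, 812 + 238, 1021, 1021 + 190)
print(z_check, p_check)
```

The p-value is far below the alpha of 0.05 set at the start, so the null hypothesis of equal hobbyist proportions is rejected.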