Two-sample proportion tests

Hypothesis Testing in Python

James Chapman

Curriculum Manager, DataCamp

Comparing two proportions

$H_{0}$: The proportion of hobbyist users is the same for those under thirty as for those at least thirty

$H_{0}$: $p_{\geq30} - p_{<30} = 0$

$H_{A}$: The proportion of hobbyist users differs between those under thirty and those at least thirty

$H_{A}$: $p_{\geq30} - p_{<30} \neq 0$

alpha = 0.05

Calculating the z-score

z-score equation for a proportion test:

$$ z = \frac{(\hat{p}_{\geq30} - \hat{p}_{<30}) - 0}{\text{SE}(\hat{p}_{\geq30} - \hat{p}_{<30})} $$

Standard error equation:

$$ \text{SE}(\hat{p}_{\geq30} - \hat{p}_{<30}) = \sqrt{\dfrac{\hat{p} \times (1 - \hat{p})}{n_{\geq30}} + \dfrac{\hat{p} \times (1 - \hat{p})}{n_{<30}}} $$

$\hat{p}$ → weighted mean of $\hat{p}_{\geq30}$ and $\hat{p}_{<30}$

$$ \hat{p} = \frac{n_{\geq30} \times \hat{p}_{\geq30} + n_{<30} \times \hat{p}_{<30}}{n_{\geq30} + n_{<30} } $$

The z-score needs only four numbers from the sample: $\hat{p}_{\geq30}$, $\hat{p}_{<30}$, $n_{\geq30}$, and $n_{<30}$
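
As a rough check on the formulas above, the sketch below plugs four sample values into the pooled proportion, the standard error, and the z-score. The function name `two_proportion_z` is a hypothetical helper introduced for illustration, and the inputs are the rounded proportions and group sizes reported for this sample, so the result is only approximate.

```python
from math import sqrt

def two_proportion_z(p_hat_a, p_hat_b, n_a, n_b):
    """z-score for H0: p_a - p_b = 0, using the pooled proportion."""
    # Pooled estimate: sample-size-weighted mean of the two proportions
    p_hat = (n_a * p_hat_a + n_b * p_hat_b) / (n_a + n_b)
    # Standard error of the difference under H0
    std_error = sqrt(p_hat * (1 - p_hat) / n_a + p_hat * (1 - p_hat) / n_b)
    return (p_hat_a - p_hat_b) / std_error

# Group a: at least 30; group b: under 30 (rounded sample values)
z = two_proportion_z(0.773333, 0.843105, 1050, 1211)
print(round(z, 2))  # -4.22
```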

Getting the numbers for the z-score

p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)

age_cat      hobbyist
At least 30  Yes         0.773333
             No          0.226667
Under 30     Yes         0.843105
             No          0.156895
Name: hobbyist, dtype: float64

n = stack_overflow.groupby("age_cat")['hobbyist'].count()

age_cat
At least 30    1050
Under 30       1211
Name: hobbyist, dtype: int64

p_hat_at_least_30 = p_hats[("At least 30", "Yes")]
p_hat_under_30 = p_hats[("Under 30", "Yes")]
print(p_hat_at_least_30, p_hat_under_30)

0.773333 0.843105

n_at_least_30 = n["At least 30"]
n_under_30 = n["Under 30"]
print(n_at_least_30, n_under_30)

1050 1211

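import numpy as np

# Pooled proportion: sample-size-weighted mean of the two group proportions
p_hat = ((n_at_least_30 * p_hat_at_least_30 + n_under_30 * p_hat_under_30) /
         (n_at_least_30 + n_under_30))

# Standard error of the difference in proportions under H0
std_error = np.sqrt(p_hat * (1-p_hat) / n_at_least_30 +
                    p_hat * (1-p_hat) / n_under_30)

# z-score: observed difference divided by its standard error
z_score = (p_hat_at_least_30 - p_hat_under_30) / std_error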

print(z_score)

-4.223718652693034
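
The z-score can be converted into a two-sided p-value with the standard normal CDF and then compared with alpha. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in; that choice is an assumption of this example, since the slides do not show this step by hand.

```python
from statistics import NormalDist

z_score = -4.223718652693034  # value printed above
alpha = 0.05

# Two-sided p-value: probability that a standard normal value is
# at least as extreme as |z| in either tail
p_value = 2 * NormalDist().cdf(-abs(z_score))

print(p_value < alpha)  # True
```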

Proportion tests using proportions_ztest()

stack_overflow.groupby("age_cat")['hobbyist'].value_counts()

age_cat      hobbyist
At least 30  Yes          812
             No           238
Under 30     Yes         1021
             No           190
Name: hobbyist, dtype: int64

n_hobbyists = np.array([812, 1021])

n_rows = np.array([812 + 238, 1021 + 190])

from statsmodels.stats.proportion import proportions_ztest
z_score, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
                                     alternative="two-sided")

(-4.223691463320559, 2.403330142685068e-05)
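
As a cross-check, the same test can be reproduced from the raw counts with only the standard library. This is a sketch that assumes the pooled-proportion standard error from the z-score formula; `ztest_from_counts` is a hypothetical helper, not part of statsmodels, and it should agree with the output above only up to floating-point rounding.

```python
from math import sqrt
from statistics import NormalDist

def ztest_from_counts(count, nobs):
    """Two-sided, pooled two-sample proportion z-test from success counts and group sizes."""
    p_a, p_b = count[0] / nobs[0], count[1] / nobs[1]
    # Pooled proportion across both groups
    p_pooled = (count[0] + count[1]) / (nobs[0] + nobs[1])
    std_error = sqrt(p_pooled * (1 - p_pooled) * (1 / nobs[0] + 1 / nobs[1]))
    z = (p_a - p_b) / std_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# Hobbyist counts and group sizes: [at least 30, under 30]
z, p = ztest_from_counts([812, 1021], [812 + 238, 1021 + 190])
print(round(z, 3), p < 0.05)  # -4.224 True
```

Because the p-value is far below alpha = 0.05, the null hypothesis of equal proportions is rejected.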

Let's practice!
