One-sample proportion tests

Hypothesis Testing in Python

James Chapman

Curriculum Manager, DataCamp

Chapter 1 recap

Is a claim about an unknown population proportion feasible?

Estimate the standard error of the sample statistic from the bootstrap distribution
Compute a standardized test statistic
Calculate a p-value
Decide which hypothesis makes the most sense

Now, calculate the test statistic without using the bootstrap distribution

Standardized test statistic for proportions

$p$: population proportion (unknown population parameter)

$\hat{p}$: sample proportion (sample statistic)

$p_{0}$: hypothesized population proportion

$$ z = \frac{\hat{p} - \text{mean}(\hat{p})}{\text{SE}(\hat{p})} = \frac{\hat{p} - p}{\text{SE}(\hat{p})} $$

Assuming $H_{0}$ is true, $p = p_{0}$, so

$$ z = \dfrac{\hat{p} - p_{0}}{\text{SE}(\hat{p})} $$

Simplifying the standard error calculations

$\text{SE}(\hat{p}) = \sqrt{\dfrac{p_{0}*(1-p_{0})}{n}}$ $\rightarrow$ Under $H_0$, $\text{SE}(\hat{p})$ depends on the hypothesized $p_0$ and the sample size $n$

Assuming $H_{0}$ is true,

$z = \dfrac{\hat{p} - p_{0}}{\sqrt{\dfrac{p_{0}*(1-p_{0})}{n}}}$

Only uses sample information ($\hat{p}$ and $n$) and the hypothesized parameter ($p_{0}$)
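
The formula above can be wrapped in a small helper. This is a minimal sketch: the function name `proportion_z_score` and the example numbers are assumptions for illustration, not part of the course material.

```python
import math


def proportion_z_score(p_hat, p_0, n):
    """Standardized test statistic for a one-sample proportion test.

    Hypothetical helper: uses the standard error under H0,
    sqrt(p_0 * (1 - p_0) / n), as in the formula above.
    """
    standard_error = math.sqrt(p_0 * (1 - p_0) / n)
    return (p_hat - p_0) / standard_error


# Illustrative numbers (not from the Stack Overflow data):
# a sample proportion of 0.6 from n = 100, tested against p_0 = 0.5
z = proportion_z_score(p_hat=0.6, p_0=0.5, n=100)
print(z)
```

Note that the helper needs only $\hat{p}$, $p_{0}$ and $n$; no bootstrap distribution is involved.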

Why z instead of t?

$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$

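$s$ is calculated from $\bar{x}$, so the same sample is used twice:
- $\bar{x}$ estimates the population mean
- $s$ estimates the population standard deviation
- Estimating both from one sample increases the uncertainty in our estimate of the parameter

The t-distribution accounts for this extra uncertainty with fatter tails than a normal distribution.

For proportions, $\hat{p}$ only appears in the numerator (the standard error uses $p_{0}$, not $\hat{p}$), so z-scores are fine.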
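
To make "fatter tails" concrete, the sketch below compares the two-tailed probability beyond a fixed cutoff under a standard normal and under t-distributions. The cutoff of 2 and the degrees of freedom (5 and 100) are arbitrary choices for illustration, not values from the course.

```python
from scipy.stats import norm, t

cutoff = 2.0  # arbitrary standardized cutoff, chosen for illustration

# Two-tailed probability of landing beyond +/- cutoff.
# sf(x) is the survival function, i.e. 1 - cdf(x).
normal_tail = 2 * norm.sf(cutoff)
t_tail_small_df = 2 * t.sf(cutoff, df=5)
t_tail_large_df = 2 * t.sf(cutoff, df=100)

print(f"normal:       {normal_tail:.4f}")
print(f"t (df = 5):   {t_tail_small_df:.4f}")
print(f"t (df = 100): {t_tail_large_df:.4f}")
```

The t-distribution puts more probability in the tails, and the gap shrinks as the degrees of freedom grow.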

Stack Overflow age categories

$H_{0}$: Proportion of Stack Overflow users under thirty $=0.5$

$H_{A}$: Proportion of Stack Overflow users under thirty $\neq0.5$

alpha = 0.01

stack_overflow['age_cat'].value_counts(normalize=True)

Under 30       0.535604
At least 30    0.464396
Name: age_cat, dtype: float64

Variables for z

p_hat = (stack_overflow['age_cat'] == 'Under 30').mean()

0.5356037151702786

p_0 = 0.50

n = len(stack_overflow)

Calculating the z-score

$z = \dfrac{\hat{p} - p_{0}}{\sqrt{\dfrac{p_{0}*(1-p_{0})}{n}}}$

import numpy as np
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)
z_score = numerator / denominator

3.385911440783663
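
The slides do not print `n`. As a sanity check, the sample size can be backed out by rearranging the z-score formula and plugging in the reported `p_hat`, `p_0` and `z_score`. The variable name `n_implied` is an assumption for illustration.

```python
import math

# Values reported on the slides
p_hat = 0.5356037151702786
p_0 = 0.50
z_score = 3.385911440783663

# Rearranging z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n) for n:
# n = (z * sqrt(p_0 * (1 - p_0)) / (p_hat - p_0)) ** 2
n_implied = (z_score * math.sqrt(p_0 * (1 - p_0)) / (p_hat - p_0)) ** 2
print(round(n_implied))
```

With these values the implied sample size comes out at about 2,261 respondents.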

Calculating the p-value

[Figure: CDF of the normal distribution. The part of the line that's less than -2 is in red and the part of the line that's more than 2 is in green.]

Left-tailed ("less than"):

from scipy.stats import norm
p_value = norm.cdf(z_score)

Right-tailed ("greater than"):

p_value = 1 - norm.cdf(z_score)

Two-tailed ("not equal"):

p_value = norm.cdf(-z_score) + (1 - norm.cdf(z_score))

p_value = 2 * (1 - norm.cdf(z_score))

0.0007094227368100725

p_value <= alpha

True
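
The three tail calculations can be collected into one helper. This is a sketch under stated assumptions: the function name `p_value_from_z` and the `alternative` labels are hypothetical, and the two-tailed branch uses `abs(z_score)` so that it also works for negative z-scores (the two-tailed formulas above assume `z_score` is positive).

```python
from scipy.stats import norm


def p_value_from_z(z_score, alternative="two-sided"):
    """Hypothetical helper: p-value for a z-score under a standard normal."""
    if alternative == "less":
        return norm.cdf(z_score)
    if alternative == "greater":
        return norm.sf(z_score)  # same as 1 - norm.cdf(z_score)
    if alternative == "two-sided":
        return 2 * norm.sf(abs(z_score))
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


z_score = 3.385911440783663  # z-score reported above
alpha = 0.01

p_value = p_value_from_z(z_score, alternative="two-sided")
print(p_value, p_value <= alpha)
```

Since the p-value is below `alpha`, the null hypothesis that the proportion of users under thirty equals 0.5 is rejected at the 0.01 significance level.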

Let's practice!
