One-sample proportion tests

Hypothesis Testing in Python

James Chapman

Curriculum Manager, DataCamp

Chapter 1 recap

  • Is a claim about an unknown population proportion feasible?

 

  1. Estimate the standard error of the sample statistic from the bootstrap distribution
  2. Compute a standardized test statistic
  3. Calculate a p-value
  4. Decide which hypothesis makes the most sense

 

  • Now, calculate the test statistic without using the bootstrap distribution

Standardized test statistic for proportions

$p$: population proportion (unknown population parameter)

$\hat{p}$: sample proportion (sample statistic)

$p_{0}$: hypothesized population proportion

$$ z = \frac{\hat{p} - \text{mean}(\hat{p})}{\text{SE}(\hat{p})} = \frac{\hat{p} - p}{\text{SE}(\hat{p})} $$

Assuming $H_{0}$ is true, $p = p_{0}$, so

$$ z = \dfrac{\hat{p} - p_{0}}{\text{SE}(\hat{p})} $$

Simplifying the standard error calculations

$SE_{\hat{p}} = \sqrt{\dfrac{p_{0}*(1-p_{0})}{n}}$ $\rightarrow$ Under $H_0$, $SE_{\hat{p}}$ depends on hypothesized $p_0$ and sample size $n$

Assuming $H_{0}$ is true,

$z = \dfrac{\hat{p} - p_{0}}{\sqrt{\dfrac{p_{0}*(1-p_{0})}{n}}}$

  • Only uses sample information ($\hat{p}$ and $n$) and the hypothesized parameter ($p_{0}$)
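
As a rough check that the formula can stand in for the bootstrap step from Chapter 1, the sketch below computes the standard error both ways. It does not use the real survey rows: the array is a stand-in rebuilt from the figures quoted in the Stack Overflow example later in this deck (n = 2261 and a sample proportion of about 0.5356, which corresponds to 1211 respondents under thirty). The seed and the number of resamples are arbitrary choices.

```python
import numpy as np

# Stand-in data rebuilt from the summary figures in this deck
# (assumption: 1211 of 2261 respondents are under thirty).
n = 2261
n_under_30 = 1211
under_30 = np.concatenate([np.ones(n_under_30), np.zeros(n - n_under_30)])

p_0 = 0.50  # hypothesized population proportion

# Chapter 1 approach: resample with replacement and take the standard
# deviation of the bootstrap sample proportions.
rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility
boot_props = [
    rng.choice(under_30, size=n, replace=True).mean() for _ in range(5000)
]
se_bootstrap = np.std(boot_props, ddof=1)

# This chapter's approach: plug p_0 and n into the formula.
se_formula = np.sqrt(p_0 * (1 - p_0) / n)

print(se_bootstrap, se_formula)
```

The two values should be close but not identical: the bootstrap distribution is centered on $\hat{p}$, while the formula assumes $H_{0}$ and so uses $p_{0}$.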

Why z instead of t?

$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$

  • $s$ is calculated from $\bar{x}$
    • $\bar{x}$ estimates the population mean
    • $s$ estimates the population standard deviation
    • $\uparrow$ uncertainty in our estimate of the parameter
  • t-distribution - fatter tails than a normal distribution
  • For proportions, the standard error uses $p_{0}$ and $n$ rather than a sample estimate; $\hat{p}$ only appears in the numerator, so z-scores are fine
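
To make the "fatter tails" point concrete, here is a small sketch comparing the upper-tail probability beyond 2 for the standard normal distribution and for t-distributions. The cutoff and the degrees of freedom are illustrative choices, not values from the course.

```python
from scipy.stats import norm, t

cutoff = 2.0  # illustrative cutoff

# Upper-tail probability P(X > cutoff); sf is the survival function, 1 - cdf.
normal_tail = norm.sf(cutoff)

# The same tail for t-distributions with a few illustrative degrees of freedom.
t_tails = {df: t.sf(cutoff, df=df) for df in (5, 30, 1000)}

print(f"normal:     {normal_tail:.4f}")
for df, tail in t_tails.items():
    print(f"t, df={df:<4}: {tail:.4f}")
```

The t tails are heavier than the normal tail and shrink toward it as the degrees of freedom grow.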

Stack Overflow age categories

$H_{0}$: Proportion of Stack Overflow users under thirty $=0.5$

$H_{A}$: Proportion of Stack Overflow users under thirty $\neq0.5$

alpha = 0.01
stack_overflow['age_cat'].value_counts(normalize=True)
Under 30       0.535604
At least 30    0.464396
Name: age_cat, dtype: float64

Variables for z

p_hat = (stack_overflow['age_cat'] == 'Under 30').mean()
0.5356037151702786
p_0 = 0.50
n = len(stack_overflow)
2261

Calculating the z-score

$z = \dfrac{\hat{p} - p_{0}}{\sqrt{\dfrac{p_{0}*(1-p_{0})}{n}}}$

import numpy as np
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)
z_score = numerator / denominator
3.385911440783663
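
As a hand check of the result, the same arithmetic with the values rounded for readability:

$$ z = \dfrac{0.5356 - 0.50}{\sqrt{\dfrac{0.50 \times (1 - 0.50)}{2261}}} \approx \dfrac{0.0356}{0.010515} \approx 3.386 $$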

Calculating the p-value

[Figure: CDF of the normal distribution. The part of the curve below -2 is in red and the part above 2 is in green.]

Left-tailed ("less than"):

from scipy.stats import norm
p_value = norm.cdf(z_score)

Right-tailed ("greater than"):

p_value = 1 - norm.cdf(z_score)

Two-tailed ("not equal"):

p_value = (norm.cdf(-z_score) +
           1 - norm.cdf(z_score))
# Equivalent, because the normal distribution is symmetric
p_value = 2 * (1 - norm.cdf(z_score))
0.0007094227368100725
p_value <= alpha
True
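
The steps on the last few slides can be collected into one function. The sketch below is a hypothetical helper, not part of the course code: the name `one_sample_prop_ztest` and the `alternative` argument are assumptions made for illustration. It uses `norm.sf` (the survival function, equal to `1 - norm.cdf`) and takes the absolute value of the z-score in the two-tailed case, so the result also holds when the z-score is negative.

```python
import numpy as np
from scipy.stats import norm


def one_sample_prop_ztest(p_hat, p_0, n, alternative="two-sided"):
    """Return (z_score, p_value) for a one-sample proportion z-test.

    Hypothetical helper for illustration only.
    """
    # Standard error under H0 depends only on p_0 and n.
    z_score = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)
    if alternative == "less":
        p_value = norm.cdf(z_score)
    elif alternative == "greater":
        p_value = norm.sf(z_score)  # same as 1 - norm.cdf(z_score)
    elif alternative == "two-sided":
        # abs() keeps the two-tailed p-value valid for negative z-scores
        p_value = 2 * norm.sf(abs(z_score))
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    return z_score, p_value


# Values taken from the Stack Overflow example in these slides
alpha = 0.01
z_score, p_value = one_sample_prop_ztest(p_hat=0.5356037151702786, p_0=0.50, n=2261)
print(z_score, p_value, p_value <= alpha)
```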

Let's practice!
