Hypothesis Testing in Python
James Chapman
Curriculum Manager, DataCamp
age_first_code_cut classifies when each Stack Overflow user first started programming
"adult" means they started at 14 or older
"child" means they started before 14
A hypothesis is a statement about an unknown population parameter
A hypothesis test is a test of two competing hypotheses
The null hypothesis ($H_{0}$) is the existing idea
The alternative hypothesis ($H_{A}$) is the new "challenger" idea of the researcher
For our problem:
$H_{0}$: The proportion of data scientists starting programming as children is 35%
Significance level is "beyond a reasonable doubt" for hypothesis testing
Hypothesis tests check if the sample statistics lie in the tails of the null distribution
| Test | Tails |
|---|---|
| alternative different from null | two-tailed |
| alternative greater than null | right-tailed |
| alternative less than null | left-tailed |
$H_{A}$: The proportion of data scientists starting programming as children is greater than 35%
This is a right-tailed test
p-values: probability of obtaining a result at least as extreme as the observed one, assuming the null hypothesis is true
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()
0.39141972578505085
prop_child_hyp = 0.35
import numpy as np
std_error = np.std(first_code_boot_distn, ddof=1)
0.010351057228878566
z_score = (prop_child_samp - prop_child_hyp) / std_error
4.001497129152506
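The calculation above relies on first_code_boot_distn, which is not defined on these slides. A minimal sketch of how such a bootstrap distribution could be built is shown here, assuming a synthetic stand-in for the age_first_code_cut column. The sample size, random seed, class proportions, and number of replicates are illustrative assumptions, not the values behind the numbers shown above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for stack_overflow['age_first_code_cut'] (assumption:
# 2,000 respondents, drawn with a 39% chance of being labelled "child").
age_first_code_cut = rng.choice(["child", "adult"], size=2000, p=[0.39, 0.61])

# Resample with replacement and record the proportion of "child" each time.
n_replicates = 1000  # assumed number of bootstrap replicates
first_code_boot_distn = np.array([
    (rng.choice(age_first_code_cut, size=age_first_code_cut.size, replace=True)
     == "child").mean()
    for _ in range(n_replicates)
])

prop_child_samp = (age_first_code_cut == "child").mean()
prop_child_hyp = 0.35

# The standard deviation of the bootstrap distribution estimates the standard error.
std_error = np.std(first_code_boot_distn, ddof=1)

# Standardize the gap between the sample statistic and the hypothesized value.
z_score = (prop_child_samp - prop_child_hyp) / std_error
print(prop_child_samp, std_error, z_score)
```

Because the data here are simulated, the printed values will differ from those on the slides; only the sequence of steps is meant to carry over.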
norm.cdf() is the normal CDF from scipy.stats
Left-tailed test: use norm.cdf()
Right-tailed test: use 1 - norm.cdf()
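The link between the tail of the test and the CDF call can be sketched as one small function. The helper name p_value_from_z and its tails argument are hypothetical, introduced only for illustration. The two-tailed branch is not shown on the slides; it uses the standard approach of doubling the upper-tail area beyond the absolute z-score.

```python
from scipy.stats import norm


def p_value_from_z(z_score, tails):
    """Convert a z-score to a p-value under a standard normal null distribution.

    Hypothetical helper: `tails` is one of "left", "right", or "two".
    """
    if tails == "left":
        # Area to the left of the z-score.
        return norm.cdf(z_score)
    if tails == "right":
        # Area to the right of the z-score.
        return 1 - norm.cdf(z_score)
    if tails == "two":
        # Area in both tails beyond |z| (standard approach, not from the slides).
        return 2 * (1 - norm.cdf(abs(z_score)))
    raise ValueError("tails must be 'left', 'right', or 'two'")


# The test in this example is right-tailed.
print(p_value_from_z(4.001497129152506, "right"))
```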
from scipy.stats import norm
1 - norm.cdf(z_score, loc=0, scale=1)
3.1471479512323874e-05
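To connect this p-value back to the significance level, one possible decision step is sketched below. The threshold alpha = 0.05 is an assumption chosen for illustration; the slides do not state a value here. The z-score is the one computed above.

```python
from scipy.stats import norm

alpha = 0.05  # assumed significance level, not stated on the slides
z_score = 4.001497129152506  # z-score computed above

# Right-tailed test: the p-value is the area to the right of the z-score.
p_value = 1 - norm.cdf(z_score, loc=0, scale=1)

if p_value <= alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(p_value, decision)
```

With a p-value this small, the same decision would follow for any commonly used significance level.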