Normality tests

Foundations of Inference in Python

Paul Savala

Assistant Professor of Mathematics

Height of US males

A histogram that is approximately normally distributed, with a mean height of 180 centimeters, a minimum height of 160 centimeters, and a maximum height of 200 centimeters.

Model residuals

A scatter plot with years of employment on the x-axis, annual salary on the y-axis, and a generally positive linear trend. A red line of best fit is also drawn on the data.

Expect residuals to fall evenly above and below the prediction

Model residuals

A histogram with "residual (error)" on the x-axis, "count" on the y-axis, and a bimodal distribution with a mode around negative ten thousand, and another mode around positive thirty thousand.
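
The residual check described on these slides can be sketched as follows. This is a minimal illustration on synthetic data; the names `years` and `salary` and all numeric values are hypothetical stand-ins, not the dataset behind the plots.

```python
import numpy as np

# Synthetic stand-in data (hypothetical; not the dataset shown on the slide)
rng = np.random.default_rng(0)
years = rng.uniform(0, 30, size=200)
salary = 50_000 + 1_500 * years + rng.normal(0, 8_000, size=200)

# Fit a straight line of best fit: salary ~ slope * years + intercept
slope, intercept = np.polyfit(years, salary, deg=1)

# Residual = observed value minus the model's prediction
residuals = salary - (slope * years + intercept)

# Bin the residuals; a roughly symmetric, single-peaked histogram is what
# a well-behaved linear model would be expected to produce
counts, bin_edges = np.histogram(residuals, bins=20)
print(counts)
```

A bimodal histogram like the one described above would suggest the linear model is missing some structure in the data.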

Applications of normal distributions

Parametric tests - Hypothesis tests assuming normality
T-test for comparing means:
- Assumes sample means are normally distributed
- If not, the test's conclusions may be invalid

A histogram with salaries between sixty thousand and ninety five thousand on the x-axis, and frequency on the y-axis. The histogram is relatively close to normal.
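
As a rough illustration of the parametric test mentioned above, the sketch below runs a two-sample t-test with `scipy.stats.ttest_ind`. The two groups are synthetic and hypothetical; they are drawn from normal distributions, so the normality assumption holds by construction.

```python
import numpy as np
from scipy import stats

# Two hypothetical salary samples, drawn from normal distributions
rng = np.random.default_rng(1)
group_a = rng.normal(70_000, 5_000, size=200)
group_b = rng.normal(75_000, 5_000, size=200)

# Two-sample t-test comparing the group means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```

If the samples were far from normal and small, the p-value from this test would be harder to trust, which is why a normality check comes first.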

Anderson-Darling test for normality

Tests assumption of normality

$H_0$: Data is normally distributed

$H_a$: Data is not normally distributed
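
For reference, the standard textbook form of the Anderson-Darling statistic is given below. The notation is an assumption added here, not taken from the slides: $Y_{(1)} \le \dots \le Y_{(n)}$ is the sorted sample and $F$ is the cumulative distribution function of the hypothesized normal distribution.

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \ln F\left(Y_{(i)}\right) + \ln\left(1 - F\left(Y_{(n+1-i)}\right)\right) \right]$$

Larger values of $A^2$ indicate a worse fit to the normal distribution, so $H_0$ is rejected when the statistic exceeds the critical value for the chosen significance level.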

Anderson-Darling test in SciPy

from scipy import stats

result = stats.anderson(police_df['Annual Salary'])

result.statistic

27.41

result.critical_values

[0.574, 0.654, 0.784, 0.915, 1.088]

# Significance levels (in percent) at which H_0 is rejected
result.significance_level[result.statistic > result.critical_values]

[15.  10.   5.   2.5  1. ]
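
To see how the comparison against critical values behaves, the sketch below applies the same `stats.anderson` call to two synthetic samples. Both samples and their parameters are hypothetical; `police_df` is not used here.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: one drawn from a normal, one strongly right-skewed
rng = np.random.default_rng(2)
samples = {
    "normal": rng.normal(loc=75_000, scale=8_000, size=500),
    "skewed": rng.exponential(scale=8_000, size=500),
}

results = {}
for name, sample in samples.items():
    result = stats.anderson(sample, dist="norm")
    results[name] = result
    # Significance levels (in percent) at which normality is rejected
    rejected = result.significance_level[result.statistic > result.critical_values]
    print(name, round(float(result.statistic), 3), rejected)
```

When the statistic exceeds every critical value, as on the slide above, normality is rejected at all listed significance levels, including the strictest one (1%).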

Fitting a normal distribution

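# Estimate the mean and standard deviation of a normal fitted to the data
mu, std = stats.norm.fit(police_df['Annual Salary'])

# Proportion of salaries the fitted normal predicts to be under 70,000
estimated_pct_under_70k = stats.norm.cdf(70000, loc=mu, scale=std)

print(estimated_pct_under_70k)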

0.27

# Observed proportion of salaries under 70,000
actual_under_70k = police_df[police_df['Annual Salary'] < 70000]

print(len(actual_under_70k) / len(police_df))

0.20
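
The gap between the fitted estimate and the observed proportion can be reproduced on synthetic data. The sketch below assumes a hypothetical right-skewed salary sample (lognormal, with made-up parameters), fits a normal to it anyway, and compares the two proportions.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed salaries (not the police_df data)
rng = np.random.default_rng(3)
salaries = rng.lognormal(mean=11.2, sigma=0.25, size=1_000)

# Fit a normal distribution even though the data is not normal
mu, std = stats.norm.fit(salaries)

threshold = 70_000
# Proportion under the threshold according to the fitted normal
estimated = stats.norm.cdf(threshold, loc=mu, scale=std)
# Proportion under the threshold actually observed in the sample
actual = np.mean(salaries < threshold)

print(f"fitted normal: {estimated:.3f}, observed: {actual:.3f}")
```

A noticeable difference between the two numbers is one practical symptom of the normality assumption failing.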

Let's practice!
