Normal data

Experimental Design in Python

James Chapman

Curriculum Manager, DataCamp

The normal distribution

The familiar 'bell curve' shape
Basis for z-scores

$$ z = \frac{x - \mu}{\sigma} $$

Standard normal distribution: mean = 0, std = 1
- 'How many standard deviations is this point from the mean?'
- 'What is the probability of obtaining this score?'
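Both questions can be sketched in a few lines. The values of `x`, `mu`, and `sigma` below are hypothetical, chosen only for illustration. For continuous data, "the probability of this score" is read here as a tail probability under the standard normal.

```python
from scipy.stats import norm

# Hypothetical values, not taken from the course data
x = 85.0      # observed score
mu = 70.0     # population mean
sigma = 10.0  # population standard deviation

# How many standard deviations is this point from the mean?
z = (x - mu) / sigma

# Probability of a score at or below x, and of a score above x
p_below = norm.cdf(z)
p_above = norm.sf(z)

print(f"z = {z:.2f}")
print(f"P(score <= x) = {p_below:.4f}")
print(f"P(score > x)  = {p_above:.4f}")
```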

A plot of a typical bell curve of data, a blue bell curve line on a white background.

Normal data and statistical tests

Normality is an assumption of many parametric tests
Nonparametric tests: don't assume normal data
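As a rough illustration of the distinction, the sketch below runs one parametric test (independent t-test) and one nonparametric counterpart (Mann-Whitney U) on the same two groups. The groups are synthetic and the choice of these two tests is an assumption made for the example, not something the slides prescribe.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Synthetic groups, generated only for illustration
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)

# Parametric: assumes each group is approximately normal
t_stat, t_p = ttest_ind(group_a, group_b)

# Nonparametric: rank-based, no normality assumption
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p-value:       {t_p:.4f}")
print(f"Mann-Whitney p-value: {u_p:.4f}")
```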


Normal, Z, and alpha

Crucial link to significance level ($\alpha$)
Compare p-value to $\alpha$
Probability of a Type I error
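The link between the normal curve, z, and $\alpha$ can be sketched as follows. The observed z value is hypothetical, and a two-tailed test is assumed.

```python
from scipy.stats import norm

alpha = 0.05

# Critical z for a two-tailed test: alpha is split across both tails
z_crit = norm.ppf(1 - alpha / 2)

# Hypothetical observed z-score, for illustration only
z_obs = 2.3

# Two-tailed p-value: area in both tails beyond |z_obs|
p_value = 2 * norm.sf(abs(z_obs))
reject = p_value < alpha

print(f"critical z: {z_crit:.3f}")
print(f"p-value:    {p_value:.4f}")
print("reject H0" if reject else "fail to reject H0")
```

Comparing the p-value to `alpha` and comparing `abs(z_obs)` to `z_crit` lead to the same decision.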

A normal distribution with two small areas at each tail filled in black

Visualizing normal data

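import matplotlib.pyplot as plt
import seaborn as sns

# Kernel density estimate of the salary column
sns.displot(data=salaries,
            x='salary',
            kind="kde")
plt.show()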

A bell curve distribution that is taller and less wide than a traditional one, but still with the typical bell curve shape

QQ plots

QQ plot: compare data to a particular distribution

import matplotlib.pyplot as plt
from scipy.stats import norm
from statsmodels.graphics.gofplots import qqplot

# line='s' draws a reference line scaled to the sample's mean and std
qqplot(salaries['salary'],
       line='s',
       dist=norm)
plt.show()

Ideal: dots hug the line
Bad: dots bow away from the line at the ends
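One possible numeric companion to the visual check: `scipy.stats.probplot` returns the correlation `r` between the sample quantiles and the theoretical normal quantiles, so an `r` near 1 corresponds to dots hugging the line. The two samples below are synthetic, and using `r` this way is a rough heuristic, not a formal test.

```python
import numpy as np
from scipy.stats import probplot

# Synthetic samples, for illustration only
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

# probplot returns ((theoretical, ordered), (slope, intercept, r))
(_, _), (_, _, r_normal) = probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = probplot(skewed_sample, dist="norm")

print(f"r for normal sample: {r_normal:.4f}")
print(f"r for skewed sample: {r_skewed:.4f}")
```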

A qq plot where all the dots are mostly hugging the middle 45-degree line closely

A qq plot where the dots in the middle of the 45-degree line hug the line closely but at both ends the dots bow inwards making a curved line from the dots

Tests for normality

Shapiro-Wilk (good for smaller datasets)
D'Agostino $K^2$ (uses kurtosis and skewness)
Anderson-Darling (returns critical values at several significance levels)

$H_0$ = "Data is drawn from a Normal Distribution"
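D'Agostino's $K^2$ test is available in SciPy as `scipy.stats.normaltest`. A minimal sketch, assuming a synthetic sample as a stand-in for `salaries['salary']`, which is not available here:

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic stand-in for salaries['salary'] (an assumption for this sketch)
rng = np.random.default_rng(0)
sample = rng.normal(loc=60000, scale=8000, size=200)

alpha = 0.05

# The test statistic combines skewness and kurtosis
stat, p = normaltest(sample)
print(f"p: {p:.4f} test stat: {stat:.4f}")

if p > alpha:
    print("Fail to reject H0: no evidence against normality")
else:
    print("Reject H0: data unlikely to be normal")
```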

A Shapiro-Wilk test

from scipy.stats import shapiro
alpha = 0.05

stat, p = shapiro(salaries['salary'])
print(f"p: {round(p,4)} test stat: {round(stat,4)}")

p: 0.8293 test stat: 0.9956

p > alpha
- Fail to reject $H_0$ → no evidence against normality

An Anderson-Darling test

from scipy.stats import anderson
result = anderson(x=salaries['salary'], dist="norm")

print(round(result.statistic,4))
print(result.significance_level)
print(result.critical_values)

0.2748
[15.  10.   5.   2.5  1. ]
[0.572 0.651 0.781 0.911 1.084]

0.2748 < every critical value in [0.572 0.651 0.781 0.911 1.084]
- Fail to reject $H_0$ at each significance level → no evidence against normality
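The comparison can also be done level by level. This sketch reuses the numbers printed above rather than re-running the test. The significance levels are percentages, and $H_0$ is rejected at a level when the statistic exceeds that level's critical value.

```python
# Values copied from the Anderson-Darling output shown above
statistic = 0.2748
significance_levels = [15.0, 10.0, 5.0, 2.5, 1.0]   # in percent
critical_values = [0.572, 0.651, 0.781, 0.911, 1.084]

decisions = []
for level, crit in zip(significance_levels, critical_values):
    # Reject H0 at this level only if the statistic exceeds the critical value
    reject = statistic > crit
    decisions.append(reject)
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"alpha = {level / 100:.3f}: critical value {crit} -> {verdict}")
```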

Let's practice!
