Measures of spread

Introduction to Statistics in Python

Maggie Matsui

Content Developer, DataCamp

What is spread?

Two histograms: one that's narrow with data only spanning a few values, one that's wider with data spanning more values.

Variance

Average squared distance from each data point to the data's mean

A dot plot of 7 data points with a red line in the middle representing mean.

A dot plot of 7 data points with a red line in the middle representing mean. Arrows are drawn between each dot and the mean line.

Calculating variance

1. Subtract mean from each data point

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
print(dists)

0     1.666265
1     6.566265
2     3.966265
3     4.466265
4    -6.433735
      ...

2. Square each distance

sq_dists = dists ** 2
print(sq_dists)

0      2.776439
1     43.115837
2     15.731259
3     19.947524
4     41.392945
      ...

Calculating variance

3. Sum squared distances

sum_sq_dists = np.sum(sq_dists)
print(sum_sq_dists)

1624.065542

4. Divide by number of data points - 1

variance = sum_sq_dists / (83 - 1)  # msleep has 83 data points
print(variance)

19.805677
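The four steps can be checked end to end on a small array. This is a minimal sketch: the `sleep_hours` values are made up for illustration and are not taken from `msleep`.

```python
import numpy as np

# Hypothetical sample of sleep totals (hours), not from msleep
sleep_hours = np.array([12.0, 17.0, 14.0, 15.0, 4.0])

# 1. Subtract the mean from each data point
dists = sleep_hours - np.mean(sleep_hours)

# 2. Square each distance
sq_dists = dists ** 2

# 3. Sum the squared distances
sum_sq_dists = np.sum(sq_dists)

# 4. Divide by the number of data points minus 1
variance = sum_sq_dists / (len(sleep_hours) - 1)

print(variance)
print(np.var(sleep_hours, ddof=1))  # should agree with the manual result
```

Using `len(sleep_hours)` instead of a hard-coded count keeps the calculation correct if the number of data points changes.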

Use np.var()

np.var(msleep['sleep_total'], ddof=1)

19.805677

Without ddof=1, population variance is calculated instead of sample variance:

np.var(msleep['sleep_total'])

19.567055
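The two results differ only in the divisor: `ddof=1` divides by n - 1, while the default `ddof=0` divides by n. A small sketch on made-up values (not from `msleep`) shows the relationship:

```python
import numpy as np

# Made-up data for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

sample_var = np.var(values, ddof=1)   # divides by n - 1
population_var = np.var(values)       # default ddof=0, divides by n

print(sample_var, population_var)

# The population variance is the sample variance scaled by (n - 1) / n
print(np.isclose(population_var, sample_var * (n - 1) / n))
```

The gap shrinks as n grows, which is why the two values for `sleep_total` are already close with 83 data points.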

Standard deviation

np.sqrt(np.var(msleep['sleep_total'], ddof=1))

4.450357

np.std(msleep['sleep_total'], ddof=1)

4.450357

Mean absolute deviation

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])

np.mean(np.abs(dists))

3.566701

Standard deviation vs. mean absolute deviation

Standard deviation squares distances, penalizing longer distances more than shorter ones.
Mean absolute deviation penalizes each distance equally.
Neither is better than the other, but standard deviation (SD) is more commonly used than mean absolute deviation (MAD).
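One way to see the difference is to compare two data sets with the same mean absolute deviation but differently distributed distances. The sketch below uses made-up values, and `spread` is a hypothetical helper written for this illustration:

```python
import numpy as np

def spread(values):
    """Return (sample standard deviation, mean absolute deviation)."""
    dists = values - np.mean(values)
    return np.std(values, ddof=1), np.mean(np.abs(dists))

# Both made-up data sets have mean 10
even = np.array([8.0, 8.0, 12.0, 12.0])     # every point is 2 away from the mean
uneven = np.array([6.0, 10.0, 10.0, 14.0])  # two points at the mean, two are 4 away

sd_even, mad_even = spread(even)
sd_uneven, mad_uneven = spread(uneven)

print(mad_even, mad_uneven)  # identical: total absolute distance is the same
print(sd_even, sd_uneven)    # SD is larger where the distances are longer
```

Because squaring weights the two distances of 4 more heavily than four distances of 2, the standard deviation separates the data sets while the mean absolute deviation does not.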

Quantiles

np.quantile(msleep['sleep_total'], 0.5)

10.1

0.5 quantile = median

Quartiles:

np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])

array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])

Boxplots use quartiles

import matplotlib.pyplot as plt
plt.boxplot(msleep['sleep_total'])
plt.show()

A boxplot of sleep_total.

Quantiles using np.linspace()

np.quantile(msleep['sleep_total'], [0, 0.2, 0.4, 0.6, 0.8, 1])

array([ 1.9 ,  6.24,  9.48, 11.14, 14.4 , 19.9 ])

np.linspace(start, stop, num)

np.quantile(msleep['sleep_total'], np.linspace(0, 1, 5))

array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])
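Note that `num` counts the split points, not the intervals, so splitting the data into n equal parts needs `num = n + 1`. A minimal sketch, using the integers 1 to 101 as made-up data so the results are easy to read:

```python
import numpy as np

# Made-up data: the integers 1 through 101
values = np.arange(1, 102)

# n equal intervals need n + 1 split points
quartile_points = np.linspace(0, 1, 5)  # 0, 0.25, 0.5, 0.75, 1
quintile_points = np.linspace(0, 1, 6)  # 0, 0.2, 0.4, 0.6, 0.8, 1

quartiles = np.quantile(values, quartile_points)
quintiles = np.quantile(values, quintile_points)

print(quartiles)
print(quintiles)
```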

Interquartile range (IQR)

Distance between the 25th and 75th percentiles, which is also the height of the box in a boxplot

np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)

5.9

from scipy.stats import iqr
iqr(msleep['sleep_total'])

5.9

Outliers

Outlier: data point that is substantially different from the others

How do we know what a substantial difference is? A data point is an outlier if:

$\text{data} < \text{Q1} - 1.5\times\text{IQR}$ or
$\text{data} > \text{Q3} + 1.5\times\text{IQR}$

Finding outliers

from scipy.stats import iqr
bodywt_iqr = iqr(msleep['bodywt'])  # separate name so the imported iqr() is not overwritten

lower_threshold = np.quantile(msleep['bodywt'], 0.25) - 1.5 * bodywt_iqr
upper_threshold = np.quantile(msleep['bodywt'], 0.75) + 1.5 * bodywt_iqr

msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]

                    name   vore  sleep_total    bodywt
4                    Cow  herbi          4.0   600.000
20        Asian elephant  herbi          3.9  2547.000
22                 Horse  herbi          2.9   521.000
...
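The same rule can be tried on a plain NumPy array. This is a sketch under stated assumptions: the `weights` values are made up rather than taken from `msleep`, and the IQR is computed directly from the quartiles instead of with `scipy.stats.iqr`.

```python
import numpy as np

# Made-up body weights (kg), not from msleep
weights = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 600.0])

# First and third quartiles, and the distance between them
q1, q3 = np.quantile(weights, [0.25, 0.75])
iqr_value = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles counts as an outlier
lower_threshold = q1 - 1.5 * iqr_value
upper_threshold = q3 + 1.5 * iqr_value

is_outlier = (weights < lower_threshold) | (weights > upper_threshold)
print(weights[is_outlier])
```

The boolean mask works the same way as the DataFrame filter: each comparison is wrapped in parentheses and the two conditions are combined with `|`.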

All in one go

msleep['bodywt'].describe()

count      83.000000
mean      166.136349
std       786.839732
min         0.005000
25%         0.174000
50%         1.670000
75%        41.750000
max      6654.000000
Name: bodywt, dtype: float64

Let's practice!
