Measures of spread

Introduction to Statistics in Python

Maggie Matsui

Content Developer, DataCamp

What is spread?

Two histograms: one that's narrow with data only spanning a few values, one that's wider with data spanning more values.
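
As a rough illustration of the idea, the sketch below builds two small made-up arrays (hypothetical values, not taken from the msleep data) that share the same mean but differ in how far their values spread out.

```python
import numpy as np

# Hypothetical toy data: both arrays have a mean of 10
narrow = np.array([9, 10, 10, 10, 11])
wide = np.array([2, 6, 10, 14, 18])

# The center is identical...
print(np.mean(narrow), np.mean(wide))

# ...but the range (max - min) shows how differently the values are spread
narrow_range = narrow.max() - narrow.min()
wide_range = wide.max() - wide.min()
print(narrow_range, wide_range)
```

The range is only the crudest measure of spread; the measures on the following slides use every data point rather than just the two extremes.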


Variance

Average squared distance from each data point to the data's mean

A dot plot of 7 data points with a red line in the middle representing mean. Arrows are drawn between each dot and the mean line.


Calculating variance

1. Subtract mean from each data point

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
print(dists)
0     1.666265
1     6.566265
2     3.966265
3     4.466265
4    -6.433735
      ...

2. Square each distance

sq_dists = dists ** 2
print(sq_dists)
0      2.776439
1     43.115837
2     15.731259
3     19.947524
4     41.392945
      ...

Calculating variance

3. Sum squared distances

sum_sq_dists = np.sum(sq_dists)
print(sum_sq_dists)
1624.065542

4. Divide by number of data points - 1

variance = sum_sq_dists / (83 - 1)
print(variance)
19.805677

Use np.var()

np.var(msleep['sleep_total'], ddof=1)
19.805677

Without ddof=1, population variance is calculated instead of sample variance:

np.var(msleep['sleep_total'])
19.567055
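
The four steps can be sketched end to end on a small array. This is a minimal sketch under stated assumptions: the values are hypothetical toy data (not msleep), and the number of data points comes from len(data) rather than a hardcoded count such as 83.

```python
import numpy as np

# Hypothetical toy data, chosen so the arithmetic is easy to check by hand
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

# 1. Subtract the mean from each data point
dists = data - np.mean(data)

# 2. Square each distance
sq_dists = dists ** 2

# 3. Sum the squared distances
sum_sq_dists = np.sum(sq_dists)

# 4. Divide by n - 1 (sample variance) or by n (population variance)
n = len(data)
sample_variance = sum_sq_dists / (n - 1)
population_variance = sum_sq_dists / n

# Both results agree with np.var() for the matching ddof
print(sample_variance, np.var(data, ddof=1))
print(population_variance, np.var(data))
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, which is why the two np.var() results above differ.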

Standard deviation

np.sqrt(np.var(msleep['sleep_total'], ddof=1))
4.450357
np.std(msleep['sleep_total'], ddof=1)
4.450357
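
Because the standard deviation is the square root of the variance, it is expressed in the same units as the data (hours for sleep_total), while the variance is in squared units. A short check on hypothetical toy data (not msleep) might look like this:

```python
import numpy as np

# Hypothetical toy data
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

sample_var = np.var(data, ddof=1)  # squared units
sample_sd = np.std(data, ddof=1)   # same units as the data

print(sample_var)
print(sample_sd)

# Squaring the standard deviation recovers the variance
print(np.isclose(sample_sd ** 2, sample_var))
```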

Mean absolute deviation

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])

np.mean(np.abs(dists))
3.566701

Standard deviation vs. mean absolute deviation

  • Standard deviation squares distances, penalizing longer distances more than shorter ones.
  • Mean absolute deviation penalizes each distance equally.
  • One isn't better than the other, but SD is more common than MAD.
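
One way to see the difference is to compare two hypothetical datasets whose absolute distances from the mean add up to the same total, but are distributed differently. In this sketch, mean_abs_dev is an illustrative helper name, not a NumPy function.

```python
import numpy as np

def mean_abs_dev(values):
    """Mean absolute deviation around the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - np.mean(values)))

# Hypothetical toy data: both arrays have mean 0
even = np.array([-2, -2, 2, 2])    # four medium distances
uneven = np.array([-4, 0, 0, 4])   # two long distances, two zero distances

mad_even = mean_abs_dev(even)
mad_uneven = mean_abs_dev(uneven)
sd_even = np.std(even, ddof=1)
sd_uneven = np.std(uneven, ddof=1)

# MAD treats the two datasets identically...
print(mad_even, mad_uneven)

# ...while SD is larger when the distance is concentrated in a few far points
print(sd_even, sd_uneven)
```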

Quantiles

np.quantile(msleep['sleep_total'], 0.5)
10.1

0.5 quantile = median

Quartiles:

np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])
array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])
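
The same calls can be tried on a tiny array where the answers are easy to verify by eye. The values below are hypothetical toy data, not msleep.

```python
import numpy as np

# Hypothetical toy data: the integers 1 through 9
data = np.arange(1, 10)

# The 0.5 quantile is the median
median_via_quantile = np.quantile(data, 0.5)
print(median_via_quantile, np.median(data))

# Quartiles: minimum, Q1, median, Q3, maximum
quartiles = np.quantile(data, [0, 0.25, 0.5, 0.75, 1])
print(quartiles)
```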

Boxplots use quartiles

import matplotlib.pyplot as plt
plt.boxplot(msleep['sleep_total'])
plt.show()

Figure: boxplot of sleep_total.


Quantiles using np.linspace()

np.quantile(msleep['sleep_total'], [0, 0.2, 0.4, 0.6, 0.8, 1])
array([ 1.9 ,  6.24,  9.48, 11.14, 14.4 , 19.9 ])

 

np.linspace(start, stop, num)

np.quantile(msleep['sleep_total'], np.linspace(0, 1, 5))
array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])
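
A detail that is easy to miss: num counts both endpoints, so splitting the data into k equal groups needs num = k + 1. The sketch below illustrates this on hypothetical toy data (the integers 0 to 100, not msleep).

```python
import numpy as np

# num=5 gives the 5 cut points for quartiles (4 groups)
quartile_points = np.linspace(0, 1, 5)
print(quartile_points)

# num=6 gives the 6 cut points for quintiles (5 groups)
quintile_points = np.linspace(0, 1, 6)
print(quintile_points)

# Hypothetical toy data: 0, 1, ..., 100
data = np.arange(0, 101)
quintiles = np.quantile(data, quintile_points)
print(quintiles)
```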

Interquartile range (IQR)

Height of the box in a boxplot

np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)
5.9
from scipy.stats import iqr
iqr(msleep['sleep_total'])
5.9
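
As a small variation, both quartiles can be requested in a single np.quantile() call and then subtracted. This sketch uses hypothetical toy data and NumPy only; it does not call scipy.stats.iqr.

```python
import numpy as np

# Hypothetical toy data: the integers 1 through 9
data = np.arange(1, 10)

# Ask for Q1 and Q3 together, then take the difference
q1, q3 = np.quantile(data, [0.25, 0.75])
iqr_value = q3 - q1
print(q1, q3, iqr_value)
```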

Outliers

Outlier: data point that is substantially different from the others

How do we know what a substantial difference is? A data point is an outlier if:

  • $\text{data} < \text{Q1} - 1.5\times\text{IQR}$    or
  • $\text{data} > \text{Q3} + 1.5\times\text{IQR}$

Finding outliers

from scipy.stats import iqr
bodywt_iqr = iqr(msleep['bodywt'])  # separate name so the iqr function is not shadowed

lower_threshold = np.quantile(msleep['bodywt'], 0.25) - 1.5 * bodywt_iqr
upper_threshold = np.quantile(msleep['bodywt'], 0.75) + 1.5 * bodywt_iqr
msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]
                    name   vore  sleep_total    bodywt
4                    Cow  herbi          4.0   600.000
20        Asian elephant  herbi          3.9  2547.000
22                 Horse  herbi          2.9   521.000
...
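
The same rule can be wrapped in a small reusable function. This is a sketch under assumptions: find_outliers and its k parameter are hypothetical names introduced for illustration, the data is a made-up array rather than msleep, and the quartiles come from np.quantile() with its default interpolation.

```python
import numpy as np

def find_outliers(values, k=1.5):
    """Apply the Q1 - k*IQR / Q3 + k*IQR rule (hypothetical helper).

    Returns the outlying values plus the lower and upper thresholds.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    spread = q3 - q1
    lower = q1 - k * spread
    upper = q3 + k * spread
    # Keep only values that fall outside either threshold
    mask = (values < lower) | (values > upper)
    return values[mask], lower, upper

# Hypothetical toy data with one obviously extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, lower, upper = find_outliers(data)
print(lower, upper)
print(outliers)
```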

All in one go

msleep['bodywt'].describe()
count      83.000000
mean      166.136349
std       786.839732
min         0.005000
25%         0.174000
50%         1.670000
75%        41.750000
max      6654.000000
Name: bodywt, dtype: float64
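
To connect this output to the earlier slides, the sketch below recomputes the same eight statistics by hand with NumPy. It is an approximation for illustration, not how pandas implements describe(): summarize is a hypothetical helper and the data is a made-up array. Note that the std row reported by describe() is the sample standard deviation (ddof=1).

```python
import numpy as np

def summarize(values):
    """Collect describe()-style statistics with NumPy (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q25, q50, q75 = np.quantile(values, [0.25, 0.5, 0.75])
    return {
        "count": values.size,
        "mean": np.mean(values),
        "std": np.std(values, ddof=1),  # sample standard deviation
        "min": np.min(values),
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": np.max(values),
    }

# Hypothetical toy data
summary = summarize([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name:>5}  {value:.6f}")
```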

Let's practice!
