Measures of spread

Introduction to Statistics in Python

Maggie Matsui

Content Developer, DataCamp

What is spread?

Two histograms: one that's narrow with data only spanning a few values, one that's wider with data spanning more values.


Variance

Average squared distance from each data point to the data's mean

A dot plot of 7 data points with a red line in the middle representing the mean. Arrows are drawn between each dot and the mean line.


Calculating variance

1. Subtract mean from each data point

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
print(dists)
0     1.666265
1     6.566265
2     3.966265
3     4.466265
4    -6.433735
      ...

2. Square each distance

sq_dists = dists ** 2
print(sq_dists)
0      2.776439
1     43.115837
2     15.731259
3     19.947524
4     41.392945
      ...

Calculating variance

3. Sum squared distances

sum_sq_dists = np.sum(sq_dists)
print(sum_sq_dists)
1624.065542

4. Divide by number of data points - 1

variance = sum_sq_dists / (83 - 1)  # msleep has 83 data points
print(variance)
19.805677

Use np.var()

np.var(msleep['sleep_total'], ddof=1)
19.805677

Without ddof=1, population variance is calculated instead of sample variance:

np.var(msleep['sleep_total'])
19.567055
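
The four steps above can be collected into one self-contained sketch. The values in `sleep_hours` are made up for illustration and are not taken from msleep; the point is only that the manual calculation agrees with np.var() for both choices of ddof.

```python
import numpy as np

# Hypothetical sleep totals in hours (illustrative, not from msleep)
sleep_hours = np.array([12.1, 17.0, 14.4, 14.9, 4.0, 10.1, 8.7])

# 1. Subtract the mean from each data point
dists = sleep_hours - np.mean(sleep_hours)
# 2. Square each distance
sq_dists = dists ** 2
# 3. Sum the squared distances
sum_sq_dists = np.sum(sq_dists)
# 4. Divide by the number of data points minus 1 (sample variance)
sample_var = sum_sq_dists / (len(sleep_hours) - 1)
# Dividing by the number of data points gives the population variance
pop_var = sum_sq_dists / len(sleep_hours)

print(sample_var, np.var(sleep_hours, ddof=1))  # the two values should match
print(pop_var, np.var(sleep_hours))             # the two values should match
```

Because the sample variance divides by a smaller number, it is always at least as large as the population variance computed from the same data.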

Standard deviation

np.sqrt(np.var(msleep['sleep_total'], ddof=1))
4.450357
np.std(msleep['sleep_total'], ddof=1)
4.450357

Mean absolute deviation

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])

np.mean(np.abs(dists))
3.566701

Standard deviation vs. mean absolute deviation

  • Standard deviation squares distances, penalizing longer distances more than shorter ones.
  • Mean absolute deviation penalizes each distance equally.
  • One isn't better than the other, but SD is more common than MAD.
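
A small sketch of the first two bullets, using two made-up datasets that share the same mean and the same mean absolute deviation. The helper `sd_and_mad` is hypothetical and exists only for this illustration.

```python
import numpy as np

def sd_and_mad(values):
    """Return (sample standard deviation, mean absolute deviation)."""
    values = np.asarray(values, dtype=float)
    dists = values - np.mean(values)
    return np.std(values, ddof=1), np.mean(np.abs(dists))

# Both datasets have mean 10 and absolute distances that sum to 8
even_dists = [8, 8, 12, 12]    # every point is 2 away from the mean
uneven_dists = [6, 10, 10, 14]  # two points are 4 away, two are 0 away

sd_a, mad_a = sd_and_mad(even_dists)
sd_b, mad_b = sd_and_mad(uneven_dists)

print(mad_a, mad_b)  # MAD treats the two datasets the same
print(sd_a, sd_b)    # SD is larger when the distance is concentrated in a few points
```

Squaring gives the two distances of 4 more weight than four distances of 2, so the standard deviation separates datasets that the mean absolute deviation cannot tell apart.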

Quantiles

np.quantile(msleep['sleep_total'], 0.5)
10.1

0.5 quantile = median

Quartiles:

np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])
array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])
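
As a quick check of the relationships above, the following sketch uses a short made-up array (not msleep) to show that the 0.5 quantile equals the median, and that the 0 and 1 quantiles are the minimum and maximum.

```python
import numpy as np

# Hypothetical values, used only for illustration
values = np.array([1.9, 7.0, 8.7, 10.1, 12.5, 14.4, 19.9])

median = np.quantile(values, 0.5)
quartiles = np.quantile(values, [0, 0.25, 0.5, 0.75, 1])

print(median, np.median(values))  # the 0.5 quantile and the median agree
print(quartiles)                  # min, Q1, median, Q3, max
```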

Boxplots use quartiles

import matplotlib.pyplot as plt
plt.boxplot(msleep['sleep_total'])
plt.show()

Figure: boxplot of sleep_total.


Quantiles using np.linspace()

np.quantile(msleep['sleep_total'], [0, 0.2, 0.4, 0.6, 0.8, 1])
array([ 1.9 ,  6.24,  9.48, 11.14, 14.4 , 19.9 ])

np.linspace(start, stop, num) returns num evenly spaced values from start to stop, inclusive:

np.quantile(msleep['sleep_total'], np.linspace(0, 1, 5))
array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])
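
One detail that is easy to get wrong: num counts the points, not the intervals, so splitting the data into k equal groups needs k + 1 points. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

# 5 points -> 4 intervals (quartiles)
quartile_probs = np.linspace(0, 1, 5)
# 6 points -> 5 intervals (quintiles)
quintile_probs = np.linspace(0, 1, 6)
# 11 points -> 10 intervals (deciles)
decile_probs = np.linspace(0, 1, 11)

print(quartile_probs)
print(quintile_probs)
print(decile_probs)
```

Any of these arrays can be passed as the second argument of np.quantile().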

Interquartile range (IQR)

Height of the box in a boxplot

np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)
5.9
from scipy.stats import iqr
iqr(msleep['sleep_total'])
5.9

Outliers

Outlier: data point that is substantially different from the others

How do we know what a substantial difference is? A data point is an outlier if:

  • $\text{data} < \text{Q1} - 1.5\times\text{IQR}$    or
  • $\text{data} > \text{Q3} + 1.5\times\text{IQR}$
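
The rule above can be sketched as a small function. The name `iqr_outlier_mask`, its parameter `k`, and the `body_weights` values are all hypothetical and chosen for illustration; only NumPy is used, so the IQR is computed directly from the quartiles.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean array that is True where a value is an outlier."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr_value = q3 - q1
    lower_threshold = q1 - k * iqr_value
    upper_threshold = q3 + k * iqr_value
    return (values < lower_threshold) | (values > upper_threshold)

# Hypothetical body weights in kg, with one very heavy animal
body_weights = np.array([0.5, 1.2, 2.0, 3.5, 4.1, 5.0, 600.0])

mask = iqr_outlier_mask(body_weights)
print(body_weights[mask])  # only the values flagged as outliers
```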

Finding outliers

from scipy.stats import iqr
# Store the result under a new name so the iqr function is not overwritten
bodywt_iqr = iqr(msleep['bodywt'])

lower_threshold = np.quantile(msleep['bodywt'], 0.25) - 1.5 * bodywt_iqr
upper_threshold = np.quantile(msleep['bodywt'], 0.75) + 1.5 * bodywt_iqr
msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]
                    name   vore  sleep_total    bodywt
4                    Cow  herbi          4.0   600.000
20        Asian elephant  herbi          3.9  2547.000
22                 Horse  herbi          2.9   521.000
...

All in one go

msleep['bodywt'].describe()
count      83.000000
mean      166.136349
std       786.839732
min         0.005000
25%         0.174000
50%         1.670000
75%        41.750000
max      6654.000000
Name: bodywt, dtype: float64
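
It may help to see how the describe() output lines up with the NumPy calls used earlier. In the sketch below the `weights` Series is made up for illustration; the std row corresponds to the sample standard deviation (ddof=1), and the percentage rows correspond to np.quantile().

```python
import numpy as np
import pandas as pd

# Hypothetical body weights in kg (illustrative, not from msleep)
weights = pd.Series([0.5, 1.2, 2.0, 3.5, 4.1, 5.0, 600.0], name="bodywt")

summary = weights.describe()
print(summary)

# The same figures computed with NumPy
values = weights.to_numpy()
sample_std = np.std(values, ddof=1)
q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])
print(sample_std, q1, q2, q3)
```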

Let's practice!
