Using z-scores for Anomaly Detection

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

What are z-scores?

  • Z-scores tell you:
    • how many standard deviations (STDs) a value lies from the mean

Example:

  • In a distribution with $\mu=10$ and $\sigma=3$:
    • $Z_{16.3}=(16.3-10) / 3=2.1$

The formula to calculate z-scores: $Z_x = (x - \mu) / \sigma$
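As a minimal sketch, the worked example can be reproduced in plain Python. The helper name `z_score` is illustrative only and not part of any library.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma


# Values from the example: mean 10, standard deviation 3
z = z_score(16.3, mu=10, sigma=3)
print(round(z, 2))
```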

The Empirical Rule and outliers

The Empirical Rule:

  • 68% is within one STD
  • 95% is within two STDs
  • 99.7% is within three STDs

Outliers:

  • outside the three STD limit
  • go into the tails (pink areas on the sides)

Figure: the Empirical Rule, shown as a normal distribution with its 68%, 95% and 99.7% regions annotated.

Image source: the Empirical Rule page on Wikipedia
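The three percentages of the Empirical Rule can be checked numerically. The sketch below assumes a standard normal distribution and uses `NormalDist` from Python's standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution lying within k standard deviations of the mean
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} STD: {share:.1%}")
```

Whatever falls outside the three-STD band, roughly 0.3% of a normal distribution, is what the z-score method treats as an outlier.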

Z-scores in code

from scipy.stats import zscore


scores = zscore(sales)

scores[:5]
0    0.910601
1   -1.018440
2   -0.049238
3    0.849103
4   -0.695373
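To make clear what `zscore` computes, here is a rough NumPy equivalent. It assumes the function's default settings (population standard deviation, `ddof=0`); `sales_demo` is a hypothetical array, not the `sales` data from the slides.

```python
import numpy as np

# Hypothetical sales figures, used only for illustration
sales_demo = np.array([120.0, 95.0, 130.0, 110.0, 105.0])

# Population standard deviation (ddof=0), matching zscore's default
manual_scores = (sales_demo - sales_demo.mean()) / sales_demo.std(ddof=0)
print(manual_scores.round(3))
```

By construction, the resulting scores have a mean of 0 and a standard deviation of 1.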

Z-scores in code

import numpy as np

is_over_3 = np.abs(scores) > 3

is_over_3[:5]
0    False
1    False
2    False
3    False
4    False

Z-scores in code

outliers = sales[is_over_3]

print(len(outliers))
90
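The three steps (score, threshold, filter) can be tried end to end on synthetic data. This is only a sketch: `sales_demo` is a made-up stand-in for `sales`, built from random normal values plus a few injected extremes, so the count it prints will not match the 90 above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for `sales`: 1,000 typical values plus 5 injected extremes
typical = rng.normal(loc=1000, scale=100, size=1000)
injected = np.full(5, 3000.0)
sales_demo = np.concatenate([typical, injected])

# Step 1: z-scores (population STD, as in zscore's default)
scores = (sales_demo - sales_demo.mean()) / sales_demo.std()

# Step 2: boolean mask of values more than three STDs from the mean
is_over_3 = np.abs(scores) > 3

# Step 3: keep only the flagged values
outliers = sales_demo[is_over_3]

print(len(outliers))
```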

Drawbacks of z-scores

  • Works best with normally distributed data
  • Mean and STD are themselves heavily influenced by outliers
  • Detection performance suffers when the data contains many outliers
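The second drawback is easy to see on a tiny, hypothetical sample. In the sketch below, one extreme value inflates both the mean and the STD so much that its own z-score stays below 3 and it goes undetected.

```python
from statistics import mean, pstdev

# Hypothetical sample: seven ordinary values and one extreme value
data = [10, 11, 9, 10, 12, 10, 11, 1000]

mu = mean(data)
sigma = pstdev(data)  # population STD, same convention as zscore's default
z_scores = [(x - mu) / sigma for x in data]

print(f"mean={mu:.1f}, std={sigma:.1f}")
print(f"z-score of 1000: {z_scores[-1]:.2f}")

# Apply the usual three-STD rule
flagged = [x for x, z in zip(data, z_scores) if abs(z) > 3]
print(flagged)
```

With only eight points, no value can reach a z-score above 3 under this convention, however extreme it is, which is why the list of flagged values comes back empty.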

Median Absolute Deviation (MAD)

  • Measures dispersion (variability)
  • More resilient to outliers
  • Uses median at its core

The formula to calculate the Median Absolute Deviation score: $\text{MAD} = \text{median}(|x_i - \text{median}(x)|)$
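A minimal sketch of the MAD computed by hand, using only the standard library. The sample is hypothetical; note how the single extreme value barely affects the result.

```python
from statistics import median

# Hypothetical sample: seven ordinary values and one extreme value
data = [10, 11, 9, 10, 12, 10, 11, 1000]

# Step 1: the median of the data
center = median(data)

# Step 2: absolute deviation of every point from that median
abs_deviations = [abs(x - center) for x in data]

# Step 3: the median of those deviations is the MAD
mad = median(abs_deviations)

print(center, mad)
```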

MAD score

from scipy.stats import median_abs_deviation

mad_score = median_abs_deviation(sales)

mad_score
1081.925

Introduction to PyOD

  • Modified z-scores with MAD are implemented in PyOD
  • PyOD - Python Outlier Detection library:
    • offers more than 40 algorithms
    • all algorithms have sklearn-like syntax
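For intuition, a modified z-score can be sketched by hand before turning to the library. The constant 0.6745 and the 3.5 cutoff follow the commonly cited Iglewicz and Hoaglin convention; that PyOD's estimator uses exactly this form is an assumption here, and `modified_z_scores` is an illustrative helper, not a PyOD function.

```python
from statistics import median


def modified_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD.

    Assumes the MAD is non-zero (more than half the values are not identical).
    """
    center = median(values)
    mad = median([abs(x - center) for x in values])
    return [0.6745 * (x - center) / mad for x in values]


# Hypothetical sample: seven ordinary values and one extreme value
data = [10, 11, 9, 10, 12, 10, 11, 1000]
scores = modified_z_scores(data)

flagged = [x for x, s in zip(data, scores) if abs(s) > 3.5]
print(flagged)
```

Because the median and the MAD barely move when an extreme value is present, that value ends up with a very large score and is flagged.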

Modified z-scores in code

from pyod.models.mad import MAD

# threshold defaults to 3.5
mad = MAD(threshold=3.5)


# Reshape sales
sales_reshaped = sales.values.reshape(-1, 1)

Modified z-scores in code

labels = mad.fit_predict(sales_reshaped)

print(labels.sum())
83
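PyOD estimators label inliers as 0 and outliers as 1, so the labels can double as a filter. The sketch below shows that last step with NumPy only; `sales_demo` and `labels_demo` are hypothetical stand-ins for `sales_reshaped` and the output of `fit_predict`.

```python
import numpy as np

# Hypothetical stand-ins for the reshaped sales data and the fit_predict output
sales_demo = np.array([[120.0], [95.0], [5000.0], [110.0], [105.0]])
labels_demo = np.array([0, 0, 1, 0, 0])  # 1 = outlier, 0 = inlier

# Summing the labels counts the outliers
print(labels_demo.sum())

# Comparing the labels to 1 gives a boolean mask for selecting them
outliers = sales_demo[labels_demo == 1].ravel()
print(outliers)
```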

Let's practice!
