What are anomalies and outliers?

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Inliers vs. outliers

  • Anomaly detection: detecting abnormal data points
  • Inliers:
    • "Normal" data points
    • Represent the majority
  • Outliers:
    • Occur very rarely
    • Statistically different

Two examples of outliers: a red tulip in a field of green, and a single black stick set apart from a group of white sticks.

Planet Earth as an anomaly

All planets are inliers

An image of the Milky Way galaxy seen from very far away.

Only Earth is an outlier

A curious image of planet Earth suggesting, with the atmosphere acting as a lens, that there is life on the planet.

Statistical definition

  • Outliers are abnormally different from the rest of the data
  • Their features differ significantly from those of inliers
  • The observer ultimately decides whether a data point is an outlier

Applications of anomaly detection

Image of the text "security" with a mouse pointer hovering over it.

Applications of anomaly detection

An image of a blue gloved hand holding a small chemical container with blue liquid in it.

Applications of anomaly detection

Image of a man handing his Visa card to a woman holding a checkout terminal.

Example data

import pandas as pd

numbers = pd.Series([24, 46, 30, 28, 1289, 25, 21, 31, 48, 47])

Affected mean and variance

Data with outlier removed

numbers_a = pd.Series([24, 46, ...])
numbers_a.mean()
33.33
numbers_a.var()
114.5

Data with outlier

numbers_b = pd.Series([1289, 24, ...])
numbers_b.mean()
158.9
numbers_b.var()
157771.65
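
The two comparisons above can be reproduced from the full `numbers` series defined earlier. The sketch below is one way to do it, assuming the "outlier removed" data is simply `numbers` without the value 1289. The median row is an addition for contrast and is not part of the slides.

```python
import pandas as pd

numbers = pd.Series([24, 46, 30, 28, 1289, 25, 21, 31, 48, 47])

# Drop the single extreme value (1289) to get the "outlier removed" series
without_outlier = numbers[numbers != 1289]

# Side-by-side comparison; the median row is an extra, added for contrast
summary = pd.DataFrame(
    {
        "with_outlier": [numbers.mean(), numbers.var(), numbers.median()],
        "without_outlier": [
            without_outlier.mean(),
            without_outlier.var(),
            without_outlier.median(),
        ],
    },
    index=["mean", "variance", "median"],
)
print(summary.round(2))
```

The mean and variance change drastically when the single extreme value is included, while the median barely moves (30.0 versus 30.5).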

Anomalies in training data

  • Anomalies create noise
  • May be mistaken for a new sub-group
  • Take emphasis away from the real patterns

Outlier vs. novelty detection

Outlier detection

  • Anomalies are already present in the training data; the task is to find them there

Novelty detection

  • The training data is clean; anomalies (novelties) appear only in new, unseen data
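
To make the novelty-detection setting concrete, here is a minimal sketch. It assumes a deliberately simple stand-in for a real detector: remember the range of values seen in clean training data, then flag new points that fall outside it. The `new_data` values are hypothetical and made up for illustration.

```python
import pandas as pd

# Clean training data: the example series from the slides without 1289
train = pd.Series([24, 46, 30, 28, 25, 21, 31, 48, 47])

# Hypothetical observations arriving after training (made up for illustration)
new_data = pd.Series([27, 1289, 40])

# Stand-in "model": the range of values observed during training
low, high = train.min(), train.max()

# Novelty detection: flag new points outside the range learned from clean data
is_novelty = (new_data < low) | (new_data > high)
novelties = new_data[is_novelty]
print(novelties.tolist())
```

In the outlier-detection setting, by contrast, the value 1289 would already be part of the data the detector is fitted on.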

5-number summary

import pandas as pd

big_mart = pd.read_csv("big_mart.csv")
sales = big_mart['sales']
sales.describe()
count     8523.000000
mean      2181.288914
std       1706.499616
min         33.290000
25%        834.247400
50%       1794.331000
75%       3101.296400
max      13086.964800
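
Note that `describe()` returns eight statistics, of which the five-number summary is the minimum, the three quartiles, and the maximum. Since `big_mart.csv` is not reproduced here, the sketch below uses the small `numbers` series from the earlier example to show two ways of pulling out just those five values.

```python
import pandas as pd

numbers = pd.Series([24, 46, 30, 28, 1289, 25, 21, 31, 48, 47])

# describe() returns eight statistics; keep only the five-number summary
five_num = numbers.describe().loc[["min", "25%", "50%", "75%", "max"]]
print(five_num)

# Equivalent route via quantile(): 0 is the minimum, 0.5 the median, 1 the maximum
same = numbers.quantile([0, 0.25, 0.5, 0.75, 1])
print(same)
```

A maximum far above the 75% mark, as in both this toy series and the sales output above, is a first hint that extreme values may be present.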

Plot a histogram

import numpy as np
import matplotlib.pyplot as plt

# Find the square root of the length of sales
n_bins = np.sqrt(len(sales))
# Cast to an integer
n_bins = int(n_bins)


# Plot
plt.figure(figsize=(8, 4))
plt.hist(sales, bins=n_bins, color='red')
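
The square-root rule for the bin count can be checked without the original file. The sketch below assumes a made-up, right-skewed sample of the same length as the sales column (8523 rows, per the `describe()` output) and uses `np.histogram` to compute the bin counts directly instead of drawing them.

```python
import numpy as np

# Made-up, right-skewed stand-in for the sales column (not the real data)
rng = np.random.default_rng(0)
fake_sales = rng.lognormal(mean=7.5, sigma=0.7, size=8523)

# Square-root rule, as on the slide: truncate sqrt(n) to an integer
n_bins = int(np.sqrt(len(fake_sales)))

# np.histogram returns the per-bin counts and the n_bins + 1 bin edges
counts, edges = np.histogram(fake_sales, bins=n_bins)
print(n_bins, counts.sum(), len(edges))
```

For 8523 observations this rule gives 92 bins. NumPy's built-in `bins='sqrt'` option rounds up rather than truncating, so it may produce one bin more.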

The resulting histogram

A histogram of product sales with red bins and a long right tail.

Plot a scatterplot

integers = range(len(sales))

plt.figure(figsize=(16, 8))
plt.scatter(integers, sales, c='red', alpha=0.5)

The resulting scatterplot

A scatterplot showing a cloud of dots that is dense in the bottom half and thins out towards the top.

Let's practice
