Anomaly detection

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Anomalies and outliers

Supervised

Two clusters shown in black, with some isolated red points.

Unsupervised

Two clusters shown in black with some isolated black points.

Anomalies and outliers

Two clusters shown in black with some isolated black points.

  • One of the two classes is very rare
  • Extreme case of dataset shift
  • Examples:
    • cybersecurity
    • fraud detection
    • anti-money laundering
    • fault detection

Unsupervised workflows

Two clusters shown in black with some isolated black points that are circled in red.

  • How to fit an algorithm without labels?
  • How to estimate its performance?

Careful use of a handful of labels:

  • too few for training without overfitting
  • just enough for model selection
  • forgo an unbiased estimate of accuracy

A dataset split into a chunk for training, one for selection and one for validation, with labels available only for selection.
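
The workflow above can be sketched as follows. This is a minimal illustration, not the course's own code: the synthetic data, the `sample` helper, the candidate contamination values, and the choice of F1 on the anomaly class as the selection metric are all assumptions, and local outlier factor is used only as a placeholder detector.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)


def sample(n_per_cluster, n_scattered):
    """Hypothetical data: two tight clusters plus uniformly scattered
    points that stand in for anomalies (labelled -1)."""
    clusters = np.vstack([
        rng.normal(loc=[-3.0, 0.0], scale=0.3, size=(n_per_cluster, 2)),
        rng.normal(loc=[3.0, 0.0], scale=0.3, size=(n_per_cluster, 2)),
    ])
    scattered = rng.uniform(low=-6.0, high=6.0, size=(n_scattered, 2))
    X = np.vstack([clusters, scattered])
    y = np.concatenate([
        np.ones(2 * n_per_cluster, dtype=int),
        -np.ones(n_scattered, dtype=int),
    ])
    return X, y


X_train, _ = sample(100, 5)    # labels discarded: fitting is unsupervised
X_sel, y_sel = sample(10, 4)   # the handful of labels, used only for selection

scores = {}
for c in [0.01, 0.02, 0.05, 0.1, 0.2]:
    # novelty=True lets the fitted detector score points it was not fitted on
    clf = LocalOutlierFactor(novelty=True, contamination=c).fit(X_train)
    y_pred = clf.predict(X_sel)
    # F1 on the anomaly class (-1), computed on the labelled chunk only
    scores[c] = f1_score(y_sel, y_pred, pos_label=-1, zero_division=0)

best = max(scores, key=scores.get)
print(scores, best)
```

Because the same labels pick the winner, the score of `best` is optimistically biased, which is the price noted in the last bullet.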

  • Outlier: a datapoint that lies outside the range of the majority of the data

Two clusters shown in black with some isolated black points. The point that is furthest away is circled in red.

  • Local outlier: a datapoint that lies in an isolated region without other data

Two clusters shown in black with some isolated black points. The isolated points that lie in between the two clusters are circled in red.
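
A small sketch may help separate the two definitions. It assumes synthetic data, and it uses a per-feature range check and a nearest-point distance as hypothetical stand-ins for "outside the range" and "isolated region"; neither is claimed to be the definition used by any particular library.

```python
import numpy as np

rng = np.random.RandomState(0)
# Two tight clusters centred at (-3, 0) and (3, 0)
clusters = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[3.0, 0.0], scale=0.3, size=(100, 2)),
])
far_point = np.array([12.0, 0.0])   # beyond the extent of both clusters
between = np.array([0.0, 0.0])      # in the empty region between them

lo, hi = clusters.min(axis=0), clusters.max(axis=0)


def outside_range(p):
    """True if any coordinate of p falls outside the observed data range."""
    return bool(np.any((p < lo) | (p > hi)))


def nearest_distance(p):
    """Euclidean distance from p to the closest cluster point."""
    return float(np.min(np.linalg.norm(clusters - p, axis=1)))


print(outside_range(far_point), outside_range(between))
print(nearest_distance(far_point), nearest_distance(between))
```

The point between the clusters passes the range check yet sits far from every other datapoint, which is why a purely range-based rule misses local outliers.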

Local outlier factor (LoF)

Two clusters shown in black with some isolated black points. One point that is in between the two clusters is circled in red, and its nearest neighbor, which lies near one of the clusters, is circled in blue.
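
The picture suggests comparing how isolated a point is with how isolated its neighbor is. The sketch below, on assumed synthetic data, computes a deliberately simplified one-neighbor ratio (`isolation_ratio`, a hypothetical name and not the actual LOF formula) and then the score scikit-learn's `LocalOutlierFactor` assigns, which applies the same idea over several neighbors.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[3.0, 0.0], scale=0.3, size=(100, 2)),
    [[0.0, 0.0]],   # the "red" point between the clusters (last row)
])

# Nearest neighbor of every point; calling kneighbors() with no argument
# excludes each point from its own neighbor list
nn = NearestNeighbors(n_neighbors=1).fit(X)
dist, idx = nn.kneighbors()

red = len(X) - 1
blue = idx[red, 0]   # the red point's nearest neighbor

# Simplified illustration: how much farther red is from blue than blue
# is from its own nearest neighbor
isolation_ratio = dist[red, 0] / dist[blue, 0]

# Negated LOF scores: close to -1 inside a cluster, much lower when isolated
lof_scores = LocalOutlierFactor(n_neighbors=20).fit(X).negative_outlier_factor_

print(isolation_ratio, lof_scores[red], np.median(lof_scores[:red]))
```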

Local outlier factor (LoF)

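from sklearn.metrics import confusion_matrix
from sklearn.neighbors import LocalOutlierFactor as lof
# Output below appears consistent with contamination=0.1,
# the default in older scikit-learn releases
clf = lof()
# fit_predict returns 1 for inliers and -1 for outliers
y_pred = clf.fit_predict(X)
y_pred[:4]
array([ 1,  1,  1, -1])
# Negated LOF scores; lower means more anomalous
clf.negative_outlier_factor_[:4]
array([-0.99, -1.02, -1.08, -0.97])
# Rows follow y_pred, columns follow ground_truth
confusion_matrix(y_pred, ground_truth)
array([[  5,  16],
       [  0, 184]])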

The same clusters with some isolated points. A large number of points are circled in red both from within the clusters and from the isolated ones.

Local outlier factor (LoF)

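# contamination: the expected proportion of outliers in X
clf = lof(contamination=0.02)
y_pred = clf.fit_predict(X)
# Rows follow y_pred, columns follow ground_truth
confusion_matrix(y_pred, ground_truth)
array([[  5,   0],
       [  0, 200]])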

The same clusters with some isolated points. Only isolated points are circled in red.
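
To see why the setting matters, the following sketch counts how many points are flagged as `contamination` varies. The data are synthetic stand-ins and the candidate values are arbitrary assumptions; note also that recent scikit-learn releases default to `contamination="auto"` rather than a fixed fraction.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# 200 clustered points plus 5 uniformly scattered points (205 in total)
X = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[3.0, 0.0], scale=0.3, size=(100, 2)),
    rng.uniform(low=-6.0, high=6.0, size=(5, 2)),
])

flagged = {}
for c in [0.02, 0.05, 0.1, 0.2]:
    y_pred = LocalOutlierFactor(contamination=c).fit_predict(X)
    # Count the points labelled -1 (outlier) at this contamination level
    flagged[c] = int(np.sum(y_pred == -1))

print(flagged)
```

The number of flagged points tracks `contamination` times the dataset size, so a value that overstates the true anomaly rate forces ordinary cluster points to be labelled as outliers.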

Who needs labels anyway!
