KNN for outlier detection

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Applications of KNN

  • Supervised:
    • Regression
    • Classification
  • Unsupervised:
    • Clustering
    • Outlier detection

Simplicity of KNN

What anomaly scores depend on:
  • Isolation Forest:
    • Tree depth
    • Sub-sample size
    • Many other components
  • KNN:
    • Only the distance between instances
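
To make the "only the distance" point concrete, here is a minimal sketch of a distance-based score. It assumes the score of a point is the distance to its k-th nearest neighbor; the helper name knn_scores and the sample points are hypothetical, and this is an illustration rather than the exact computation any particular library performs.

```python
import math

def knn_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: four points on a unit square plus one far-away point
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_scores(points, k=2)

# The point with the largest score is the most anomalous
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous, scores)
```

The far-away point ends up with by far the largest score, with no trees or sub-samples involved.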

Ansur Male Dataset

import pandas as pd

males = pd.read_csv("ansur_male.csv")
males.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4082 entries, 0 to 4081
Data columns (total 95 columns):
 #   Column                          Non-Null Count  Dtype
 0   abdominalextensiondepthsitting  4082 non-null   int64
 1   acromialheight                  4082 non-null   int64
 2   acromionradialelength           4082 non-null   int64
 3   anklecircumference              4082 non-null   int64
 4   axillaheight                    4082 non-null   int64
  ...

KNN in action

from pyod.models.knn import KNN

knn = KNN(contamination=0.01, n_jobs=-1)

knn.fit(males)

KNN with outlier probabilities

probs = knn.predict_proba(males)


# Use 55% threshold for filtering
is_outlier = probs[:, 1] > 0.55

# Isolate the outliers
outliers = males[is_outlier]

len(outliers)
13
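
One way to picture where such probabilities can come from is a min-max (linear) rescaling of the raw anomaly scores into the range 0 to 1, followed by the same 55% cutoff. The sketch below assumes that conversion; the helper scores_to_probs and the raw scores are made up for illustration and are not taken from the ANSUR data.

```python
def scores_to_probs(scores):
    """Rescale raw anomaly scores linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical raw anomaly scores (larger means more anomalous)
raw_scores = [1.0, 1.2, 0.9, 1.1, 5.0, 3.5]

probs = scores_to_probs(raw_scores)

# Use 55% threshold for filtering, as on the slide
is_outlier = [p > 0.55 for p in probs]
n_outliers = sum(is_outlier)
print(n_outliers)
```

Raising the threshold flags fewer points, and lowering it flags more, without refitting anything.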

The number of neighbors

# Rule of thumb: use k=20 when contamination is 10% or less
knn = KNN(n_neighbors=20, n_jobs=-1)
knn.fit(males)

probs = knn.predict_proba(males)

is_outlier = probs[:, 1] > .55
outliers = males[is_outlier]

len(outliers)
15
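
The choice of k changes the result because outliers that sit close to each other can pass for normal points when k is too small. The following is a rough sketch of that effect, reusing a k-th-nearest-neighbor distance as the score; the helper knn_scores and the points are hypothetical.

```python
import math

def knn_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: five normal points (indices 0-4) and two outliers
# that lie far from the rest but close to each other (indices 5-6)
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 10.5),
]

scores_k1 = knn_scores(points, k=1)  # each outlier finds the other one nearby
scores_k2 = knn_scores(points, k=2)  # the second neighbor is in the far cluster

print(scores_k1[5], scores_k1[0])
print(scores_k2[5], max(scores_k2[:5]))
```

With k=1 the paired outliers score lower than some normal points, while k=2 already separates them clearly, so k should be larger than the size of any outlier group expected in the data.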

A plot of a sample dataset with 8 instances, with A as an outlier and arrows between A and its 4 closest neighbors.


Features of KNN

A plot of a sample dataset with 8 instances, with A as an outlier and arrows between A and its 4 closest neighbors.


Drawbacks of KNN

  • Memorizes the dataset - memory-inefficient
  • Slow prediction stage
  • Sensitive to feature scales
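
The sensitivity to feature scales can be sketched with two features measured in very different units. Standardizing each feature (z-scores) is one common remedy and is an assumption here, not something the slides prescribe; the data and the helper names are hypothetical.

```python
import math
import statistics

def nn_scores(points):
    """Score each point by the distance to its single nearest neighbor."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]

def standardize(points):
    """Rescale every feature to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(p, means, stds))
        for p in points
    ]

# Hypothetical data: (a ratio between 0 and 1, a height in millimeters).
# The last point is unusual only in the first, small-scale feature.
data = [(0.10, 1700), (0.12, 1750), (0.11, 1800), (0.09, 1650), (0.95, 1720)]

raw = nn_scores(data)
scaled = nn_scores(standardize(data))

top_raw = max(range(len(data)), key=raw.__getitem__)
top_scaled = max(range(len(data)), key=scaled.__getitem__)
print(top_raw, top_scaled)
```

On the raw data the millimeter feature dominates every distance, so the unusual point is not ranked first; after standardization it is.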

Let's practice!
