Getting started with Isolation Forests

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Survey data

  • A sample respondent:
    • 12 years old
    • 160 cm tall
    • weighs 190 pounds

An image of a cartoon 12-year-old boy with a hat

Multivariate anomalies

Multivariate anomalies:

  • have two or more attributes
  • individual attribute values are not necessarily anomalous
  • are anomalous only when all attributes are considered together
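
To make the idea concrete, here is a minimal sketch with made-up height and weight pairs (not the survey data). The `suspect` respondent, the `z_score` helper, and the crude weight-minus-height feature are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Made-up (height_cm, weight_kg) pairs in which weight rises with height.
people = [(150, 45), (155, 51), (160, 54), (165, 61), (170, 64),
          (175, 71), (180, 74), (185, 81), (190, 84)]
suspect = (155, 82)  # hypothetical respondent: fairly short but heavy

def z_score(value, sample):
    """Number of sample standard deviations between value and the sample mean."""
    return (value - mean(sample)) / stdev(sample)

heights = [h for h, _ in people]
weights = [w for _, w in people]

# Looked at one attribute at a time, the suspect is unremarkable.
z_height = z_score(suspect[0], heights)
z_weight = z_score(suspect[1], weights)

# A crude joint feature (weight minus height) exposes the odd combination.
gaps = [w - h for h, w in people]
z_gap = z_score(suspect[1] - suspect[0], gaps)

print(f"height z = {z_height:.2f}, weight z = {z_weight:.2f}, joint z = {z_gap:.2f}")
```

Both single-attribute z-scores stay near 1 in magnitude, while the z-score of the combined feature is far beyond 3.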

Decision trees

A root node and one termination node of a decision tree that checks if 5 is prime

Decision trees

Fully grown decision tree with three levels that checks if 5 is prime

Isolation Trees

iTrees:

  • short for isolation trees
  • randomized versions of decision trees
  • splitting (branching) occurs randomly
  • a random split is more likely to fall in the gap between inliers and outliers
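
The last bullet can be checked with a small simulation. This is a sketch on made-up 1D data, assuming the split threshold is drawn uniformly between the minimum and maximum value; the variable names are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the estimate is repeatable

# Made-up 1D data: eight inliers between 0 and 1, one outlier at 10.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 10.0]
outlier = 10.0
inlier_max = max(v for v in values if v != outlier)

trials = 10_000
hits = 0
for _ in range(trials):
    # Isolation-tree style split: a uniform threshold between min and max.
    threshold = random.uniform(min(values), max(values))
    # The split separates the outlier from every inlier if it lands in the gap.
    if inlier_max <= threshold < outlier:
        hits += 1

gap_fraction = hits / trials
print(f"share of splits landing in the gap: {gap_fraction:.2f}")
```

The gap covers 9.2 of the 9.9 units between the minimum and maximum, so roughly 93% of the splits should isolate the outlier in a single step.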

Example 2D data

An example 2D dataset with 9 datapoints, two of which are outliers

Fitting an iTree

The 2D dataset split into two in the 2D cartesian plane by a green line

The root node of an iTree fitted to a 2D dataset

Fitting an iTree

The 2D dataset split into four in the 2D cartesian plane by two perpendicular lines

Two nodes of an iTree that finds 2 outliers

Fitting an iTree

The 2D dataset split into multiple parts in the 2D cartesian plane by 6 differently colored lines

The rest of an iTree being fitted to the 2D dataset

How points are classified

Points are classified as outliers if they:

  • end up close to the root node
  • that is, require fewer splits to be isolated

The rest of an iTree being fitted to the 2D dataset
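
The "fewer splits" rule can be sketched in plain Python. This is not the pyod implementation: `isolation_depth` is a hypothetical helper that follows one random path per trial, and the nine coordinates are made up to resemble the 2D example (seven clustered points, two outliers).

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits made until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    axis = rng.randrange(len(point))        # pick a random feature
    lo = min(row[axis] for row in data)
    hi = max(row[axis] for row in data)
    if lo == hi:                            # no spread on this feature: this sketch stops
        return depth
    threshold = rng.uniform(lo, hi)         # pick a random split value
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[axis] < threshold:
        same_side = [row for row in data if row[axis] < threshold]
    else:
        same_side = [row for row in data if row[axis] >= threshold]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

# Made-up data: seven clustered inliers followed by two outliers.
points = [(1.0, 1.2), (1.3, 0.9), (0.8, 1.1), (1.1, 1.4), (1.2, 1.0),
          (0.9, 0.8), (1.4, 1.3), (6.0, 6.5), (-4.0, 5.0)]

rng = random.Random(42)
n_trials = 500
avg_depth = {
    p: sum(isolation_depth(p, points, rng) for _ in range(n_trials)) / n_trials
    for p in points
}

# Points with the smallest average depth are the strongest outlier candidates.
for p, d in sorted(avg_depth.items(), key=lambda item: item[1]):
    print(p, round(d, 2))
```

Averaging over many random paths plays the role of the forest: a single tree can isolate an inlier early by chance, but the average depth of the two outliers stays well below that of the clustered points.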

US Airbnb data

import pandas as pd

airbnb_df = pd.read_csv("airbnb.csv")

US Airbnb data

airbnb_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
 0   minimum_nights                  10000 non-null  int64  
 1   number_of_reviews               10000 non-null  int64  
 2   reviews_per_month               10000 non-null  float64
 3   calculated_host_listings_count  10000 non-null  int64  
 4   availability_365                10000 non-null  int64  
 5   price                           10000 non-null  int64  
dtypes: float64(1), int64(5)

fit_predict

from pyod.models.iforest import IForest

iforest = IForest()
labels = iforest.fit_predict(airbnb_df)
print(labels)
array([0, 0, 0, ..., 1, 0, 0])

Filter outliers

outliers = airbnb_df[labels == 1]

print(outliers.shape)
(1000, 6)
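
The 1,000 flagged rows are exactly 10% of the 10,000 listings, which is consistent with pyod's `contamination` parameter at its default of 0.1. The sketch below is not pyod's code; it only illustrates, under that assumption, how a contamination fraction can turn anomaly scores into 0/1 labels. `label_by_contamination` and the scores are hypothetical.

```python
def label_by_contamination(scores, contamination=0.1):
    """Label the highest-scoring fraction of points 1 (outlier), the rest 0."""
    n_outliers = round(len(scores) * contamination)
    if n_outliers == 0:
        return [0] * len(scores)
    # The cutoff is the score of the last point that still counts as an outlier.
    # Tied scores at the cutoff would all be flagged.
    cutoff = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= cutoff else 0 for s in scores]

# Made-up anomaly scores for ten points (higher means more anomalous).
scores = [0.10, 0.20, 0.15, 0.90, 0.30, 0.25, 0.05, 0.40, 0.35, 0.12]

default_labels = label_by_contamination(scores)        # flags 10% of the points
looser_labels = label_by_contamination(scores, 0.3)    # flags 30% of the points
print(default_labels)
print(looser_labels)
```

Because the fraction is chosen up front, the number of flagged rows reflects that setting rather than how many anomalies the data actually contains.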

Let's practice!
