Getting started with Isolation Forests

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Survey data

A sample respondent:
- 12 years old
- 160 cm tall
- weighs 190 pounds

An image of a cartoon 12-year-old boy with a hat

Multivariate anomalies

Multivariate anomalies:

- have two or more attributes
- each attribute value may look normal on its own
- are anomalous only when all attributes are considered together
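
To make the sample respondent concrete, here is a minimal sketch. The survey rows are made up for illustration and are not the course data. It checks age and weight one at a time, then checks the weight against respondents of a similar age.

```python
# Hypothetical survey rows: (age in years, weight in pounds)
survey = [
    (10, 70), (11, 80), (12, 90), (13, 100), (14, 110),
    (30, 160), (35, 175), (40, 190), (45, 185), (50, 180),
]
respondent = (12, 190)  # the 12-year-old who weighs 190 pounds

ages = [age for age, _ in survey]
weights = [weight for _, weight in survey]

# Univariate view: each value lies inside the range seen for its own attribute
age_ok = min(ages) <= respondent[0] <= max(ages)
weight_ok = min(weights) <= respondent[1] <= max(weights)

# Multivariate view: compare the weight only with respondents of a similar age
# (the 2-year window is an arbitrary choice for this sketch)
similar_age_weights = [w for a, w in survey if abs(a - respondent[0]) <= 2]
combo_ok = min(similar_age_weights) <= respondent[1] <= max(similar_age_weights)

print(age_ok, weight_ok, combo_ok)  # True True False
```

Neither attribute is out of range by itself; only the combination is unusual.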

Decision trees

A root node and one termination node of a decision tree that checks if 5 is prime

Fully grown decision tree with three levels that checks if 5 is prime

Isolation Trees

iTrees:

- short for isolation trees
- randomized versions of decision trees
- splitting (branching) happens on a randomly chosen feature at a random threshold
- a random split is more likely to fall in the gap between inliers and outliers
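
Why a random split tends to land in the inlier/outlier gap can be checked with a small simulation. This is a sketch on a made-up 1D feature, assuming the threshold is drawn uniformly between the feature's minimum and maximum.

```python
import random

rng = random.Random(42)

# Hypothetical 1D feature: five inliers and one outlier
values = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
inlier_max, outlier = 5.0, 20.0

n_trials = 10_000
hits = 0
for _ in range(n_trials):
    # A random split: a threshold drawn uniformly over the feature's range
    threshold = rng.uniform(min(values), max(values))
    # The split isolates the outlier when it falls in the empty gap
    if inlier_max < threshold <= outlier:
        hits += 1

share_in_gap = hits / n_trials
# The gap spans 15 of the 19 units in the range, so the share is close to 15/19
print(round(share_in_gap, 2))
```

The wider the empty gap relative to the range, the more often a single split is enough to cut the outlier off.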

Example 2D data

An example 2D dataset with 9 datapoints, two of which are outliers

Fitting an iTree

The 2D dataset split into two in the 2D cartesian plane by a green line

The root node of an iTree fitted to a 2D dataset

The 2D dataset split into four in the 2D cartesian plane by two perpendicular lines

Two nodes of an iTree that finds 2 outliers

The 2D dataset split into multiple parts in the 2D cartesian plane by 6 differently colored lines

The rest of an iTree being fitted to the 2D dataset

How points are classified

Points are classified as outliers:

- if they are isolated close to the root node
- in other words, if they require fewer splits to isolate
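
The "fewer splits" idea can be sketched by counting how many random splits it takes to isolate a point. The helper `isolation_depth` and the nine points below are hypothetical, written for this illustration. This is not PyOD's implementation.

```python
import random

def isolation_depth(points, target, rng, depth=0):
    """Count the random splits needed until `target` is alone in its partition."""
    if len(points) <= 1:
        return depth
    # Only features whose values differ can separate the remaining points
    splittable = [f for f in range(len(target))
                  if min(p[f] for p in points) < max(p[f] for p in points)]
    if not splittable:
        return depth
    feature = rng.choice(splittable)
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    threshold = rng.uniform(low, high)
    # Keep only the side of the split that contains the target
    if target[feature] < threshold:
        side = [p for p in points if p[feature] < threshold]
    else:
        side = [p for p in points if p[feature] >= threshold]
    return isolation_depth(side, target, rng, depth + 1)

# Hypothetical 2D dataset: seven clustered inliers and two outliers
points = [(1.0, 1.2), (1.5, 0.9), (1.2, 1.6), (0.8, 1.4), (1.4, 1.3),
          (1.1, 1.0), (1.6, 1.5), (6.0, 5.5), (5.0, 0.2)]

def mean_depth(target, n_trees=500):
    """Average the isolation depth of `target` over many random trees."""
    rng = random.Random(0)
    return sum(isolation_depth(points, target, rng) for _ in range(n_trees)) / n_trees

outlier_depth = mean_depth((6.0, 5.5))
inlier_depth = mean_depth((1.4, 1.3))
print(outlier_depth < inlier_depth)  # the outlier is isolated in fewer splits
```

Averaging over many random trees is what turns a single noisy depth into a usable score, which is the "forest" part of an Isolation Forest.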

US Airbnb data

import pandas as pd

airbnb_df = pd.read_csv("airbnb.csv")

airbnb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
 0   minimum_nights                  10000 non-null  int64  
 1   number_of_reviews               10000 non-null  int64  
 2   reviews_per_month               10000 non-null  float64
 3   calculated_host_listings_count  10000 non-null  int64  
 4   availability_365                10000 non-null  int64  
 5   price                           10000 non-null  int64  
dtypes: float64(1), int64(5)

fit_predict

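from pyod.models.iforest import IForest

# Fit an Isolation Forest with default settings and label every row:
# 0 marks an inlier, 1 marks an outlier
iforest = IForest()
labels = iforest.fit_predict(airbnb_df)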

print(labels)

[0 0 0 ... 1 0 0]

Filter outliers

outliers = airbnb_df[labels == 1]

print(outliers.shape)

(1000, 6)
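
The 1,000 flagged rows are 10% of the 10,000 listings. That matches the share of outliers the detector is told to expect, which PyOD's IForest takes as its contamination parameter (0.1 by default). A minimal sketch of that thresholding idea, using made-up scores instead of the Airbnb data and not reproducing PyOD's internals:

```python
import random

rng = random.Random(0)

# Hypothetical anomaly scores for 10,000 rows (higher = more anomalous);
# distinct integers in random order stand in for real scores
scores = list(range(10_000))
rng.shuffle(scores)

contamination = 0.1  # assumed share of outliers in the data
n_outliers = int(len(scores) * contamination)

# Threshold: the highest score that is NOT flagged
threshold = sorted(scores, reverse=True)[n_outliers]

# Rows scoring above the threshold are labeled 1 (outlier), the rest 0
labels = [1 if score > threshold else 0 for score in scores]

print(sum(labels))  # 1000
```

Changing the assumed contamination changes how many rows are flagged, so the count of outliers reflects that setting as much as the data.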

Let's practice!
