Overview of Isolation Forest hyperparameters

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Most important hyperparameters

Hyperparameters which influence IForest the most:

contamination
n_estimators
max_samples
max_features

What is contamination?

How IForest classifies data points:

Raw anomaly scores are generated for every data point
contamination sets the expected proportion of outliers in the data
The points with the highest anomaly scores, up to that proportion, are labeled as outliers
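
The thresholding step can be sketched with NumPy alone. This is a minimal illustration, not the library's code: the scores are random stand-ins for real anomaly scores, and the percentile-based cutoff is an assumption about how a contamination value is typically turned into a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the raw anomaly scores an IForest would produce (higher = more anomalous)
scores = rng.normal(size=1000)

contamination = 0.05

# Assumed rule: the cutoff is the (1 - contamination) percentile of the scores
threshold = np.percentile(scores, 100 * (1 - contamination))

# 1 marks an outlier, 0 an inlier
labels = (scores > threshold).astype(int)

print(f"threshold={threshold:.3f}, outliers flagged={labels.sum()}")
```

With 1,000 distinct scores and contamination=0.05, the top 5% of scores end up flagged.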

Setting contamination

from pyod.models.iforest import IForest


# Accepts a value between 0 and 0.5
iforest = IForest(contamination=0.05)

What is n_estimators?

# More trees for larger datasets
iforest = IForest(n_estimators=1000)

iforest.fit(airbnb_df)

max_samples and max_features

iforest = IForest(n_estimators=200, max_samples=0.6, max_features=0.9)


iforest.fit(airbnb_df)
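
When max_samples and max_features are floats, they are read as fractions of the rows and columns. The sketch below shows the arithmetic only; the dataset shape is hypothetical (the real shape of airbnb_df is not given here), and the rounding rule is an assumption based on how scikit-learn documents these parameters.

```python
# Hypothetical dataset shape, for illustration only
n_rows, n_cols = 10_000, 20

max_samples = 0.6
max_features = 0.9

# Each tree is trained on a random subset of rows and columns
rows_per_tree = int(max_samples * n_rows)
cols_per_tree = max(1, int(max_features * n_cols))

print(f"rows per tree: {rows_per_tree}, columns per tree: {cols_per_tree}")
```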

Tree growth

iTrees:
- grow in a randomized fashion
- each split picks a random feature and a random value between that feature's min and max
- grow until:
  - all points are isolated
  - maximum depth is reached
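
The growth rules above can be sketched as a toy isolation tree. This is a simplified illustration, not the PyOD or scikit-learn implementation: the function names are made up for this example, and the path length omits the leaf-size adjustment used in the full algorithm.

```python
import math
import numpy as np

def grow_itree(X, depth, max_depth, rng):
    """Grow one toy isolation tree, returned as nested dicts."""
    n = X.shape[0]
    # Stop when the point is isolated or the depth limit is reached
    if n <= 1 or depth >= max_depth:
        return {"size": n}
    # Only features that still vary can be split
    spans = X.max(axis=0) - X.min(axis=0)
    candidates = np.flatnonzero(spans > 0)
    if candidates.size == 0:
        return {"size": n}
    feature = rng.choice(candidates)
    # Random split value between the feature's min and max
    split = rng.uniform(X[:, feature].min(), X[:, feature].max())
    left = X[:, feature] < split
    return {
        "feature": feature,
        "split": split,
        "left": grow_itree(X[left], depth + 1, max_depth, rng),
        "right": grow_itree(X[~left], depth + 1, max_depth, rng),
    }

def path_length(tree, x, depth=0):
    """Number of splits needed to reach a leaf for point x."""
    if "size" in tree:
        return depth
    branch = "left" if x[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)

rng = np.random.default_rng(0)
normal_points = rng.normal(size=(256, 2))
outlier = np.array([10.0, 10.0])
X = np.vstack([normal_points, outlier])

max_depth = math.ceil(math.log2(len(X)))
trees = [grow_itree(X, 0, max_depth, rng) for _ in range(100)]

center = np.array([0.0, 0.0])  # a typical point near the middle of the data
outlier_path = np.mean([path_length(t, outlier) for t in trees])
center_path = np.mean([path_length(t, center) for t in trees])
print(f"mean path length: outlier={outlier_path:.2f}, typical point={center_path:.2f}")
```

The outlier is separated after far fewer splits than a point in the dense region, which is the signal the anomaly score is built on.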

Max tree depth

Equals the base-2 logarithm of the sample size used per tree (max_samples), rounded up
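
A small sketch of the depth limit, assuming a base-2 logarithm rounded up, which is how scikit-learn's IsolationForest computes it. The helper name itree_max_depth is hypothetical.

```python
import math

def itree_max_depth(sample_size):
    """Depth limit for one tree, assuming ceil(log2(sample_size))."""
    return math.ceil(math.log2(max(sample_size, 2)))

for n in (256, 1000, 100_000):
    print(n, itree_max_depth(n))
```

The limit grows slowly: going from 256 to 100,000 samples per tree only about doubles the allowed depth.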

IForest advantages

Very efficient on large datasets
Works on subsamples: unlike many other algorithms, it doesn't need to see all normal instances
No statistical assumptions
Performs well out-of-the-box

Challenges of outlier detection

Supervised-learning models rely on metrics like RMSE or log loss
Outlier detection is an unsupervised-learning problem
Outlier classifiers should be combined with supervised-learning models so that their output can be evaluated with those metrics
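
One way to combine the two is to filter outliers first and then score a supervised model on what remains. The sketch below is an assumption-laden illustration: it uses scikit-learn's IsolationForest (which PyOD's IForest wraps) instead of PyOD, synthetic data instead of airbnb_df, and a hypothetical helper rmse_after_filtering.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data with a known linear relationship
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Corrupt the features of the first 25 rows (5%) so they act as outliers
X[:25] += rng.normal(scale=8.0, size=(25, 3))

def rmse_after_filtering(contamination):
    """Drop flagged outliers, fit a regressor, and return its test RMSE."""
    iso = IsolationForest(contamination=contamination, random_state=0)
    inlier_mask = iso.fit_predict(X) == 1  # fit_predict returns 1 for inliers, -1 for outliers
    X_train, X_test, y_train, y_test = train_test_split(
        X[inlier_mask], y[inlier_mask], random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

results = {c: rmse_after_filtering(c) for c in (0.01, 0.05, 0.1)}
for contamination, rmse in results.items():
    print(f"contamination={contamination}: RMSE={rmse:.3f}")
```

Because each contamination value leaves a different set of rows, the RMSE values are a rough guide for comparing settings rather than a strict benchmark.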

Let's practice!
