Overview of Isolation Forest hyperparameters

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Most important hyperparameters

Hyperparameters that influence IForest the most:

  • contamination
  • n_estimators
  • max_samples
  • max_features

What is contamination?

How IForest classifies data points:

  1. Raw anomaly scores are generated
  2. A threshold is set from contamination, the expected proportion of outliers
  3. The points whose anomaly scores fall in the top contamination fraction are labeled as outliers
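
The three steps above can be sketched with NumPy alone. This is an illustration of percentile-based thresholding, not PyOD's internal code; the scores are random stand-ins and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Stand-in for raw anomaly scores (higher = more anomalous); not real model output.
raw_scores = rng.normal(size=1000)

contamination = 0.05
# Cut-off chosen so that roughly a `contamination` share of scores lies above it.
threshold = np.percentile(raw_scores, 100 * (1 - contamination))
labels = (raw_scores > threshold).astype(int)  # 1 = outlier, 0 = inlier

print(labels.sum(), "of", len(labels), "points flagged")
```

Raising contamination lowers the cut-off, so more points are flagged; the scores themselves do not change.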

Setting contamination

from pyod.models.iforest import IForest


# Accepts a value between 0 and 0.5
iforest = IForest(contamination=0.05)

What is n_estimators?

# More trees for larger datasets
iforest = IForest(n_estimators=1000)

iforest.fit(airbnb_df)

max_samples and max_features

iforest = IForest(n_estimators=200, max_samples=0.6, max_features=0.9)


iforest.fit(airbnb_df)
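
When given as floats, max_samples and max_features are fractions of the rows and columns that each tree sees. The sketch below shows the arithmetic, assuming the counts are truncated the way scikit-learn's bagging code does it; the helper names and the 1000 x 10 dataset shape are hypothetical.

```python
def rows_per_tree(n_rows, max_samples):
    """Rows drawn for each tree when max_samples is a float fraction."""
    return int(max_samples * n_rows)


def features_per_tree(n_features, max_features):
    """Columns drawn for each tree when max_features is a float fraction."""
    return max(1, int(max_features * n_features))


# Hypothetical dataset with 1000 rows and 10 columns.
print(rows_per_tree(1000, 0.6))
print(features_per_tree(10, 0.9))
```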

Tree growth

  • iTrees:
    • grow in a randomized fashion
    • each split picks a random feature and a random value between that feature's min and max
    • grow until either:
      • all points are isolated
      • maximum depth is reached
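
The growth rules above can be sketched in plain Python. This is a simplified illustration, not the library implementation: it grows every tree on the full dataset (no subsampling), skips the path-length correction for unsplit leaves, and all names and data are hypothetical.

```python
import random


def grow_itree(points, depth, max_depth, rng):
    """Grow one isolation tree over a list of equal-length numeric tuples."""
    # Stop when the point is isolated (or nothing is left) or the depth cap is hit.
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    # Only features that still vary can separate the remaining points.
    splittable = [f for f in range(len(points[0]))
                  if min(p[f] for p in points) < max(p[f] for p in points)]
    if not splittable:
        return {"size": len(points)}
    feature = rng.choice(splittable)
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    split = rng.uniform(low, high)  # random threshold between feature min and max
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": grow_itree(left, depth + 1, max_depth, rng),
        "right": grow_itree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point):
    """Count the splits passed before `point` lands in a leaf."""
    depth = 0
    while "split" in tree:
        side = "left" if point[tree["feature"]] < tree["split"] else "right"
        tree = tree[side]
        depth += 1
    return depth


rng = random.Random(0)
# Hypothetical data: 63 clustered points plus one obvious outlier.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(63)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

trees = [grow_itree(data, 0, max_depth=6, rng=rng) for _ in range(100)]
outlier_depth = sum(path_length(t, outlier) for t in trees) / len(trees)
inlier_depth = sum(path_length(t, (0.0, 0.0)) for t in trees) / len(trees)
print(outlier_depth, inlier_depth)
```

The outlier is separated after fewer splits on average than a point in the middle of the cluster, which is the signal IForest turns into an anomaly score.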

Max tree depth

  • Equals the base-2 logarithm of the per-tree sample size (max_samples), rounded up
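
A small worked example of the depth limit, assuming the scikit-learn convention of ceil(log2(sample size)); the helper name is hypothetical.

```python
import math


def max_tree_depth(sample_size):
    """Depth cap for one tree: base-2 log of the per-tree sample size, rounded up."""
    return math.ceil(math.log2(max(sample_size, 2)))


for n in (100, 256, 1000):
    print(n, "samples per tree ->", max_tree_depth(n), "levels at most")
```

Because the cap grows logarithmically, even large per-tree samples produce shallow trees, which keeps training fast.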

IForest advantages

  • Very efficient on large datasets
  • Doesn't need to see all normal instances, unlike other algorithms
  • No statistical assumptions
  • Performs well out-of-the-box

Challenges of outlier detection

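  • Supervised-learning models can be evaluated with metrics like RMSE or log loss
  • Outlier detection is an unsupervised-learning problem, so there are no labels to score against
  • Outlier classifiers should therefore be combined with supervised-learning models, whose metrics show whether dropping the flagged points helps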

Let's practice!
