How to deal with found outliers

Anomaly Detection in Python

Bekhruz (Bex) Tuychiev

Kaggle Master, Data Science Content Creator

Applications of anomaly detection

  • Medicine
  • Cyber security
  • Fraud detection

Perform two analyses - with and without outliers.
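
One way to run the two analyses side by side is sketched below. The sample values and the 1.5 * IQR rule used to flag outliers are assumptions made for illustration; any detector could take its place.

```python
import pandas as pd

# Hypothetical sample: one extreme value among otherwise similar readings
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="reading")

# Flag outliers with the 1.5 * IQR rule (an assumed choice of detector)
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Summarize the data twice: once as-is, once with flagged points excluded
comparison = pd.DataFrame({
    "with_outliers": values.describe(),
    "without_outliers": values[~is_outlier].describe(),
})
print(comparison.loc[["count", "mean", "std"]])
```

If the conclusions differ sharply between the two columns, the outliers are driving the result and deserve a closer look before any are dropped.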


The reasons for outlier presence

  • Data entry errors:
    • Typos
    • Measurement errors
    • Human mistakes
    • Drop unless fixed
  • Sampling errors:
    • Not from the target distribution
    • Drop
  • Natural:
    • Naturally odd but comes from the population
    • Do not drop

Drop based on magnitude

  • Too few: confirm and drop
  • Too many: raises suspicion - use different models:
    • GLMs
    • Quantile Regression
    • GEEs
  • Forms a cluster: perform deeper analysis

Trimming

# Calculate the percentiles
percentile_first = google['Volume'].quantile(0.01)
percentile_99th = google['Volume'].quantile(0.99)


# Trim
google['Volume'] = google['Volume'].clip(percentile_first, percentile_99th)
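
Note that `clip` caps extreme values at the percentile bounds and keeps every row. If the goal is to drop the extreme rows instead, filtering is one alternative. The sketch below contrasts the two on a hypothetical stand-in for the `google` DataFrame; the data and the names `low`, `high`, `capped`, and `trimmed` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `google` DataFrame used above
google = pd.DataFrame({"Volume": [float(v) for v in range(1, 101)]})

low = google["Volume"].quantile(0.01)
high = google["Volume"].quantile(0.99)

# Option 1: cap extreme values at the bounds; the row count is unchanged
capped = google["Volume"].clip(low, high)

# Option 2: drop rows outside the bounds; the row count shrinks
trimmed = google[google["Volume"].between(low, high)]

print(len(capped), len(trimmed))
```

Capping preserves the sample size, while dropping removes the extreme observations entirely, so the choice depends on whether those rows carry information worth keeping.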

Replacing

google.replace(0, 100, inplace=True)
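
Calling `replace` on the whole DataFrame swaps the value in every column. When only one column holds the suspect values, a column-level replacement may be safer. The sketch below assumes a hypothetical `google` DataFrame with an extra `Change` column in which zero is a legitimate value.

```python
import pandas as pd

# Hypothetical data: zeros in 'Volume' are errors, zeros in 'Change' are real
google = pd.DataFrame({
    "Volume": [0, 250, 0, 400],
    "Change": [0, 5, -3, 0],
})

# Replace zeros only in 'Volume', leaving 'Change' untouched
google["Volume"] = google["Volume"].replace(0, 100)

print(google)
```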

Let's practice!
