Handling outliers

Intermediate Predictive Analytics in Python

Nele Verbiest

Senior Data Scientist @PythonPredictions

Influence of outliers on predictive models

Causes of outliers

  • Human errors
  • Measuring errors
  • Truly extreme values
  • ...

Winsorization concept

Winsorization in Python

from scipy.stats.mstats import winsorize

# Cap the lowest 5% and the highest 1% of the values
basetable["variable_winsorized"] = winsorize(
    basetable["variable"],
    limits=[0.05, 0.01],
)
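
To see what the limits do, here is a minimal sketch on made-up data. The toy basetable with the values 1 to 100 is an assumption for illustration only, not data from the course. With limits of 0.05 and 0.01, the smallest 5% of values should be raised to the nearest remaining value and the largest 1% lowered in the same way.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical toy data (assumption): the values 1.0 to 100.0
basetable = pd.DataFrame({"variable": np.arange(1, 101, dtype=float)})

# limits = [lower fraction, upper fraction] of values to cap
winsorized = winsorize(basetable["variable"].to_numpy(), limits=[0.05, 0.01])

# winsorize returns a masked array; convert it before storing it as a column
basetable["variable_winsorized"] = np.asarray(winsorized)

# Compare the range before and after winsorization
print(basetable["variable"].min(), basetable["variable"].max())
print(basetable["variable_winsorized"].min(), basetable["variable_winsorized"].max())
```

Winsorization replaces extreme values instead of dropping them, so the number of rows stays the same.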

Standard deviation method concept

Standard deviation method in Python

# Limits at three standard deviations from the mean
mean_age = basetable["age"].mean()
sd_age = basetable["age"].std()
lower_limit = mean_age - 3 * sd_age
upper_limit = mean_age + 3 * sd_age

# Cap ages at the limits; clip keeps the index of basetable
basetable["age_no_outliers"] = basetable["age"].clip(
    lower=lower_limit, upper=upper_limit
)
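
As a rough, self-contained illustration of the standard deviation method, the sketch below uses invented ages with one extreme value of 250. Both the data and the non-default index are assumptions for illustration. It uses pandas `Series.clip` to cap values at the limits, which keeps the rows aligned even when the index does not start at 0.

```python
import pandas as pd

# Hypothetical toy data (assumption): 20 plausible ages and one extreme value
ages = [30, 35, 40, 45, 50] * 4 + [250]
basetable = pd.DataFrame({"age": ages}, index=range(100, 121))

# Limits at three standard deviations from the mean
mean_age = basetable["age"].mean()
sd_age = basetable["age"].std()
lower_limit = mean_age - 3 * sd_age
upper_limit = mean_age + 3 * sd_age

# Values outside the limits are replaced by the nearest limit
basetable["age_no_outliers"] = basetable["age"].clip(
    lower=lower_limit, upper=upper_limit
)

# Show only the rows that were changed
changed = basetable[basetable["age"] != basetable["age_no_outliers"]]
print(changed)
```

Only the extreme age is changed here; all other ages fall inside the limits and are left as they are.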

Let's practice!
