Data outliers and scaling

Practicing Machine Learning Interview Questions in Python

Lisa Stuart

Data Scientist

Outliers

  • One or more observations that are distant from the rest of the observations in a given feature.

[Figure: data outliers]

1 https://bolt.mph.ufl.edu/6050-6052/unit-1/one-quantitative-variable-introduction/understanding-outliers/

Inter-quartile range (IQR)

[Figure: data IQR]

1 By Jhguch at en.wikipedia, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=14524285
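A minimal sketch of the 1.5 × IQR rule, assuming NumPy is available. The values are made up for illustration, with 95 planted as an obvious outlier:

```python
import numpy as np

# Hypothetical feature values; 95 is planted as an obvious outlier.
values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 95], dtype=float)

# First and third quartiles, and the inter-quartile range between them.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is a suspected outlier.
is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
print(outliers)
```

Only the planted value falls outside the fences here; real data may call for a different multiplier or a domain-specific rule.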

Line of best fit

[Figure: linear model fit]

1 https://www.r-bloggers.com/outlier-detection-and-treatment-with-r/

Outlier functions

Function                                  Returns
sns.boxplot(x= , y='Loan Status')         boxplot conditioned on target variable
sns.distplot()                            histogram and kernel density estimate (kde)
np.abs()                                  absolute value
stats.zscore()                            calculated z-scores
mstats.winsorize(limits=[0.05, 0.05])     floor and ceiling applied to outliers
np.where(condition, true, false)          replaced values
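The non-plotting functions above can be combined roughly as follows. This is a sketch, assuming NumPy and SciPy are installed; the data, the z-score cutoff of 3, and the choice of the median as a replacement value are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

# Hypothetical feature with 20 observations; 95 is a planted outlier.
values = np.array([10, 12, 13, 14, 14, 15, 15, 15, 16, 16,
                   16, 17, 17, 17, 18, 18, 19, 20, 21, 95], dtype=float)

# Absolute z-scores; a cutoff of 3 is a common convention (assumption).
z = np.abs(stats.zscore(values))
flagged = values[z > 3]

# Winsorize: with 20 points and limits of 5% per tail, the single lowest
# and single highest values are clipped to their nearest neighbours.
winsorized = np.asarray(mstats.winsorize(values, limits=[0.05, 0.05]))

# Alternative treatment: replace flagged points with the median.
replaced = np.where(z > 3, np.median(values), values)

print(flagged)
print(winsorized.min(), winsorized.max())
print(replaced[-1])
```

Winsorizing keeps every row but caps the extremes, while np.where lets you substitute any value you choose. Which treatment is appropriate depends on the data and the model.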

High vs low variance

[Figure: variance]

1 https://machinelearningmastery.com/a-gentle-introduction-to-calculating-normal-summary-statistics/

Standardization vs normalization

  • Standardization:
    • Z-score standardization
    • Scales to mean 0 and sd 1

Formula (z-score standardization): z = (x - μ) / σ, where μ is the feature mean and σ its standard deviation

  • Normalization:
    • Min/max normalizing
    • Scales to the range [0, 1]

Formula (min-max scaling): x' = (x - min(x)) / (max(x) - min(x))

1 https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc
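Both transformations can be worked by hand. A minimal sketch with NumPy on a made-up feature (the values are hypothetical):

```python
import numpy as np

# Hypothetical feature: mean 5, population standard deviation 2.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Z-score standardization: subtract the mean, divide by the sd.
z = (x - x.mean()) / x.std()

# Min/max normalization: subtract the minimum, divide by the range.
x_scaled = (x - x.min()) / (x.max() - x.min())

print(z.mean(), z.std())               # approximately 0 and 1
print(x_scaled.min(), x_scaled.max())  # 0.0 and 1.0
```

Note that x.std() uses the population standard deviation (ddof=0) by default, which is also what StandardScaler uses.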

Scaling functions

  • sklearn.preprocessing.StandardScaler() --> (mean=0, sd=1)
  • sklearn.preprocessing.MinMaxScaler() --> [0, 1]
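A short sketch of both scalers, assuming scikit-learn is installed. The two-column matrix is hypothetical and chosen so the features sit on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

# Each column is rescaled independently.
X_std = StandardScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))        # ~0 and 1 per column
print(X_minmax.min(axis=0), X_minmax.max(axis=0))   # 0 and 1 per column
```

In a train/test workflow, the usual practice is to call fit on the training data only and then transform on both sets, so that test-set statistics do not leak into the scaling.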

Outliers and scaling

How should outliers be identified and properly dealt with? What effect does min/max scaling or z-score standardization have on data? Select the statement that is true:

  • An outlier is a point that is just outside the range of similar points in a feature.
  • In a given context, outliers considered anomalous are helpful in building a predictive ML model.
  • Min/max scaling gives data a mean of 0, an SD of 1, and increases variance.
  • Z-score standardization scales data to be in the interval (0,1) and improves model fit.

Outliers and scaling: answer

How should outliers be identified and properly dealt with? What effect does min/max scaling or z-score standardization have on data? The correct answer is:

  • In a given context, outliers considered anomalous are helpful in building a predictive ML model. (Data anomalies are common in fraud detection, cybersecurity events, and other scenarios where the goal is to find them.)

Outliers and scaling: incorrect answers

How should outliers be identified and properly dealt with? What effect does min/max scaling or z-score standardization have on data?

  • An outlier is a point that is just outside the range of similar points in a feature. (A point is not suspected of being an outlier until it lies more than 1.5 × IQR below the first quartile or above the third quartile.)
  • Min/max scaling gives data a mean of 0, an SD of 1, and increases variance. (Min/max scaling rescales data to the interval [0, 1]; whether variance increases or decreases depends on the original data.)
  • Z-score standardization scales data to be in the interval (0,1) and improves model fit. (Z-score standardization scales the data to have a mean of 0 and an SD of 1, which can improve model fit.)
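The point about variance under min/max scaling can be checked numerically. A small sketch with two hypothetical features, one with a range below 1 and one with a range above 1 (minmax is a helper defined here for illustration):

```python
import numpy as np

def minmax(x):
    """Rescale a 1-D array to [0, 1] (illustrative helper)."""
    return (x - x.min()) / (x.max() - x.min())

narrow = np.array([0.1, 0.2, 0.3, 0.4])     # range 0.3, smaller than 1
wide = np.array([10.0, 20.0, 30.0, 40.0])   # range 30, larger than 1

# Scaling divides the variance by range squared, so it grows when the
# range is below 1 and shrinks when the range is above 1.
print(narrow.var(), minmax(narrow).var())
print(wide.var(), minmax(wide).var())
```

Both features end up with the same variance after scaling because they have the same shape; only the direction of the change differs.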

One last thing...

[Figure: preprocessing steps]


Let's practice!
