How to detect covariate shift

Monitoring Machine Learning Concepts

Hakim Elakhrass

Co-founder and CEO of NannyML

Multivariate drift detection

  • Looks for changes in joint distribution

  • Uses the PCA algorithm for data compression

  • Uses reconstruction error as a measure of drift

The image shows a multivariate drift detection workflow in which the multidimensional data is first compressed into a latent space and then decompressed back to its original form, with some reconstruction error.

The graph illustrates how the reconstruction error fluctuates over time.
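The compress-then-decompress workflow can be sketched with NumPy. This is a minimal illustration, not the course's implementation: the helper names `fit_pca` and `reconstruction_error`, the standardization step, the synthetic data, and the use of mean Euclidean distance as the error are all assumptions made for the example.

```python
import numpy as np

def fit_pca(reference, n_components):
    """Learn the mean, scale, and principal axes from the reference data."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (reference - mean) / std
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    return mean, std, vt[:n_components]

def reconstruction_error(data, mean, std, components):
    """Mean Euclidean distance between each row and its PCA reconstruction."""
    scaled = (data - mean) / std
    latent = scaled @ components.T   # compress to the latent space
    restored = latent @ components   # decompress back to feature space
    return float(np.linalg.norm(scaled - restored, axis=1).mean())

rng = np.random.default_rng(0)
n = 1000

# Reference data: features 0 and 1 are strongly correlated, feature 2 is independent.
x = rng.normal(size=n)
reference = np.column_stack([x, x + 0.1 * rng.normal(size=n), rng.normal(size=n)])

# Drifted data: each feature looks similar on its own, but the correlation is gone.
drifted = np.column_stack([rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)])

mean, std, components = fit_pca(reference, n_components=2)
ref_error = reconstruction_error(reference, mean, std, components)
drift_error = reconstruction_error(drifted, mean, std, components)
print(f"reference: {ref_error:.3f}, drifted: {drift_error:.3f}")
```

In this synthetic setup the individual feature distributions barely move, yet the reconstruction error rises because the relationship between the features has changed.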

Univariate drift detection

Types of variables:

  • Categorical - variables whose values fall into distinct groups, such as marital status, smoking status, or level of education

  • Continuous - variables that can take any of an infinite number of real values within a given interval, such as height, weight, distance, or time

Continuous methods - Jensen-Shannon

  • Measures the similarity of two distributions

  • Range [0, 1]

  • Catches meaningful low-magnitude drifts

The image shows a change in distribution as measured by the Jensen-Shannon distance.
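A rough sketch of the Jensen-Shannon distance for a continuous feature, using only the standard library. The binning helper `to_histogram`, the bin range, and the Gaussian samples are assumptions for illustration; base-2 logarithms are used so that the result stays within [0, 1].

```python
import math
import random

def to_histogram(values, low, high, bins=10):
    """Relative frequencies of values over equal-width bins on [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        index = min(max(int((v - low) / width), 0), bins - 1)
        counts[index] += 1
    return [c / len(values) for c in counts]

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping empty bins of the first argument.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
small_shift = [rng.gauss(0.3, 1.0) for _ in range(5000)]
large_shift = [rng.gauss(3.0, 1.0) for _ in range(5000)]

ref_hist = to_histogram(reference, -4, 7)
small_js = js_distance(ref_hist, to_histogram(small_shift, -4, 7))
large_js = js_distance(ref_hist, to_histogram(large_shift, -4, 7))
print(small_js, large_js)
```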

Continuous methods - Wasserstein

  • The minimum effort needed to transform one distribution into another

  • Range [0, +inf)

  • Sensitive to outliers

The image shows a change in distribution as measured by the Wasserstein distance.
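For one-dimensional samples of equal size, the "minimum effort" reduces to the average gap between the sorted values. The sketch below relies on that simplifying assumption; the function name `wasserstein_distance_1d` and the toy data are hypothetical.

```python
def wasserstein_distance_1d(a, b):
    """First Wasserstein distance between two equal-size 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    # Pair the i-th smallest value of a with the i-th smallest value of b.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

reference = list(range(10))               # 0, 1, ..., 9
shifted = [v + 2 for v in reference]      # every value moves by 2
with_outlier = reference[:-1] + [1000]    # a single extreme value

shift_distance = wasserstein_distance_1d(reference, shifted)
outlier_distance = wasserstein_distance_1d(reference, with_outlier)
print(shift_distance)    # 2.0
print(outlier_distance)  # 99.1
```

A single outlier dominates the result here, which illustrates the sensitivity noted on the slide.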

Continuous methods - Kolmogorov-Smirnov

  • Maximum distance between the two cumulative distribution functions

  • Range [0, 1]

  • Prone to false positives

The image shows a change in distribution as measured by the Kolmogorov-Smirnov distance.
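The Kolmogorov-Smirnov statistic can be computed directly from the two empirical cumulative distribution functions. A minimal standard-library sketch follows; the function name `ks_statistic` and the sample values are assumptions for illustration.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
print(overlapping)  # 0.5
print(disjoint)     # 1.0
```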

Continuous methods - Hellinger

  • Overlap between distributions
  • Range [0, 1]
  • Doesn't differentiate between strong shifts
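A small sketch of the Hellinger distance on binned distributions, written to show the saturation effect: once two distributions no longer overlap, the distance stays at 1 however far apart they are. The function name `hellinger_distance` and the toy histograms are assumptions for illustration.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions over the same bins."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

reference = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
moderate = [0.0, 0.5, 0.5, 0.0, 0.0, 0.0]  # half of the mass still overlaps
strong = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]    # no overlap
stronger = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]  # no overlap, shifted even further

moderate_h = hellinger_distance(reference, moderate)
strong_h = hellinger_distance(reference, strong)
stronger_h = hellinger_distance(reference, stronger)
print(moderate_h, strong_h, stronger_h)
```

The last two values come out essentially equal even though the second shift is larger.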

 

The image shows a change in distribution as measured by the Hellinger distance.

Continuous methods - Recommendation

  • Jensen-Shannon and Wasserstein generally perform well

Categorical methods - Chi-squared

  • Sensitive to changes in low-frequency categories

The image shows a visualization of the chi-squared statistic for a categorical variable with two categories, a and b.
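The sensitivity to low-frequency categories can be illustrated with the Pearson chi-squared statistic computed on a table of category counts. This is a sketch under assumptions: the function name `chi_squared_statistic` and the counts for categories a and b are made up for the example.

```python
def chi_squared_statistic(reference_counts, analysis_counts):
    """Pearson chi-squared statistic for a 2 x k table of category counts."""
    categories = sorted(set(reference_counts) | set(analysis_counts))
    rows = [
        [counts.get(c, 0) for c in categories]
        for counts in (reference_counts, analysis_counts)
    ]
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Count expected if both periods shared the same category mix.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic

# The same two-percentage-point change, once in balanced categories
# and once in a rare category.
balanced_stat = chi_squared_statistic({"a": 500, "b": 500}, {"a": 480, "b": 520})
rare_stat = chi_squared_statistic({"a": 990, "b": 10}, {"a": 970, "b": 30})
print(balanced_stat, rare_stat)
```

The statistic is much larger in the second case, because the change is large relative to the expected count of the rare category.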

Categorical methods - L-infinity

  • Identifies the most significant shift across all categories

The image shows a visualization of the L-Infinity method for a categorical variable with three categories, a, b, and c.
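The L-infinity distance is the largest absolute change in relative frequency across the categories. A minimal sketch, in which the function name `l_infinity` and the counts for categories a, b, and c are assumptions for illustration:

```python
def l_infinity(reference_counts, analysis_counts):
    """Return the category with the largest change in relative frequency, and that change."""
    ref_total = sum(reference_counts.values())
    ana_total = sum(analysis_counts.values())
    categories = sorted(set(reference_counts) | set(analysis_counts))
    changes = {
        c: abs(reference_counts.get(c, 0) / ref_total - analysis_counts.get(c, 0) / ana_total)
        for c in categories
    }
    worst = max(changes, key=changes.get)
    return worst, changes[worst]

reference_counts = {"a": 50, "b": 30, "c": 20}
analysis_counts = {"a": 20, "b": 35, "c": 45}

category, distance = l_infinity(reference_counts, analysis_counts)
print(category, distance)
```

Here category a moves from 50% to 20% of the data, which is the largest change of the three, so it determines the distance.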

Categorical methods - Jensen-Shannon and Hellinger

  • Use Jensen-Shannon or L-Infinity when dealing with many categories
  • Use L-Infinity distance to detect changes in individual categories

The image shows a visualization of the Jensen-Shannon and Hellinger methods for a categorical variable with three categories, a, b, and c.

Let's practice!
