What is covariate shift?

Monitoring Machine Learning Concepts

Hakim Elakhrass

Co-founder and CEO of NannyML

Definitions

  • covariate variables $=$ input features
  • $P(X)$ changes
  • joint probability $P(Y|X)$ remains the same
  • changes in the joint distribution of the covariates
Monitoring Machine Learning Concepts

Why joint probability distribution?

A graph illustrates both positive and negative correlations between feature one and feature two. The positive correlation is observed at week 10, represented by the color blue, whereas the negative correlation is observed at week 16, indicated by the color red.

Monitoring Machine Learning Concepts

Why does covariate shift occur?

Potential reasons for covariate shift:

  • The real world is not stationary - patterns and trends evolve
  • Changes in data sources - variations in how data is collected between testing and production
  • Evolution of the system and environment
Monitoring Machine Learning Concepts

How does covariate shift occur?

Dynamics of the changes in the distribution:

  • Sudden

 

  • Gradual

 

  • Seasonal

 

The image visually depicts a sudden change in the data distribution. Initially, the distribution is represented by blue points, but after a certain period of time, it abruptly transitions to red points.

The image illustrates a gradual change in the data distribution. Initially, the distribution is represented by blue points, but over time, it gradually transitions to red points before eventually returning to blue. Initially, only one point changes color, but after some time, the entire distribution switches to red.

The image illustrates a seasonal change in the data distribution. Initially, the distribution is represented by blue points, but after some time, it suddenly transitions to red points before eventually returning to blue again. The cycle repeats, showcasing the periodic nature of the data distribution.

Monitoring Machine Learning Concepts

How to detect the covariate shift?

Univariate method

A distribution of production values for each month from september to march. There are seven distributions in total. From September to December, distributions are similar and with small changes within acceptable changes, hence why there are colored blue. From January to March, the distributions are shrinking with the mean moving towards lower values, resulting in red color since the changes are more significant.

Multivariate method

A multivariate drift detection workflow, where the multidimensional data is first compressed to the latent space and decompressed to the initial form with a certain error.

1 https://app.datacamp.com/learn/courses/dimensionality-reduction-in-python
Monitoring Machine Learning Concepts

Let's practice!

Monitoring Machine Learning Concepts

Preparing Video For Download...