Data drift

End-to-End Machine Learning

Joshua Stapleton

Machine Learning Engineer

The need for data drift detection

A graph showing the change in heart disease distribution at various ages over time - today, there are fewer instances of heart disease, and they occur at an older age

The Kolmogorov-Smirnov test

  • Commonly used for detecting data drift
  • Compares the empirical distributions of two samples to test whether they come from the same underlying distribution

Two graphs showing the difference between the distribution of an initial training dataset, and a newer inference dataset which has undergone data drift
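
For reference, the statistic behind the two-sample test can be written as follows. The symbols are notation introduced here, not taken from the slides: F_{1,n} and F_{2,m} stand for the empirical cumulative distribution functions of the training sample (size n) and the inference sample (size m).

```latex
D_{n,m} = \sup_{x} \left| F_{1,n}(x) - F_{2,m}(x) \right|
```

D_{n,m} is the largest vertical gap between the two empirical CDFs. A larger gap gives a smaller p-value, which is stronger evidence that the two samples do not share a distribution.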

Using the ks_2samp() function

  • The ks_2samp() function returns two values: the test statistic and the p-value.
  • Use the p-value to decide whether to reject the null hypothesis that both samples come from the same distribution.
from scipy.stats import ks_2samp
# load the 1D data distribution samples for comparison
sample_1, sample_2 = training_dataset_sample, current_inference_sample
# perform the KS-test - ensure input samples are numpy arrays
test_statistic, p_value = ks_2samp(sample_1, sample_2)
if p_value < 0.05:
    print("Reject null hypothesis - data drift might be occurring")
else:
    print("No significant evidence of drift - samples may come from the same distribution")
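
The snippet above assumes the two samples are already loaded. As a minimal self-contained sketch, the version below generates synthetic samples so it can be run directly. The `detect_drift` helper, the `alpha` parameter, and the numeric values (means of 50 and 58, standard deviation of 10) are illustrative assumptions, not part of the course material.

```python
import numpy as np
from scipy.stats import ks_2samp


def detect_drift(reference_sample, current_sample, alpha=0.05):
    """Hypothetical helper: run the two-sample KS test on two 1D samples.

    Returns (drift_detected, statistic, p_value).
    """
    statistic, p_value = ks_2samp(reference_sample, current_sample)
    return bool(p_value < alpha), float(statistic), float(p_value)


rng = np.random.default_rng(seed=42)

# Synthetic stand-ins for the real samples (illustrative values only):
# the inference sample is shifted towards higher values.
training_dataset_sample = rng.normal(loc=50, scale=10, size=1000)
current_inference_sample = rng.normal(loc=58, scale=10, size=1000)

drifted, statistic, p_value = detect_drift(
    training_dataset_sample, current_inference_sample
)
print(f"drift detected: {drifted} (statistic={statistic:.3f}, p-value={p_value:.3g})")

# Sanity check: a sample compared with itself should never be flagged.
same, _, _ = detect_drift(training_dataset_sample, training_dataset_sample)
print(f"drift detected against itself: {same}")
```

The 0.05 threshold is a convention rather than a rule. With very large samples the test can flag differences too small to matter in practice, so it may help to look at the size of the test statistic as well as the p-value.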

Correcting data drift

Update model to account for new data

  • Retrain model
  • Re-adjust / update model parameters

Not enough new/inference data?

  • Retrain model on a mixed dataset of old and new data
  • Increase the share of new data as more becomes available

A flowchart showing the process of retraining and redeploying a model which is experiencing data drift

A diagram showing how to periodically retrain a model on increasing amounts of new data as new data becomes available over time
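
One way to build such a mixed dataset is sketched below. It assumes NumPy arrays for features and labels. The `build_mixed_dataset` function and its `new_fraction` parameter are hypothetical names introduced for illustration, and keeping all new rows while subsampling old ones is only one of several reasonable strategies.

```python
import numpy as np


def build_mixed_dataset(old_X, old_y, new_X, new_y, new_fraction=0.5, seed=0):
    """Combine all new rows with a random subset of old rows.

    new_fraction is the approximate share of new data in the result.
    """
    if not 0 < new_fraction <= 1:
        raise ValueError("new_fraction must be in (0, 1]")
    rng = np.random.default_rng(seed)

    # Number of old rows needed so new rows make up new_fraction of the mix,
    # capped at the number of old rows actually available.
    n_new = len(new_X)
    n_old = int(round(n_new * (1 - new_fraction) / new_fraction))
    n_old = min(n_old, len(old_X))

    keep = rng.choice(len(old_X), size=n_old, replace=False)
    X = np.concatenate([old_X[keep], new_X])
    y = np.concatenate([old_y[keep], new_y])

    # Shuffle so old and new rows are interleaved.
    order = rng.permutation(len(X))
    return X[order], y[order]


rng = np.random.default_rng(1)
old_X, old_y = rng.normal(size=(1000, 3)), rng.integers(0, 2, size=1000)
new_X, new_y = rng.normal(loc=0.5, size=(100, 3)), rng.integers(0, 2, size=100)

# Early on, new data is scarce, so it makes up a quarter of the training set.
X_mix, y_mix = build_mixed_dataset(old_X, old_y, new_X, new_y, new_fraction=0.25)
print(X_mix.shape, y_mix.shape)
```

Raising `new_fraction` at each retraining cycle gives the gradual shift towards new data shown in the diagram. Setting it to 1 retrains on new data only.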

Further resources for detecting and rectifying data drift

Let's practice!
