Testing data

Developing Machine Learning Models for Production

Sinan Ozdemir

Data Scientist, Entrepreneur, and Author

Data validation and schema tests

  • Data validation tests look for missing values, inconsistencies, and abnormal data
    • E.g. detecting outliers
  • Schema tests look for expected data formats and data types
    • E.g. make sure "time to value" is an integer measuring seconds and not minutes
  • Tools like Great Expectations help automate this process
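
The checks above can be sketched in plain Python, without any validation library. This is a minimal illustration and not how Great Expectations works internally: the records, the `time_to_value` field name, the helper functions, and the outlier rule (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) are all assumptions made for the example.

```python
from statistics import median

# Hypothetical records; "time_to_value" is expected to be an integer number of seconds.
records = [
    {"user_id": 1, "time_to_value": 42},
    {"user_id": 2, "time_to_value": 38},
    {"user_id": 3, "time_to_value": None},   # missing value
    {"user_id": 4, "time_to_value": 45},
    {"user_id": 5, "time_to_value": 2400},   # abnormally large value
    {"user_id": 6, "time_to_value": 40.5},   # wrong type: float instead of int
]


def find_missing(rows, field):
    """Data validation: return the ids of rows where the field is missing."""
    return [r["user_id"] for r in rows if r.get(field) is None]


def find_schema_violations(rows, field, expected_type):
    """Schema test: return the ids of rows whose value has the wrong type."""
    return [
        r["user_id"]
        for r in rows
        if r.get(field) is not None and type(r[field]) is not expected_type
    ]


def find_outliers(rows, field, cutoff=3.5):
    """Data validation: flag values whose modified z-score exceeds the cutoff."""
    numeric = [r for r in rows if isinstance(r.get(field), (int, float))]
    values = [r[field] for r in numeric]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [
        r["user_id"]
        for r in numeric
        if 0.6745 * abs(r[field] - center) / mad > cutoff
    ]


missing_ids = find_missing(records, "time_to_value")
schema_ids = find_schema_violations(records, "time_to_value", int)
outlier_ids = find_outliers(records, "time_to_value")
print(missing_ids, schema_ids, outlier_ids)
```

A dedicated tool becomes worthwhile once checks like these need to be declared once and rerun automatically on every new batch of data.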


Beyond simple testing

  • Data and schema tests check for basic issues
  • More advanced tests, such as expectation tests, look for subtler issues
    • Checking if values are in a certain range
    • Checking for known patterns/trends
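
As a rough sketch of these two kinds of checks, the snippet below flags values outside an allowed range and points where a series that should never decrease does so. The function names, the sample data, and the choice of a non-decreasing trend as the "known pattern" are assumptions for illustration only.

```python
def out_of_range(values, low, high):
    """Return the indices of values outside the closed interval [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def breaks_non_decreasing_trend(values):
    """Return the indices where a value drops below its predecessor."""
    return [i for i in range(1, len(values)) if values[i] < values[i - 1]]


# Hypothetical data: ages should fall in [0, 120]; a cumulative count should never drop.
ages = [34, 27, -1, 58, 140]
cumulative_signups = [10, 15, 15, 22, 19, 30]

bad_ages = out_of_range(ages, 0, 120)
trend_breaks = breaks_non_decreasing_trend(cumulative_signups)
print(bad_ages)      # [2, 4]
print(trend_breaks)  # [4]
```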

Expectation tests

  • A type of data validation test
  • Make sure data matches certain "expectations" as defined by the user or system
    • E.g. expecting time spent on website to be around 4 minutes with a standard deviation of 1 minute
    • E.g. expecting dates of a patient's past visits to be before the current time
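
The two examples above could be expressed as small, reusable checks. The sketch below uses only the standard library; the helper names, the sample values, the fixed "current time", and the tolerance bands around the expected mean and standard deviation are assumptions that would need tuning for real data.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def expect_mean_between(values, low, high):
    """Expectation: the sample mean lies in [low, high]."""
    return low <= mean(values) <= high


def expect_stdev_between(values, low, high):
    """Expectation: the sample standard deviation lies in [low, high]."""
    return low <= stdev(values) <= high


def expect_all_before(timestamps, cutoff):
    """Expectation: every timestamp is strictly earlier than the cutoff."""
    return all(t < cutoff for t in timestamps)


# Hypothetical time spent on the website, in minutes (expected: about 4, std about 1).
session_minutes = [3.1, 4.2, 5.0, 3.8, 4.4, 2.9, 5.3, 4.1]
mean_ok = expect_mean_between(session_minutes, 3.5, 4.5)
stdev_ok = expect_stdev_between(session_minutes, 0.5, 1.5)

# Hypothetical visit dates, checked against a fixed "current time" for reproducibility.
now = datetime(2024, 1, 15)
past_visits = [now - timedelta(days=30), now - timedelta(days=400)]
visits_with_error = past_visits + [now + timedelta(days=2)]  # a visit dated in the future

past_ok = expect_all_before(past_visits, now)
error_ok = expect_all_before(visits_with_error, now)
print(mean_ok, stdev_ok, past_ok, error_ok)
```

In production the cutoff would typically be the actual current time rather than a fixed date.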

Feature importance tests

  • Feature importance tests identify the most important features in an ML model
  • Example: Permutation importance
    • Randomly permutes a feature's values and measures how much model performance decreases
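
To make the mechanics concrete, here is a from-scratch sketch of the idea in plain Python. It is not scikit-learn's implementation: it measures the increase in mean squared error rather than the drop in an estimator's score, and the function name `permutation_importance_sketch`, the toy `predict` function standing in for a trained model, and the synthetic data are all assumptions for illustration.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=42):
    """Return, per feature, the mean increase in MSE after shuffling that feature."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_permuted])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def predict(row):
    """Toy stand-in for a trained model: uses feature 0 and ignores feature 1."""
    return 3.0 * row[0]


# Synthetic data in which the target depends only on feature 0.
data_rng = random.Random(0)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(50)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance_sketch(predict, X, y)
print(importances)
```

Shuffling the ignored feature leaves every prediction unchanged, so its importance is zero, while shuffling the feature the model relies on increases the error.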


Example of permutation importance

Set up:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Train a random forest classifier (assuming X_train and y_train already exist)
model = RandomForestClassifier().fit(X_train, y_train)

Running our permutation importance test:

# Calculate feature importances using permutation importance
results = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
# Print the mean importance of each feature across the repeats
feature_names = ['feature_1', 'feature_2', 'feature_3', ...]  # placeholder: one name per column of X_test
importances = results.importances_mean
for name, importance in zip(feature_names, importances):
    print(f'{name}: {importance:.4f}')

Looking for data drift

Data drift

  • Sometimes called feature drift
  • A change in the distribution of a model's input data
  • E.g. users start using a new term for a new product, and a chatbot that has never seen the term is left scratching its head
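
One way to quantify a change in an input distribution, for a numeric feature, is the Population Stability Index (PSI). The slides do not name a specific metric, so PSI is a technique swapped in here for illustration; the bin edges, the sample values, and the small `eps` used to avoid taking the logarithm of zero are assumptions.

```python
import math


def histogram_proportions(values, edges):
    """Share of values per bin [edges[i], edges[i+1]); out-of-range values go to the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(reference, current, edges, eps=1e-6):
    """PSI = sum over bins of (current - reference) * ln(current / reference)."""
    ref = histogram_proportions(reference, edges)
    cur = histogram_proportions(current, edges)
    return sum((c - r) * math.log((c + eps) / (r + eps)) for r, c in zip(ref, cur))


# Hypothetical feature values at training time and in production.
training_values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
production_values = [4, 5, 5, 6, 6, 6, 7, 7, 8, 8]
edges = [0, 2, 4, 6, 8, 10]

psi_same = population_stability_index(training_values, training_values, edges)
psi_shifted = population_stability_index(training_values, production_values, edges)
print(psi_same, psi_shifted)
```

Identical distributions give a PSI of zero and the value grows as the distributions diverge. The level that should trigger an alert is a judgment call to tune per feature.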

Label drift

  • A change in the label distribution
  • E.g. users shift from requesting returns to asking about the status of their returns
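
A simple way to measure such a shift is to compare label proportions between two time periods. The sketch below uses total variation distance, which is one possible choice and not one prescribed by the slides; the label names and counts are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its share of the total."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(reference_labels, current_labels):
    """Half the sum of absolute differences in label shares (0 = identical, 1 = disjoint)."""
    ref = label_distribution(reference_labels)
    cur = label_distribution(current_labels)
    all_labels = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.get(l, 0.0) - cur.get(l, 0.0)) for l in all_labels)


# Hypothetical intent labels from a support chatbot in two consecutive months.
last_month = ["request_return"] * 70 + ["return_status"] * 30
this_month = ["request_return"] * 30 + ["return_status"] * 70

drift = total_variation_distance(last_month, this_month)
print(round(drift, 3))  # approximately 0.4
```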

Let's practice!
