Testing data

Developing Machine Learning Models for Production

Sinan Ozdemir

Data Scientist, Entrepreneur, and Author

Data validation and schema tests

Data validation tests look for missing values, inconsistencies, and abnormal data
- E.g. detecting outliers
Schema tests look for expected data formats and data types
- E.g. make sure "time to value" is an integer measuring seconds and not minutes
Tools like Great Expectations help automate this process
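
As a rough illustration of the checks above, here is a minimal sketch in plain pandas rather than Great Expectations. The column names, the sample values, and the 1.5 * IQR outlier rule are assumptions made for the example, not part of the course material.

```python
import pandas as pd

# Hypothetical sample data; "time_to_value" is meant to hold whole seconds.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "time_to_value": [30, 45, None, 38, 41, 3600],
})

# Data validation: count missing values per column.
missing_counts = df.isna().sum()

# Data validation: flag outliers with the 1.5 * IQR rule (one common heuristic).
values = df["time_to_value"].dropna()
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Schema test: the column should have an integer dtype.
# The missing value makes pandas store the column as float64, so this fails here.
schema_ok = pd.api.types.is_integer_dtype(df["time_to_value"])

print(missing_counts["time_to_value"], outliers.tolist(), schema_ok)
```

A dtype check alone cannot tell seconds from minutes. The outlier check is what would hint that a value was recorded in the wrong unit.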


Beyond simple testing

Data and schema tests check for basic issues
More advanced tests, like expectation tests, look for subtler issues
- Checking if values are in a certain range
- Checking for known patterns/trends
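
The two kinds of checks listed above might look like the following sketch. The record fields, the age limits, and the order ID format are all hypothetical and chosen only to illustrate a range check and a pattern check.

```python
import re

# Hypothetical records; field names and limits are illustrative assumptions.
records = [
    {"order_id": "ORD-00123", "age": 34},
    {"order_id": "ORD-00124", "age": 212},
    {"order_id": "order125", "age": 51},
]

AGE_RANGE = (0, 120)                          # assumed plausible range
ORDER_ID_PATTERN = re.compile(r"ORD-\d{5}")   # assumed ID format


def find_violations(rows):
    """Return (row index, field) pairs that break a range or pattern check."""
    violations = []
    for i, row in enumerate(rows):
        if not AGE_RANGE[0] <= row["age"] <= AGE_RANGE[1]:
            violations.append((i, "age"))
        if ORDER_ID_PATTERN.fullmatch(row["order_id"]) is None:
            violations.append((i, "order_id"))
    return violations


violations = find_violations(records)
print(violations)
```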

Expectation tests

A type of data validation test
Make sure data matches certain "expectations" as defined by the user or system
- E.g. expecting time spent on website to be around 4 minutes with a standard deviation of 1 minute
- E.g. expecting dates of a patient's past visits to be before the current time
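
The two examples above could be written as simple checks like the ones below. This is a sketch using only the standard library. The function names, the sample values, and the 0.5-minute tolerance are assumptions introduced for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def check_time_on_site(minutes, expected_mean=4.0, expected_std=1.0, tolerance=0.5):
    """Expect the sample mean and standard deviation to sit within
    `tolerance` (an assumed threshold) of their expected values."""
    return (abs(mean(minutes) - expected_mean) <= tolerance
            and abs(stdev(minutes) - expected_std) <= tolerance)


def check_visits_in_past(visit_dates, now=None):
    """Expect every past visit to be dated before the current time."""
    now = now or datetime.now()
    return all(visit < now for visit in visit_dates)


# Hypothetical observations.
minutes_on_site = [3.0, 4.5, 5.0, 2.8, 4.2, 3.9, 5.4, 3.6]
now = datetime(2024, 1, 1)
past_visits = [now - timedelta(days=30), now - timedelta(days=400)]

time_ok = check_time_on_site(minutes_on_site)
visits_ok = check_visits_in_past(past_visits, now=now)
print(time_ok, visits_ok)
```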

Feature importance tests

Feature importance tests identify the most important features in an ML model
Example: Permutation importance
- Randomly permutes a feature's values and measures how much model performance decreases
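
To make the mechanics concrete, here is a from-scratch sketch of the idea in NumPy. It is not scikit-learn's implementation. The synthetic data, the least-squares stand-in model, and the helper names are assumptions, and for brevity it scores on the training data, where a real test would use held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends strongly on
# column 0, weakly on column 1, and not at all on column 2.
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Stand-in model: ordinary least squares fitted with NumPy.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def mse(features, targets):
    """Mean squared error of the fitted linear model."""
    return float(np.mean((features @ coef - targets) ** 2))


def permutation_importance_sketch(features, targets, n_repeats=10):
    """Mean increase in MSE when each column is shuffled in turn."""
    baseline = mse(features, targets)
    importances = np.zeros(features.shape[1])
    for col in range(features.shape[1]):
        increases = []
        for _ in range(n_repeats):
            shuffled = features.copy()
            shuffled[:, col] = rng.permutation(shuffled[:, col])
            increases.append(mse(shuffled, targets) - baseline)
        importances[col] = np.mean(increases)
    return importances


importances = permutation_importance_sketch(X, y)
print(importances)
```

Shuffling a column breaks its relationship with the target while keeping its distribution intact, so a large increase in error suggests the model relies on that feature.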


Example of permutation importance

Set up:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
# Train a random forest classifier (assuming we have some data)
model = RandomForestClassifier().fit(X_train, y_train)

Running our permutation importance test:

# Calculate feature importances using permutation importance
results = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
# Print the feature importances
feature_names = ['feature_1', 'feature_2', 'feature_3', ...]
importances = results.importances_mean
# Pair each name with its mean importance instead of indexing by position
for name, importance in zip(feature_names, importances):
    print(f'{name}: {importance}')

Looking for data drift

Data drift

Sometimes called feature drift
A change in the distribution of a model's input data
E.g. more people start using a new term for a new product, and the chatbot is left scratching its head
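
One possible way to check a numeric feature for drift is to compare a reference sample against recent data with the two-sample Kolmogorov-Smirnov statistic. The sketch below computes it directly in NumPy. The synthetic samples and the 0.2 alert threshold are assumptions, not recommended values.

```python
import numpy as np


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=1000)     # reference data
production_feature = rng.normal(loc=1.0, scale=1.0, size=1000)   # shifted data

DRIFT_THRESHOLD = 0.2  # assumed alert level; tune for the use case
drift_score = ks_statistic(training_feature, production_feature)
print(f"KS statistic: {drift_score:.3f}, drift flagged: {drift_score > DRIFT_THRESHOLD}")
```

The statistic is 0 when the two empirical distributions match exactly and 1 when the samples do not overlap at all.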

Label drift

A change in the label distribution
E.g. people shift away from asking for returns and toward asking about the status of their returns
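
A simple way to quantify a shift like this is to compare label proportions between two time windows. The sketch below uses total variation distance, which is one of several reasonable choices. The label names and counts are made up for illustration.

```python
from collections import Counter


def label_proportions(labels):
    """Map each label to its share of the sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(baseline_labels, current_labels):
    """Half the sum of absolute differences in label proportions (0 to 1)."""
    p = label_proportions(baseline_labels)
    q = label_proportions(current_labels)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in all_labels)


# Hypothetical chatbot intents from two time windows.
last_quarter = ["request_return"] * 8 + ["return_status"] * 2
this_quarter = ["request_return"] * 3 + ["return_status"] * 7

label_drift = total_variation_distance(last_quarter, this_quarter)
print(f"Label drift (total variation distance): {label_drift:.2f}")
```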

Let's practice!
