Testing models

Developing Machine Learning Models for Production

Sinan Ozdemir

Data Scientist, Entrepreneur, and Author

Individual vs Group fairness

Individual fairness

Treat similar individuals similarly.

E.g. Give similar job opportunities to individuals with similar experiences.
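One way to probe individual fairness is to look for pairs of people who are close in feature space but receive very different scores. The sketch below is illustrative only: the candidates, the distance function, and both thresholds are assumptions, and a real check would scale the features first.

```python
from itertools import combinations

# Hypothetical candidates: (years_experience, skills_score) and a model score in [0, 1]
candidates = {
    "a": ((5.0, 0.80), 0.72),
    "b": ((5.0, 0.82), 0.70),
    "c": ((1.0, 0.30), 0.20),
    "d": ((5.5, 0.80), 0.35),  # similar to "a" and "b" but scored much lower
}

def distance(u, v):
    """Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def unfair_pairs(people, max_distance=1.0, max_score_gap=0.1):
    """Return pairs that are close in features but far apart in score."""
    flagged = []
    for (name_1, (feat_1, score_1)), (name_2, (feat_2, score_2)) in combinations(
        people.items(), 2
    ):
        similar = distance(feat_1, feat_2) <= max_distance
        if similar and abs(score_1 - score_2) > max_score_gap:
            flagged.append((name_1, name_2))
    return flagged

flagged = unfair_pairs(candidates)
print(flagged)
```

Each flagged pair is a prompt for review, not proof of unfairness; the result depends heavily on how similarity is defined.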

Group fairness

Treat different groups equally.

E.g. Do not discriminate against individuals from certain racial or ethnic groups in school admittance models.
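A simple group fairness check compares how often each group receives the positive outcome. The sketch below computes per-group admission rates and the gap between them (often called demographic parity). The decision records and group labels are made up for illustration.

```python
from collections import defaultdict

# Hypothetical admission decisions: (group label, model decision), 1 = admit
decisions = [
    ("group_a", 1), ("group_a", 0), ("group_a", 1), ("group_a", 1),
    ("group_b", 1), ("group_b", 0), ("group_b", 0), ("group_b", 0),
]

def selection_rates(records):
    """Return the share of positive decisions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}

rates = selection_rates(decisions)
# Gap between the most and least favored groups; 0 means equal rates
parity_gap = max(rates.values()) - min(rates.values())

print(rates)
print(f"Demographic parity gap: {parity_gap:.2f}")
```

A large gap does not explain why the rates differ, but it signals where to look more closely.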

Holdout testing

Holdout testing checks model reliability using a separate dataset that was not used for training
Assesses performance on unseen data, the same way a test set does at training time
Detects issues like overfitting or underfitting
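A minimal sketch of holdout testing with scikit-learn, assuming a synthetic dataset from `make_classification` and an 80/20 split. A large gap between training and holdout accuracy suggests overfitting; low accuracy on both suggests underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the data; the model never sees it during training
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Compare accuracy on seen vs unseen data
train_acc = accuracy_score(y_train, clf.predict(X_train))
holdout_acc = accuracy_score(y_holdout, clf.predict(X_holdout))
gap = train_acc - holdout_acc

print(f"Train accuracy:   {train_acc:.3f}")
print(f"Holdout accuracy: {holdout_acc:.3f}")
print(f"Gap:              {gap:.3f}")
```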


Looking for model drift

Concept drift

A shift in the relationship between the features and the response
- E.g. a sentiment analysis algorithm
  - Terms can shift in meaning (calling something "sick" can now be a compliment)
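The "sick" example can be sketched with a toy sentiment scorer. The lexicon, the sample texts, and their labels are all invented for illustration; the point is only that a fixed mapping from words to sentiment loses accuracy once the meaning of a word shifts.

```python
# Lexicon learned from older data, when "sick" was negative (assumption)
lexicon = {"great": 1, "love": 1, "terrible": -1, "sick": -1}

def predict(text):
    """Label text positive if its summed word scores are above zero."""
    score = sum(lexicon.get(word, 0) for word in text.lower().split())
    return "positive" if score > 0 else "negative"

def accuracy(samples):
    """Share of (text, label) samples the scorer labels correctly."""
    correct = sum(predict(text) == label for text, label in samples)
    return correct / len(samples)

old_samples = [
    ("great show", "positive"),
    ("i feel sick", "negative"),
    ("terrible service", "negative"),
    ("love it", "positive"),
]
# Newer labeled data: "sick" is now used as praise
new_samples = [
    ("great show", "positive"),
    ("that trick was sick", "positive"),
    ("sick beat", "positive"),
    ("terrible service", "negative"),
]

old_acc = accuracy(old_samples)
new_acc = accuracy(new_samples)
print(f"Accuracy on old data: {old_acc}")
print(f"Accuracy on new data: {new_acc}")
```

The inputs look much the same in both periods; what changed is the relationship between the words and the label.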

Prediction drift

A shift in a model's prediction distribution
- E.g. a temporary outage leads to an increase in the "outage" intent for your chatbot
- Nothing is "wrong" per se, but this drift could lead to slower response times for users
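Prediction drift can be monitored without labels by comparing the distribution of predicted classes across two time windows. The sketch below uses total variation distance as the comparison; the intent names, counts, and the 0.2 alert threshold are assumptions chosen for illustration.

```python
from collections import Counter

def distribution(predictions):
    """Return the share of each predicted label."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two label distributions (0 to 1)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

# Hypothetical chatbot intent predictions from two time windows
reference = ["billing"] * 50 + ["outage"] * 10 + ["account"] * 40
recent = ["billing"] * 20 + ["outage"] * 60 + ["account"] * 20

shift = total_variation(distribution(reference), distribution(recent))
alert_threshold = 0.2  # assumed; tune for your own traffic

print(f"Prediction shift: {shift:.2f}")
if shift > alert_threshold:
    print("Prediction drift detected!")
```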

Example of looking for model drift

Setup:

# Import necessary libraries and split data
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = ...

Fitting our model:

# Train a classifier on the training data
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Calculate the accuracy on test data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Testing for drift later:

X_test_drift = X_test + 1.0  # Simulate data drift
# Calculate the accuracy on drifted data
y_pred_drift = clf.predict(X_test_drift)
accuracy_drift = accuracy_score(
    y_test, y_pred_drift)
print(f"Accuracy with drift: {accuracy_drift}")

# Drift detection threshold: flag a drop of more than 10% from the original accuracy
drift_threshold = accuracy * 0.9

# Check for a drop in accuracy on the drifted data
if accuracy_drift < drift_threshold:
    print("Drift detected: accuracy fell below the threshold!")
else:
    print("No drift detected.")

Cost of complex models vs baseline models

Complex models are more expensive to train and serve, and can reduce efficiency
Latency measures the time to process a single input and generate a prediction
Throughput measures the number of predictions a model can make in a given time
Test latency and throughput to identify how much complexity a model should have
The goal is a balance between accuracy and efficiency
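Latency and throughput can be measured with a simple timing loop. In the sketch below, `baseline_predict` and `complex_predict` are hypothetical stand-ins for a cheap and an expensive model, not real models; the extra loop in the second only simulates added computation.

```python
import time

def baseline_predict(x):
    """Stand-in for a cheap baseline model (hypothetical)."""
    return x > 0.5

def complex_predict(x):
    """Stand-in for a heavier model; the loop simulates extra compute."""
    total = 0.0
    for i in range(1, 501):
        total += (x * i) % 1.0
    return total / 500 > 0.5

def measure(predict, inputs):
    """Return (average latency in seconds, throughput in predictions/second)."""
    start = time.perf_counter()
    for x in inputs:
        predict(x)
    elapsed = time.perf_counter() - start
    latency = elapsed / len(inputs)
    throughput = len(inputs) / elapsed if elapsed > 0 else float("inf")
    return latency, throughput

inputs = [i / 200 for i in range(200)]
baseline_latency, baseline_throughput = measure(baseline_predict, inputs)
complex_latency, complex_throughput = measure(complex_predict, inputs)

print(f"Baseline: {baseline_latency * 1e6:.2f} us/prediction, "
      f"{baseline_throughput:,.0f} predictions/s")
print(f"Complex:  {complex_latency * 1e6:.2f} us/prediction, "
      f"{complex_throughput:,.0f} predictions/s")
```

Because this loop handles one input at a time, throughput is just the inverse of latency; with batching or parallel serving the two can diverge, which is why both are worth measuring.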

Let's practice!
