Testing models

Developing Machine Learning Models for Production

Sinan Ozdemir

Data Scientist, Entrepreneur, and Author

Individual vs Group fairness

Individual fairness

Treat similar individuals similarly.

  • E.g. Give similar job opportunities to individuals with similar experiences.

Group fairness

Treat different groups equally.

  • E.g. Do not discriminate against individuals from certain racial or ethnic groups in school admission models.
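
One common way to put a number on group fairness is to compare how often each group receives a positive prediction. The sketch below is only an illustration: the metric chosen (demographic parity difference), the function names, and the toy admission decisions are assumptions, not part of the original slides.

```python
import numpy as np

def selection_rates(y_pred, group):
    """Return the share of positive predictions for each group."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    return {str(g): float(y_pred[group == g].mean()) for g in np.unique(group)}

def demographic_parity_difference(y_pred, group):
    """Gap between the highest and lowest group selection rates (0 = equal)."""
    rates = selection_rates(y_pred, group)
    return max(rates.values()) - min(rates.values())

# Hypothetical admission decisions (1 = admit) for two made-up groups
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = selection_rates(y_pred, group)
gap = demographic_parity_difference(y_pred, group)
print(f"Selection rates: {rates}")
print(f"Demographic parity difference: {gap}")
```

A gap near zero means the groups are admitted at similar rates. What counts as an acceptable gap depends on the application, and other group fairness metrics may fit better in some settings.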

Holdout testing

  • Holdout testing checks model reliability using a separate dataset
  • Assesses performance on unseen data, like at inference time
  • Detects issues like overfitting or underfitting
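
The holdout idea can be sketched with scikit-learn as follows. The synthetic dataset, the 20% split size, and the variable names are assumptions made for illustration; a real project would hold out part of its own data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the rows; the model never sees them during training
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Compare accuracy on the data the model was fit on vs. the holdout set
train_accuracy = accuracy_score(y_train, clf.predict(X_train))
holdout_accuracy = accuracy_score(y_holdout, clf.predict(X_holdout))
print(f"Train accuracy:   {train_accuracy:.3f}")
print(f"Holdout accuracy: {holdout_accuracy:.3f}")

# A large gap suggests overfitting; low scores on both suggest underfitting
gap = train_accuracy - holdout_accuracy
print(f"Gap: {gap:.3f}")
```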


Looking for model drift

Concept drift

  • A shift in the relationship between the features and the response
    • E.g. a sentiment analysis algorithm
      • Terms can shift in meaning (calling something "sick" can now be praise)

Prediction drift

  • A shift in a model's prediction distribution
    • E.g. a temporary outage leads to an increase in the "outage" intent for your chatbot
    • Nothing is "wrong" per se, but this drift could lead to slower response times for users
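
Prediction drift can be monitored without any labels by comparing the distribution of predicted classes between a reference window and a recent window. The sketch below assumes total variation distance as the comparison and uses hypothetical chatbot intents and a hypothetical alert threshold; none of these come from the original slides.

```python
import numpy as np

def prediction_distribution(preds, labels):
    """Share of predictions falling into each label, in the order given."""
    preds = np.asarray(preds)
    return np.array([(preds == label).mean() for label in labels])

def total_variation_distance(p, q):
    """Half the L1 distance between two distributions (0 = same, 1 = disjoint)."""
    return 0.5 * float(np.abs(p - q).sum())

labels = ["billing", "outage", "other"]

# Hypothetical predicted intents: a normal week vs. a week with an outage
reference = ["billing"] * 5 + ["outage"] * 1 + ["other"] * 4
current = ["billing"] * 2 + ["outage"] * 6 + ["other"] * 2

p_ref = prediction_distribution(reference, labels)
p_cur = prediction_distribution(current, labels)
distance = total_variation_distance(p_ref, p_cur)

DRIFT_THRESHOLD = 0.2  # hypothetical value; tune it on historical data
drifted = distance > DRIFT_THRESHOLD
print(f"Reference: {p_ref}, current: {p_cur}")
print(f"Distance: {distance:.2f}, prediction drift: {drifted}")
```

A flagged shift does not mean the model is wrong; it only signals that the mix of predictions has changed and may deserve a closer look.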

Example of looking for model drift

Setup:

# Import necessary libraries and split data
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = ...

Fitting our model:

# Train a classifier on the training data
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Calculate the accuracy on test data
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Testing for drift later:

X_test_drift = X_test + 1.0  # Simulate data drift
# Calculate the accuracy on drifted data
y_pred_drift = clf.predict(X_test_drift)
accuracy_drift = accuracy_score(
    y_test, y_pred_drift)
print(f"Accuracy with drift: {accuracy_drift}")

# Drift detection threshold based on the accuracy
drift_threshold = accuracy * 0.9

# Check for drop in accuracy on the drifted data
if accuracy_drift < drift_threshold:
    print("Drift detected!")
else:
    print("No drift detected.")

Cost of complex models vs baseline models

  • Complex models cost more to train and serve, and can make predictions slower
  • Latency measures the time to process a single input and generate a prediction
  • Throughput measures the number of predictions a model can make in a given time
  • Test latency and throughput to identify how much complexity a model should have
  • Goal is a balance between accuracy and efficiency
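
Latency and throughput can be measured with a simple timer. The sketch below compares a single decision tree (as a baseline) with a random forest (as the more complex model). The helper names, the synthetic data, and the choice of models are assumptions for illustration, and the timings will vary by machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def measure_latency(model, X, n_runs=50):
    """Average seconds to predict one input at a time."""
    start = time.perf_counter()
    for i in range(n_runs):
        model.predict(X[i % len(X)].reshape(1, -1))
    return (time.perf_counter() - start) / n_runs

def measure_throughput(model, X):
    """Predictions per second when scoring the whole batch at once."""
    start = time.perf_counter()
    model.predict(X)
    elapsed = time.perf_counter() - start
    return len(X) / elapsed

# Synthetic data stands in for real traffic (assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "baseline (decision tree)": DecisionTreeClassifier(random_state=42),
    "complex (random forest)": RandomForestClassifier(
        n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X, y)
    results[name] = {
        "latency_s": measure_latency(model, X),
        "throughput_per_s": measure_throughput(model, X),
    }
    print(f"{name}: latency {results[name]['latency_s'] * 1000:.2f} ms, "
          f"throughput {results[name]['throughput_per_s']:.0f} predictions/s")
```

Setting these numbers next to each model's accuracy shows whether the extra complexity is worth its cost.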

Let's practice!
