Iterating without overfitting

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

A schematic indicating that a machine learning pipeline can be further tuned after it has been pushed to production.

The same schematic shows that more data can be extracted from production to help tune the model further.

The same schematic shows that domain experts can contribute new insights that change the model, such as new loss functions.

In this same schematic, the model that has been pushed to production is labelled as the champion, and the one in development is the challenger.
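
The champion/challenger comparison can be sketched roughly as follows. This is only an illustration: the synthetic data, the two model choices, and the names `X_new`, `y_new`, `scores` and `winner` are assumptions, not part of the course material.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for historical data plus fresh production data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Champion: the model currently in production (assumed model choice).
champion = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Challenger: the model under development (assumed model choice).
challenger = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Score both on the same data that neither model was fitted on.
scores = {
    "champion": champion.score(X_new, y_new),
    "challenger": challenger.score(X_new, y_new),
}

# Promote the challenger only if it is strictly better; ties keep the champion.
winner = "challenger" if scores["challenger"] > scores["champion"] else "champion"
print(scores, winner)
```

Keeping the champion on ties is a design choice made here to avoid replacing a production model without evidence of improvement.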

Cross-validation results

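import pandas as pd
from sklearn.model_selection import GridSearchCV

# pipe, params, X_train and y_train are assumed to be defined already.
# return_train_score=True adds the training scores to cv_results_.
grid_search = GridSearchCV(pipe, params, cv=3, return_train_score=True)
gs = grid_search.fit(X_train, y_train)
results = pd.DataFrame(gs.cv_results_)
results[['mean_train_score', 'std_train_score',
         'mean_test_score', 'std_test_score']]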
   mean_train_score  std_train_score  mean_test_score  std_test_score
0             0.829            0.006            0.735           0.009
1             0.829            0.006            0.725           0.009
2             0.961            0.008            0.716           0.019
3             0.981            0.005            0.749           0.024
...

Cross-validation results

   mean_train_score  std_train_score  mean_test_score  std_test_score
0             0.829            0.006            0.735           0.009
1             0.829            0.006            0.725           0.009
2             0.961            0.008            0.716           0.019
3             0.981            0.005            0.749           0.024
4             0.986            0.003            0.728           0.009
5             0.995            0.002            0.751           0.008

Observations:

  • The training score is much higher than the test score (0.995 vs 0.751 in row 5).
  • The standard deviation of the test score is large (up to 0.024), comparable to the differences between rows.
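
Both observations can be checked numerically. The sketch below copies the scores from the table above into a DataFrame; the column name `train_test_gap` is introduced here for illustration only.

```python
import pandas as pd

# Scores copied from the cross-validation table above.
results = pd.DataFrame({
    "mean_train_score": [0.829, 0.829, 0.961, 0.981, 0.986, 0.995],
    "mean_test_score": [0.735, 0.725, 0.716, 0.749, 0.728, 0.751],
    "std_test_score": [0.009, 0.009, 0.019, 0.024, 0.009, 0.008],
})

# Gap between training and test score: a large gap suggests overfitting.
results["train_test_gap"] = (
    results["mean_train_score"] - results["mean_test_score"]
)

# Row with the widest gap, and row with the noisiest test score.
widest_gap_row = results["train_test_gap"].idxmax()
noisiest_row = results["std_test_score"].idxmax()

print(results[["train_test_gap", "std_test_score"]])
print(widest_gap_row, noisiest_row)
```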

A dataset split into training and test, with training further split into chunks by cross-validation.

A dataset split into training and validation, with training further split into chunks by cross-validation.
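
A minimal sketch of this split, assuming synthetic data and an arbitrary 75/25 split (neither is taken from the slides): the validation set is held out first, and cross-validation then divides only the training portion into folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split

# Synthetic data stands in for the real dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold out a validation set that cross-validation never sees.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Cross-validation splits only the training data into chunks.
cv = KFold(n_splits=3)
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in cv.split(X_train)]

print(len(X_train), len(X_valid))  # rows in training and validation sets
print(fold_sizes)                  # (fit rows, CV test rows) per fold
```

Passing `cv=3` to `GridSearchCV` performs an equivalent three-way split internally (stratified by class for classifiers), so the explicit `KFold` is only here to make the chunk sizes visible.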

Detecting overfitting

  • CV Training Score >> CV Test Score
    • overfitting in the model fitting stage
    • reduce the complexity of the classifier
    • get more training data
    • increase the number of CV folds (cv), so each fold trains on more data
  • CV Test Score >> Validation Score
    • overfitting in the model tuning stage
    • decrease the number of CV folds (cv)
    • decrease the size of the parameter grid
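
The two checks above can be sketched in code. Everything named here is an assumption for illustration: the helper `diagnose`, its threshold `tol` (the slides do not quantify ">>"), the synthetic data, and the decision-tree grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def diagnose(cv_train, cv_test, validation, tol=0.05):
    """Hypothetical helper: flag the two overfitting patterns listed above.

    tol is an arbitrary illustrative threshold for "much greater than".
    """
    findings = []
    if cv_train - cv_test > tol:
        findings.append("overfitting in model fitting stage")
    if cv_test - validation > tol:
        findings.append("overfitting in model tuning stage")
    return findings


# Synthetic data stands in for the real dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 5, 10]},
    cv=3,
    return_train_score=True,
)
grid_search.fit(X_train, y_train)

# Scores of the best parameter setting found by the grid search.
best = grid_search.best_index_
cv_train = grid_search.cv_results_["mean_train_score"][best]
cv_test = grid_search.cv_results_["mean_test_score"][best]
# The refitted best model, scored on data the grid search never saw.
validation = grid_search.score(X_valid, y_valid)

print(cv_train, cv_test, validation)
print(diagnose(cv_train, cv_test, validation))
```

A fixed threshold is a crude stand-in for judgment; in practice the gap would be weighed against the standard deviation of the CV test score.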

A dataset split into training and validation, with training further split into chunks by cross-validation.

A dataset split into training, validation and production data, with training further split into chunks by cross-validation.

"Expert in CV" in your CV!
