Iterating without overfitting

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

Schematic: a machine learning pipeline can be tuned further after it has been pushed to production.

Same schematic: more data can be extracted from production to help tune the model further.

Same schematic: domain experts can contribute new insights that might change the model, such as new loss functions.

Same schematic: the model in production is labelled the champion, and the model in development is labelled the challenger.
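
One way to picture the champion/challenger comparison is to score both models on the same fresh data. The sketch below is only illustrative: the synthetic data, the two model choices, and the "higher accuracy wins" rule are all assumptions, not part of the schematic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: real production data is not shown)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Hypothetical champion (in production) and challenger (in development)
champion = LogisticRegression(max_iter=1000).fit(X_train, y_train)
challenger = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Score both models on the same data that neither was fitted on
scores = {
    'champion': champion.score(X_new, y_new),
    'challenger': challenger.score(X_new, y_new),
}

# Illustrative rule: the model with the higher accuracy is the winner
winner = max(scores, key=scores.get)
print(scores, winner)
```

In practice the comparison would likely use the loss function agreed with domain experts rather than plain accuracy.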

Cross-validation results

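import pandas as pd
from sklearn.model_selection import GridSearchCV

# pipe, params, X_train and y_train are assumed to be defined beforehand
grid_search = GridSearchCV(pipe, params, cv=3, return_train_score=True)
gs = grid_search.fit(X_train, y_train)
# One row per parameter combination, with scores aggregated over the folds
results = pd.DataFrame(gs.cv_results_)

results[['mean_train_score', 'std_train_score',
         'mean_test_score', 'std_test_score']]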

   mean_train_score  std_train_score  mean_test_score  std_test_score
0             0.829            0.006            0.735           0.009
1             0.829            0.006            0.725           0.009
2             0.961            0.008            0.716           0.019
3             0.981            0.005            0.749           0.024
...

Cross-validation results

   mean_train_score  std_train_score  mean_test_score  std_test_score
0             0.829            0.006            0.735           0.009
1             0.829            0.006            0.725           0.009
2             0.961            0.008            0.716           0.019
3             0.981            0.005            0.749           0.024
4             0.986            0.003            0.728           0.009
5             0.995            0.002            0.751           0.008

Observations:

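The training score is much higher than the test score in every row (e.g. 0.995 vs 0.751 in row 5).
The standard deviation of the test score is large for some settings (0.019 and 0.024 in rows 2 and 3).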
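
Both observations can be read directly off the results table by adding a gap column. The following is a minimal sketch on synthetic data; the dataset, the classifier, the parameter grid, and the column name train_test_gap are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the original dataset is not shown)
X_train, y_train = make_classification(
    n_samples=300, n_features=10, random_state=0)

# Hypothetical parameter grid, chosen only for illustration
params = {'max_depth': [1, 3, None]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0), params,
                           cv=3, return_train_score=True)
results = pd.DataFrame(grid_search.fit(X_train, y_train).cv_results_)

# Gap between mean training and mean test score for each parameter setting;
# a large gap suggests overfitting in the model fitting stage
results['train_test_gap'] = (results['mean_train_score']
                             - results['mean_test_score'])

# Sort so the settings with the smallest gap come first
print(results[['params', 'mean_train_score', 'mean_test_score',
               'std_test_score', 'train_test_gap']]
      .sort_values('train_test_gap'))
```

The std_test_score column is kept in the printout because a large spread across folds makes small differences in mean_test_score hard to trust.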

A dataset split into training and test, with training further split into chunks by cross-validation.

A dataset split into training and validation, with training further split into chunks by cross-validation.
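
The split in the diagram can be sketched as follows, assuming a toy dataset of 60 rows and a 25% validation share (both are illustrative choices). Plain KFold is used here to make the chunk sizes easy to see; for classifiers, GridSearchCV with an integer cv uses stratified folds instead.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data: 60 rows, 2 features (assumption, for illustration only)
X = np.arange(120).reshape(60, 2)
y = np.arange(60) % 2

# First split: hold out a validation set that cross-validation never sees
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Second split: cross-validation chunks the training set only
kf = KFold(n_splits=3)
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in kf.split(X_train)]

print(len(X_train), len(X_val))  # 45 15
print(fold_sizes)                # [(30, 15), (30, 15), (30, 15)]
```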

Detecting overfitting

CV Training Score >> CV Test Score
- overfitting in the model fitting stage
- reduce the complexity of the classifier
- get more training data
- increase the number of CV folds (cv)

CV Test Score >> Validation Score
- overfitting in the model tuning stage
- decrease the number of CV folds (cv)
- decrease the size of the parameter grid
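
The two checks above can be sketched as a small helper. The function name diagnose and the tolerance of 0.05 are hypothetical; how large a gap counts as ">>" is a judgment call that depends on the problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def diagnose(cv_train_score, cv_test_score, validation_score, tol=0.05):
    """Flag the two kinds of overfitting (hypothetical helper, arbitrary tol)."""
    findings = []
    if cv_train_score - cv_test_score > tol:
        findings.append('overfitting in model fitting stage')
    if cv_test_score - validation_score > tol:
        findings.append('overfitting in model tuning stage')
    return findings


# Synthetic stand-in data (assumption: the original dataset is not shown)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {'max_depth': [2, 4, None]},  # illustrative grid
                  cv=3, return_train_score=True).fit(X_train, y_train)

# Scores of the best parameter setting found by the grid search
cv_train_score = gs.cv_results_['mean_train_score'][gs.best_index_]
cv_test_score = gs.best_score_
# Score of the refitted best model on data the search never saw
validation_score = gs.score(X_val, y_val)

print(diagnose(cv_train_score, cv_test_score, validation_score))
```

For example, diagnose(0.995, 0.751, 0.74) flags only the model fitting stage, because the CV test score and the validation score differ by less than the tolerance.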

A dataset split into training, validation and production data, with training further split into chunks by cross-validation.

"Expert in CV" in your CV!
