Model complexity and overfitting

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

What is model complexity?

RandomForestClassifier() takes additional arguments, like max_depth:

from sklearn.ensemble import RandomForestClassifier
help(RandomForestClassifier)
Help on class RandomForestClassifier in module sklearn.ensemble.forest:
...
 |  max_depth : integer or None, optional (default=None)
 |      The maximum depth of the tree. If None, then nodes are expanded until
 |      all leaves are pure or until all leaves contain less than
 |      min_samples_split samples.
m2 = RandomForestClassifier(
    max_depth=2)
m2.fit(X_train, y_train)

m2.estimators_[0]

A decision tree with depth 2.

m4 = RandomForestClassifier(
    max_depth=4)
m4.fit(X_train, y_train)

m4.estimators_[0]

A decision tree with depth 4.
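
One way to confirm that max_depth caps the size of every tree in the forest is to query the depth of each fitted tree. The following is a minimal sketch: the synthetic dataset from make_classification is an assumption standing in for the course's X_train and y_train, and n_estimators=10 is chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train, y_train (an assumption, not the course data).
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

depths = {}
for max_depth in (2, 4):
    forest = RandomForestClassifier(n_estimators=10, max_depth=max_depth, random_state=0)
    forest.fit(X_train, y_train)
    # Each element of estimators_ is a fitted DecisionTreeClassifier.
    depths[max_depth] = max(tree.get_depth() for tree in forest.estimators_)

# Deepest tree found in each forest, keyed by the max_depth setting.
print(depths)
```

No tree should exceed the requested depth, so a larger max_depth allows a more complex model.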

The standard practice is to split the data into training, test (or development) and validation (or hold-out) sets.
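
A three-way split like this can be sketched with two calls to train_test_split. The 60/20/20 proportions, the synthetic data and the variable names (X_dev, X_holdout) are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, used only to make the sketch runnable).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First set aside 20% of the data as the final hold-out set.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder 75/25: 60% training and 20% development overall.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_dev), len(X_holdout))
```

The model is fitted on the training portion, tuned against the development portion, and scored once at the end on the hold-out portion.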

In cross-validation, the train-test split is performed several times. The dataset is split into N chunks; each time, a different N-1 of them are used for training and the remaining one is used for testing.
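
The procedure can be written out by hand with KFold, which makes the N train-test splits explicit. This is a sketch under assumptions: the data is synthetic, N is set to 3, and times_tested is a hypothetical helper array that counts how often each row lands in a test chunk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic data (an assumption, used only to make the sketch runnable).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
scores = []
times_tested = np.zeros(len(X), dtype=int)

for train_idx, test_idx in kfold.split(X):
    # Train on N-1 chunks, test on the remaining one.
    model = RandomForestClassifier(n_estimators=25, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

print(scores, np.mean(scores))
```

Every row is used for testing exactly once. Note that cross_val_score() with a classifier defaults to stratified folds, so its scores will generally differ slightly from this plain KFold loop.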

Cross-validation

Assess accuracy using cross_val_score():

from sklearn.model_selection import cross_val_score

cross_val_score(RandomForestClassifier(), X, y)
array([0.7218 , 0.7682, 0.7866])
import numpy
numpy.mean(cross_val_score(RandomForestClassifier(), X, y))
0.7589

Tuning model complexity

Tune the tree depth using GridSearchCV():

from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[5,10,20]}
grid = GridSearchCV(RandomForestClassifier(), param_grid)
grid.fit(X,y)
grid.best_params_
{'max_depth': 10}
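
Beyond the winning setting, the fitted search object also records the cross-validated score for every candidate depth. A minimal sketch, assuming synthetic data and cv=3; the scores it prints are not the ones from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption, used only to make the sketch runnable).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {'max_depth': [5, 10, 20]}
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0), param_grid, cv=3)
grid.fit(X, y)

# Mean cross-validated accuracy for each candidate depth.
mean_scores = grid.cv_results_['mean_test_score']
for depth, score in zip(grid.cv_results_['param_max_depth'], mean_scores):
    print(depth, round(score, 3))

# The best setting and its mean cross-validated accuracy.
print(grid.best_params_, round(grid.best_score_, 3))
```

Comparing the per-depth scores shows how much is gained by the best depth over its neighbours, which a single best_params_ value hides.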

In-sample accuracy starts at 0.7 for a maximum depth of 3, and increases almost to 1.0 as depth ranges from 5 to 30.

Out-of-sample accuracy also starts at 0.7, peaks at 0.75 for a depth of 10, and then falls back to 0.7 for larger depths.

The range from depth 10 onwards is in red, indicating that overfitting is taking place.
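
Curves like these can be approximated with validation_curve(), which returns in-sample and out-of-sample scores for each value of a parameter. This is a sketch on noisy synthetic data (flip_y=0.2 is an assumption chosen to make overfitting visible); it will not reproduce the exact numbers described in the figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Noisy synthetic data (an assumption, not the course dataset).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=0)

depths = [3, 10, 30]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=25, random_state=0),
    X, y, param_name='max_depth', param_range=depths, cv=3)

# Average over the cross-validation folds: one value per depth.
in_sample = train_scores.mean(axis=1)
out_of_sample = test_scores.mean(axis=1)
for depth, tr, te in zip(depths, in_sample, out_of_sample):
    print(depth, round(tr, 3), round(te, 3))
```

A growing gap between the in-sample and out-of-sample columns as depth increases is the signature of overfitting.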

More complex is not always better!
