Model complexity and overfitting

Designing Machine Learning Workflows in Python

Dr. Chris Anagnostopoulos

Honorary Associate Professor

What is model complexity?

RandomForestClassifier() takes additional arguments, like max_depth:

from sklearn.ensemble import RandomForestClassifier
help(RandomForestClassifier)

Help on class RandomForestClassifier in module sklearn.ensemble.forest:
...
 |  max_depth : integer or None, optional (default=None)
 |      The maximum depth of the tree. If None, then nodes are expanded until
 |      all leaves are pure or until all leaves contain less than
 |      min_samples_split samples.

m2 = RandomForestClassifier(
    max_depth=2)
m2.fit(X_train, y_train)

m2.estimators_[0]

A decision tree with depth 2.

m4 = RandomForestClassifier(
    max_depth=4)
m4.fit(X_train, y_train)

m4.estimators_[0]

A decision tree with depth 4.
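To check the effect of max_depth without drawing the trees, a fitted forest can be queried for the depth of its individual trees. The following is a minimal sketch, assuming a synthetic dataset from make_classification as a stand-in for X_train and y_train (the course data is not shown here); n_estimators and random_state are set only to keep the run small and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train, y_train (an assumption, not the course data).
X_train, y_train = make_classification(
    n_samples=500, n_features=10, random_state=0)

deepest = {}
most_leaves = {}
for max_depth in (2, 4):
    model = RandomForestClassifier(
        max_depth=max_depth, n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    # Each element of estimators_ is a fitted decision tree.
    deepest[max_depth] = max(t.get_depth() for t in model.estimators_)
    most_leaves[max_depth] = max(t.get_n_leaves() for t in model.estimators_)
    print(max_depth, deepest[max_depth], most_leaves[max_depth])
```

A binary tree of depth d has at most 2**d leaves, so raising max_depth from 2 to 4 raises the ceiling from 4 to 16 leaves per tree.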

The standard practice is to split the data into training, test (or development) and validation (or hold-out) sets.

In cross-validation the train-test split is performed several times. The dataset is split into N chunks; each time, a different N-1 of them are used for training and the remaining one is used for testing.
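The chunking described above can be sketched with scikit-learn's KFold splitter. This is an illustration only: the six-row array is a made-up toy dataset, and N=3 is chosen arbitrarily.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples (an assumption for illustration).
X = np.arange(12).reshape(6, 2)

# N = 3 chunks; without shuffling, the chunks are consecutive blocks of rows.
kfold = KFold(n_splits=3)

folds = []
for train_idx, test_idx in kfold.split(X):
    # N-1 chunks go to training, the remaining chunk to testing.
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in the test chunk exactly once across the N rounds.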

Cross-validation

Assess accuracy using cross_val_score():

from sklearn.model_selection import cross_val_score

# cv=3 gives three scores; recent scikit-learn versions default to 5 folds
cross_val_score(RandomForestClassifier(), X, y, cv=3)

array([0.7218, 0.7682, 0.7866])

import numpy
numpy.mean(cross_val_score(RandomForestClassifier(), X, y, cv=3))

0.7589

Tuning model complexity

Tune the tree depth using GridSearchCV():

from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[5,10,20]}
grid = GridSearchCV(RandomForestClassifier(), param_grid)
grid.fit(X,y)
grid.best_params_

{'max_depth': 10}
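Beyond the single best setting, the fitted grid search also stores the mean cross-validated score for every candidate depth. The following is a sketch under assumptions: the data is synthetic (make_classification), and n_estimators, cv and random_state are set only to keep the run small and repeatable, so the scores will not match the figures on these slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X, y (an assumption, not the course data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {'max_depth': [5, 10, 20]}
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid, cv=3)
grid.fit(X, y)

# One mean cross-validated accuracy per candidate depth.
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(params, round(score, 4))

# The winning setting and its mean cross-validated accuracy.
print(grid.best_params_, round(grid.best_score_, 4))
```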

In-sample accuracy starts at 0.7 for a maximum depth of 3, and increases almost to 1.0 as depth ranges from 5 to 30.

Out-of-sample accuracy also starts at 0.7, peaks at 0.75 for a depth of 10, and then falls back to 0.7 for larger depths.

The range from depth 10 onwards is in red, indicating that overfitting is taking place.
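A comparison of this kind can be produced by scoring the same model on the data it was trained on and on held-out data, for a range of depths. The sketch below assumes a noisy synthetic dataset (make_classification with flip_y) and an arbitrary list of depths; the numbers it prints will differ from the curve described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data (an assumption, not the course data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for depth in (3, 5, 10, 20, 30):
    model = RandomForestClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    in_sample = model.score(X_train, y_train)    # accuracy on training data
    out_of_sample = model.score(X_test, y_test)  # accuracy on held-out data
    results[depth] = (in_sample, out_of_sample)
    print(depth, round(in_sample, 3), round(out_of_sample, 3))
```

A widening gap between the two columns as depth grows is the signature of overfitting.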

More complex is not always better!
