Machine Learning with Tree-Based Models in Python
Elie Kawerk
Data Scientist
Machine learning model:
parameters: learned from data
hyperparameters: not learned from data, set prior to training
max_depth, min_samples_leaf, splitting criterion, ...
Problem: search for a set of optimal hyperparameters for a learning algorithm.
Solution: find a set of optimal hyperparameters that results in an optimal model.
Optimal model: yields an optimal score.
Score: in sklearn defaults to accuracy (classification) and $R^2$ (regression).
Cross validation is used to estimate the generalization performance.
In sklearn, a model's default hyperparameters are not optimal for all problems.
Hyperparameters should be tuned to obtain the best model performance.
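To make the points above concrete, here is a minimal sketch comparing a default tree to one with hand-set hyperparameters, each scored with 10-fold cross-validation. X_train and y_train are assumed to be the chapter's training arrays, and the tuned values are arbitrary, chosen only for illustration.
# Hyperparameters (e.g. max_depth) are set before training;
# parameters (the tree's splits) are learned from the data by fit()
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
dt_default = DecisionTreeClassifier(random_state=1)
dt_tuned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.1, random_state=1)
# Estimate generalization performance with 10-fold CV (accuracy by default)
print(cross_val_score(dt_default, X_train, y_train, cv=10).mean())
print(cross_val_score(dt_tuned, X_train, y_train, cv=10).mean())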
Grid Search
Random Search
Bayesian Optimization
Genetic Algorithms
....
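The rest of this chapter focuses on grid search. For comparison, here is a minimal sketch of the random search alternative listed above, using sklearn's RandomizedSearchCV; the parameter distributions and n_iter are illustrative assumptions, and dt, X_train and y_train are the objects defined further down.
# Random search: sample a fixed number of hyperparameter combinations
# at random instead of trying every one exhaustively
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
params_dist = {'max_depth': randint(2, 10),             # integer depths 2-9
               'min_samples_leaf': uniform(0.01, 0.2)}  # leaf fractions in [0.01, 0.21]
random_dt = RandomizedSearchCV(estimator=dt, param_distributions=params_dist,
                               n_iter=20, scoring='accuracy', cv=10,
                               random_state=1, n_jobs=-1)
random_dt.fit(X_train, y_train)
print(random_dt.best_params_)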
Manually set a grid of discrete hyperparameter values.
Set a metric for scoring model performance.
Search exhaustively through the grid.
For each set of hyperparameters, evaluate each model's CV score.
The optimal hyperparameters are those of the model achieving the best CV score.
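To make the exhaustive-search step concrete, here is a minimal sketch of what grid search does under the hood, using sklearn's ParameterGrid with the same small grid as the example that follows. GridSearchCV, introduced below, automates exactly this loop.
# Enumerate every hyperparameter combination and score each with 10-fold CV
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier
grid = ParameterGrid({'max_depth': [2, 3, 4], 'min_samples_leaf': [0.05, 0.1]})
for params in grid:  # 3 x 2 = 6 combinations
    dt = DecisionTreeClassifier(random_state=1, **params)
    cv_score = cross_val_score(dt, X_train, y_train, cv=10).mean()
    print(params, cv_score)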
max_depth = {2, 3, 4}, min_samples_leaf = {0.05, 0.1}

# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Set seed to 1 for reproducibility
SEED = 1
# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)
# Print out 'dt's hyperparameters
print(dt.get_params())
{'class_weight': None,
'criterion': 'gini',
'max_depth': None,
'max_features': None,
'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'presort': False,
'random_state': 1,
'splitter': 'best'}
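Any of the hyperparameters listed above can be set before training, either in the constructor or afterwards with set_params(). A minimal sketch with arbitrary illustrative values; dt_manual is a hypothetical name used only here.
# Hyperparameters can be set at construction time ...
dt_manual = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.1, random_state=SEED)
# ... or changed on an existing estimator with set_params()
dt_manual.set_params(max_depth=5)
print(dt_manual.get_params()['max_depth'])  # 5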
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters 'params_dt'
params_dt = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8]
}

# Instantiate a 10-fold CV grid search object 'grid_dt'
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='accuracy',
                       cv=10,
                       n_jobs=-1)

# Fit 'grid_dt' to the training data
grid_dt.fit(X_train, y_train)
# Extract best hyperparameters from 'grid_dt'
best_hyperparams = grid_dt.best_params_
print('Best hyperparameters:\n', best_hyperparams)
Best hyperparameters:
{'max_depth': 3, 'max_features': 0.4, 'min_samples_leaf': 0.06}
# Extract best CV score from 'grid_dt'
best_CV_score = grid_dt.best_score_
print('Best CV accuracy: {:.3f}'.format(best_CV_score))
Best CV accuracy: 0.938
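Beyond the single best score, grid_dt stores the CV results of every combination it tried in its cv_results_ attribute. A minimal sketch of inspecting them, assuming pandas is available as pd:
# Inspect the CV score of every hyperparameter combination tried
import pandas as pd
cv_results = pd.DataFrame(grid_dt.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())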
# Extract best model from 'grid_dt'
best_model = grid_dt.best_estimator_

# Evaluate test set accuracy
test_acc = best_model.score(X_test, y_test)

# Print test set accuracy
print("Test set accuracy of best model: {:.3f}".format(test_acc))
Test set accuracy of best model: 0.947
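Because GridSearchCV refits the best estimator on the whole training set by default (refit=True), best_model can be used directly for prediction. A minimal sketch:
# best_model is already refit on the full training set, so it can predict directly
y_pred = best_model.predict(X_test)
print(y_pred[:5])
Equivalently, calling grid_dt.predict(X_test) delegates to the same best estimator.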