Hyperparameter Tuning in Python
Alex Scriven
Data Scientist
Let's analyze the GridSearchCV outputs.
Three different groups for the GridSearchCV properties;
cv_results_
best_index_
, best_params_
& best_score_
scorer_
, n_splits_
& refit_time_
Properties are accessed using the dot notation.
For example:
grid_search_object.property
Where property
is the actual property you want to retrieve
The cv_results_
property:
Read this into a DataFrame to print and analyze:
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df.shape)
(12, 23)
The time
columns refer to the time it took to fit (and score) the model.
Remember how we did a 5-fold cross-validation? This ran 5 times and stored the average and standard deviation of the times it took in seconds.
The param_
columns store the parameters it tested on that row, one column per parameter
The params
column contains dictionary of all the parameters:
pd.set_option("display.max_colwidth", -1)
print(cv_results_df.loc[:, "params"])
The test_score
columns contain the scores on our test set for each of our cross-folds as well as some summary statistics:
The rank column, ordering the mean_test_score
from best to worst:
We can select the best grid square easily from cv_results_
using the rank_test_score
column
best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]
print(best_row)
The test_score
columns are then repeated for the training_scores
.
Some important notes to keep in mind:
return_train_score
must be True
to include training scores columns.
There is no ranking column for the training scores, as we only care about test set performance
Information on the best grid square is neatly summarized in the following three properties:
best_params_
, the dictionary of parameters that gave the best score.
best_score_
, the actual best score.
best_index_
, the row in our cv_results_.rank_test_score
that was the best.
The best_estimator_
property is an estimator built using the best parameters from the grid search.
For us this is a Random Forest estimator:
type(grid_rf_class.best_estimator_)
sklearn.ensemble.forest.RandomForestClassifier
We could also directly use this object as an estimator if we want!
print(grid_rf_class.best_estimator_)
Some extra information is available in the following properties:
scorer_
What scorer function was used on the held out data. (we set it to AUC)
n_splits_
How many cross-validation splits. (We set to 5)
refit_time_
The number of seconds used for refitting the best model on the whole dataset.
Hyperparameter Tuning in Python