Hyperparameter Tuning in Python
Alex Scriven
Data Scientist
Let's analyze the GridSearchCV outputs.
Three different groups of GridSearchCV properties:
cv_results_
best_index_, best_params_ & best_score_
scorer_, n_splits_ & refit_time_
Properties are accessed using the dot notation.
For example:
grid_search_object.property
Where property is the name of the property you want to retrieve.
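As a minimal, runnable sketch of this dot notation (the toy dataset and small parameter grid below are illustrative, not from the course):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data and a deliberately small grid so the search runs quickly
X, y = make_classification(n_samples=200, random_state=42)

grid_rf_class = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    cv=5,
    scoring="roc_auc",
)
grid_rf_class.fit(X, y)

# Dot notation: grid_search_object.property
print(grid_rf_class.best_score_)  # best mean cross-validated AUC
print(grid_rf_class.n_splits_)    # 5
```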
The cv_results_ property:
Read this into a DataFrame to print and analyze:
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df.shape)
(12, 23)
The time columns refer to how long it took to fit (and score) the model.
Remember how we did a 5-fold cross-validation? Each candidate was fit and scored 5 times, and these columns store the mean and standard deviation of those times in seconds.
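For example, the four time columns can be pulled out of cv_results_ like this (toy data and grid below are assumptions, not the course's setup):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)
grid_rf_class = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    cv=5,
).fit(X, y)

cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)

# Mean/std of the 5 fit times and 5 score times, in seconds
time_cols = [c for c in cv_results_df.columns if c.endswith("_time")]
print(cv_results_df[time_cols])
```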

The param_ columns store the parameters tested on that row, one column per hyperparameter.

The params column contains a dictionary of all the parameters:
pd.set_option("display.max_colwidth", None)  # -1 is deprecated in recent pandas
print(cv_results_df.loc[:, "params"])
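A sketch of both kinds of parameter columns, using an assumed two-parameter grid on toy data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)
grid_rf_class = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    cv=5,
).fit(X, y)
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)

# One param_ column per hyperparameter in the grid...
param_cols = [c for c in cv_results_df.columns if c.startswith("param_")]
print(param_cols)  # e.g. ['param_max_depth', 'param_n_estimators']

# ...plus a single params column holding the full dictionary per row
print(cv_results_df["params"].iloc[0])
```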

The test_score columns contain the scores on the test fold for each cross-validation split, as well as some summary statistics (mean and standard deviation):

The rank_test_score column orders the rows by mean_test_score from best to worst:

We can easily select the best grid square from cv_results_ using the rank_test_score column:
best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]
print(best_row)

The test_score columns are then repeated for the training scores (the train_score columns).
Some important notes to keep in mind:
return_train_score must be True to include training scores columns.
There is no ranking column for the training scores, as we only care about test-set performance.
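A minimal sketch of both notes (toy data; the single-parameter grid is an assumption):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)
grid_rf_class = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10]},
    cv=5,
    return_train_score=True,  # without this, no train_score columns appear
).fit(X, y)
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)

train_cols = [c for c in cv_results_df.columns if "train_score" in c]
print(train_cols)  # split0..split4_train_score, mean_train_score, std_train_score
print("rank_train_score" in cv_results_df.columns)  # False: no ranking for train scores
```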
Information on the best grid square is neatly summarized in the following three properties:
best_params_, the dictionary of parameters that gave the best score.
best_score_, the actual best score.
best_index_, the index of the row in cv_results_ that was the best (the row where rank_test_score is 1).
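These three properties can be cross-checked against cv_results_ (toy data and grid below are assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)
grid_rf_class = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    cv=5,
).fit(X, y)
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)

# The row at best_index_ is the rank-1 row, and it agrees with
# best_params_ and best_score_
best_row = cv_results_df.loc[grid_rf_class.best_index_]
print(best_row["rank_test_score"])                               # 1
print(best_row["params"] == grid_rf_class.best_params_)          # True
print(best_row["mean_test_score"] == grid_rf_class.best_score_)  # True
```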
The best_estimator_ property is an estimator built using the best parameters from the grid search.
For us this is a Random Forest estimator:
type(grid_rf_class.best_estimator_)
sklearn.ensemble.forest.RandomForestClassifier
We could also directly use this object as an estimator if we want!
print(grid_rf_class.best_estimator_)
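For example (toy data and grid assumed), the refitted estimator can be used to predict directly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)
grid_rf_class = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10]},
    cv=5,
).fit(X, y)

# best_estimator_ is a fitted RandomForestClassifier (refit on all the data)
best_rf = grid_rf_class.best_estimator_
print(type(best_rf).__name__)  # RandomForestClassifier
print(best_rf.predict(X[:5]))  # class predictions for the first 5 rows
```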

Some extra information is available in the following properties:
scorer_: what scorer function was used on the held-out data (we set it to AUC).
n_splits_: how many cross-validation splits (we set it to 5).
refit_time_: the number of seconds used for refitting the best model on the whole dataset.
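A sketch printing these three properties (toy grid search assumed, with scoring set to AUC as in the course):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)
grid_rf_class = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10]},
    cv=5,
    scoring="roc_auc",
).fit(X, y)

print(grid_rf_class.scorer_)      # the AUC scorer applied to held-out folds
print(grid_rf_class.n_splits_)    # 5
print(grid_rf_class.refit_time_)  # seconds to refit the best model on all the data
```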