Grid Search with Scikit Learn

Hyperparameter Tuning in Python

Alex Scriven

Data Scientist

GridSearchCV Object

 

Introducing a GridSearchCV object:

# Signature as of scikit-learn 1.x (older releases showed placeholder
# defaults such as cv='warn' and a since-removed fit_params argument)
sklearn.model_selection.GridSearchCV(
    estimator,
    param_grid, *, scoring=None,
    n_jobs=None, refit=True, cv=None,
    verbose=0, pre_dispatch='2*n_jobs',
    error_score=nan,
    return_train_score=False)

Steps in a Grid Search

 

Steps in a Grid Search:

  1. Choosing an algorithm whose hyperparameters we will tune (sometimes called an 'estimator')
  2. Defining which hyperparameters we will tune
  3. Defining a range of values for each hyperparameter
  4. Setting a cross-validation scheme
  5. Defining a score function so we can decide which square on our grid was 'the best'
  6. Including extra useful information or functions
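
The steps above can be sketched as a hand-written loop, which is roughly what GridSearchCV automates. This is a minimal illustration, not the library's implementation; the synthetic dataset, the small value lists and n_estimators=20 are assumptions chosen to keep it quick.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Steps 2 and 3: which hyperparameters to tune and the values to try
max_depth_list = [2, 4]
min_samples_leaf_list = [1, 4]

results = []
for max_depth, min_samples_leaf in product(max_depth_list, min_samples_leaf_list):
    # Step 1: the estimator, configured for this grid square
    model = RandomForestClassifier(
        n_estimators=20,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=0)
    # Steps 4 and 5: 3-fold cross-validation scored by accuracy
    scores = cross_val_score(model, X, y, cv=3, scoring='accuracy')
    results.append((scores.mean(), max_depth, min_samples_leaf))

# The best grid square is the one with the highest mean score
best_score, best_depth, best_leaf = max(results)
print(best_score, best_depth, best_leaf)
```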

GridSearchCV Object Inputs

The important inputs are:

  • estimator
  • param_grid
  • cv
  • scoring
  • refit
  • n_jobs
  • return_train_score

GridSearchCV 'estimator'

 

The estimator input:

  • Essentially our algorithm
  • You have already worked with KNN, Random Forest, GBM, Logistic Regression

 

Remember:

  • Only one estimator per GridSearchCV object

GridSearchCV 'param_grid'

The param_grid input:

  • Setting which hyperparameters and values to test

Rather than a list:

max_depth_list = [2, 4, 6, 8]
min_samples_leaf_list = [1, 2, 4, 6]

This would be:

param_grid = {'max_depth': [2, 4, 6, 8],
              'min_samples_leaf': [1, 2, 4, 6]}
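
To see how many 'grid squares' this dictionary describes, the combinations can be enumerated with the standard library. This is only a sketch of the idea; scikit-learn's ParameterGrid class provides a similar expansion.

```python
from itertools import product

param_grid = {'max_depth': [2, 4, 6, 8],
              'min_samples_leaf': [1, 2, 4, 6]}

# Every combination of one value per hyperparameter is one grid square
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

print(len(combinations))  # 16
print(combinations[0])    # {'max_depth': 2, 'min_samples_leaf': 1}
```

Each of the 16 squares is fitted once per cross-validation fold, so the grid size multiplies quickly with every extra hyperparameter.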

GridSearchCV 'param_grid'

The param_grid input:

Remember: The keys in your param_grid dictionary must be valid hyperparameters.

For example, for a Logistic regression estimator:

# Incorrect
param_grid = {'C': [0.1,0.2,0.5],
              'best_choice': [10,20,50]}
ValueError: Invalid parameter best_choice for estimator LogisticRegression
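
One way to catch this before running the search is to compare the grid keys with the names returned by the estimator's get_params() method. The check below is a small sketch, not something GridSearchCV requires.

```python
from sklearn.linear_model import LogisticRegression

param_grid = {'C': [0.1, 0.2, 0.5],
              'best_choice': [10, 20, 50]}

# get_params() lists every hyperparameter the estimator accepts
valid_names = LogisticRegression().get_params().keys()

invalid = [name for name in param_grid if name not in valid_names]
print(invalid)  # ['best_choice']
```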

GridSearchCV 'cv'

The cv input:

  • Choice of how to undertake cross-validation
  • Using an integer k undertakes k-fold cross-validation; 5 or 10 folds is a common choice

k-fold wikipedia
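
Besides an integer, cv also accepts a splitter object, which gives control over shuffling and the random seed. The sketch below shows what a 5-fold split looks like on ten toy rows; the array and seed are made up for illustration. With an integer cv and a classifier, scikit-learn uses stratified folds instead of plain KFold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy rows, two features each
X = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kfold.split(X))

for train_idx, test_idx in folds:
    # Each row lands in the test set of exactly one fold
    print('train:', train_idx, 'test:', test_idx)

# The same object could be passed to a search as cv=kfold
```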


GridSearchCV 'scoring'

 

The scoring input:

  • Which score to use to choose the best grid square (model)
  • Use your own or Scikit Learn's metrics module

You can check all the built in scoring functions this way:

from sklearn import metrics
# scikit-learn 1.0 and later; older versions used sorted(metrics.SCORERS.keys())
sorted(metrics.get_scorer_names())
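
To 'use your own' score, a metric function can be wrapped with make_scorer and passed as scoring. The example below is a sketch under assumptions: the F2 score, the synthetic data and the small C grid are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Wrap a metric as a scorer; beta=2 weights recall more than precision
f2_scorer = make_scorer(fbeta_score, beta=2)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring=f2_scorer,
    cv=3)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```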

GridSearchCV 'refit'

 

The refit input:

  • Refits the estimator on the whole training set using the best hyperparameters found
  • Allows the GridSearchCV object to be used as an estimator (for prediction)
  • A very handy option!
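
A short sketch of what refit=True makes available after fitting: the winning hyperparameters, the refitted model, and prediction directly from the search object. The data and the tiny grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 4]},
    cv=3,
    refit=True)
grid.fit(X, y)

# The winning combination and the model refitted with it
print(grid.best_params_)
best_model = grid.best_estimator_

# predict() on the search object delegates to the refitted model
predictions = grid.predict(X[:5])
```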

GridSearchCV 'n_jobs'

The n_jobs input:

  • Assists with parallel execution
  • Allows multiple models to be created at the same time, rather than one after the other

Some handy code:

import os
print(os.cpu_count())

Careful using all your cores for modelling if you want to do other work!
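
One simple convention, offered here as a suggestion and not a rule, is to leave one core free. Passing n_jobs=-1 would instead use every available processor.

```python
import os

# cpu_count() can return None on some platforms, so fall back to 1
total_cores = os.cpu_count() or 1

# Keep one core free for other work, but never go below one job
n_jobs = max(1, total_cores - 1)
print(n_jobs)
```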


GridSearchCV 'return_train_score'

 

The return_train_score input:

  • Records the training-set scores for every hyperparameter combination and fold
  • Useful for analyzing the bias-variance trade-off, but adds computational expense
  • Does not affect which model is picked as best; it is for analysis only
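
With return_train_score=True the training scores appear in the cv_results_ dictionary next to the test scores. The sketch below prints both for each grid square; a large gap between them suggests overfitting. The data and grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 4, 8]},
    cv=3,
    return_train_score=True)
grid.fit(X, y)

results = grid.cv_results_
rows = zip(results['params'],
           results['mean_train_score'],
           results['mean_test_score'])
for params, train_score, test_score in rows:
    print(params, round(train_score, 3), round(test_score, 3))
```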

Building a GridSearchCV object

 

Building our own GridSearchCV Object:

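from sklearn.ensemble import RandomForestClassifier

# Create the grid
param_grid = {'max_depth': [2, 4, 6, 8], 'min_samples_leaf': [1, 2, 4, 6]}

# Get a base classifier with some set parameters.
# max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers.
rf_class = RandomForestClassifier(criterion='entropy', max_features='sqrt')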

Building a GridSearchCV object

 

Putting the pieces together:

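from sklearn.model_selection import GridSearchCV

# param_grid and rf_class are the objects created on the previous slide
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=4,
    cv=10,
    refit=True,
    return_train_score=True)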

Using a GridSearchCV Object

 

Because we set refit to True we can directly use the object:

# Fit the object to our data
grid_rf_class.fit(X_train, y_train)

# Make predictions
grid_rf_class.predict(X_test)
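
For reference, the whole workflow can be run end to end as below. This is a self-contained sketch: the synthetic data, the train/test split, the reduced grid and cv=3 are assumptions chosen so that it runs quickly, not the course's own dataset or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for X_train, X_test, y_train, y_test
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

grid_rf_class = GridSearchCV(
    estimator=RandomForestClassifier(
        n_estimators=20, criterion='entropy', random_state=0),
    param_grid={'max_depth': [2, 4], 'min_samples_leaf': [1, 4]},
    scoring='accuracy',
    cv=3,
    refit=True)

# Fit the search, then predict with the refitted best model
grid_rf_class.fit(X_train, y_train)
predictions = grid_rf_class.predict(X_test)

# Score on data the search never saw
test_accuracy = accuracy_score(y_test, predictions)
print(grid_rf_class.best_params_, test_accuracy)
```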

Let's practice!
