Selecting the best model

Modeling with tidymodels in R

David Svancer

Data Scientist

Detailed tuning results

The collect_metrics() function provides summarized results by default

Passing summarize = FALSE will provide all hyperparameter tuning results

dt_tuning %>% 
  collect_metrics(summarize = FALSE)

# A tibble: 150 x 8
 id     cost_complexity tree_depth min_n .metric  ...  .estimate  .config
<chr>        <dbl>         <int>   <int>  <chr>   ...    <dbl>      <chr>  
Fold01    0.0000000758     14       39    sens    ...     0.75     Model1 
Fold01    0.0000000758     14       39    spec    ...     0.906    Model1 
Fold01    0.0000000758     14       39    roc_auc ...     0.888    Model1 
.....     ............     ..       ..    ......  ...     .....    ......
Fold10    0.00380          5        36    roc_auc ...     0.789    Model5

Exploring tuning results

Passing summarize = FALSE to collect_metrics() returns a tibble with one row per fold, model, and metric

Easy to explore results with dplyr
Exploring ROC AUC
- Select roc_auc metric
- Form groups by id column
- Calculate .estimate summary statistics

dt_tuning %>% 
  collect_metrics(summarize = FALSE) %>% 
  filter(.metric == 'roc_auc') %>%
  group_by(id) %>%
  summarize(min_roc_auc = min(.estimate),
            median_roc_auc = median(.estimate),
            max_roc_auc = max(.estimate))

# A tibble: 10 x 4
 id     min_roc_auc  median_roc_auc  max_roc_auc
<chr>      <dbl>          <dbl>       <dbl>
Fold01     0.830          0.885       0.888
Fold02     0.857          0.882       0.885
Fold03     0.818          0.836       0.836
......     ....           ....        ....
Fold10     0.762          0.790       0.813
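
The same dplyr steps can also be grouped by model instead of by fold. The sketch below is only an illustration: toy_results is a hypothetical stand-in for the output of collect_metrics(summarize = FALSE), with made-up values, so that it runs without the tuning object.

```r
library(dplyr)

# Hypothetical stand-in for collect_metrics(summarize = FALSE) output:
# two folds, two candidate models, one metric. Values are made up.
toy_results <- tibble(
  id        = c("Fold01", "Fold01", "Fold02", "Fold02"),
  .metric   = "roc_auc",
  .estimate = c(0.80, 0.90, 0.70, 0.75),
  .config   = c("Model1", "Model2", "Model1", "Model2")
)

# Group by model configuration (.config) rather than by fold (id)
by_model <- toy_results %>%
  filter(.metric == "roc_auc") %>%
  group_by(.config) %>%
  summarize(min_roc_auc = min(.estimate),
            max_roc_auc = max(.estimate))

by_model
```

Grouping by id shows how hard each fold is across models; grouping by .config shows how stable each model is across folds.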

Viewing the best performing models

The show_best() function

Displays the top n models ranked by the mean value of the chosen metric across folds
Model1 is the winner, with the highest mean roc_auc

dt_tuning %>% 
  show_best(metric = 'roc_auc', n = 5)

# A tibble: 5 x 9
cost_complexity  tree_depth  min_n  .metric .estimator   mean    n    std_err  .config
    <dbl>           <int>    <int>    <chr>   <chr>      <dbl>  <int>  <dbl>    <chr>
0.0000000758         14       39     roc_auc  binary     0.827   10   0.0147   Model1 
0.00380               5       36     roc_auc  binary     0.825   10   0.0146   Model5 
0.0243                5       34     roc_auc  binary     0.823   10   0.0147   Model2 
0.00000443           11       8      roc_auc  binary     0.816   10   0.00786  Model3 
0.000000600           3       5      roc_auc  binary     0.814   10   0.0131   Model4
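
The mean, n, and std_err columns can be approximated by hand from the per-fold results. This is a rough sketch, not the tune implementation: toy_results is a hypothetical tibble with made-up values, and the standard error is assumed to be the standard deviation divided by the square root of the number of folds.

```r
library(dplyr)

# Hypothetical per-fold roc_auc values for two candidate models (made up)
toy_results <- tibble(
  id        = c("Fold01", "Fold01", "Fold02", "Fold02"),
  .estimate = c(0.80, 0.90, 0.70, 0.75),
  .config   = c("Model1", "Model2", "Model1", "Model2")
)

# Average each model over folds, then rank from best to worst
ranked <- toy_results %>%
  group_by(.config) %>%
  summarize(mean    = mean(.estimate),
            n       = n(),
            std_err = sd(.estimate) / sqrt(n)) %>%  # assumed formula
  arrange(desc(mean))

# Keep only the top row, analogous to what select_best() returns
top <- ranked %>% slice_head(n = 1)
top
```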

Selecting a model

The select_best() function

Pass dt_tuning results to select_best()
Select the metric on which to evaluate performance

Returns a tibble with the best performing model and hyperparameter values

best_dt_model <- dt_tuning %>% 
  select_best(metric = 'roc_auc')


best_dt_model

# A tibble: 1 x 4
cost_complexity tree_depth  min_n  .config
     <dbl>         <int>    <int>   <chr>  
0.0000000758        14       39     Model1

Finalizing the workflow

The finalize_workflow() function will finalize a workflow that contains a model object with tuning parameters

Pass the workflow object
Pass a tibble with one row of final model hyperparameter values
- Column names must match hyperparameters in model object

Returns a workflow object with set hyperparameter values

final_leads_wkfl <- leads_tune_wkfl %>% 
  finalize_workflow(best_dt_model)

final_leads_wkfl

== Workflow ========================================
Preprocessor: Recipe
Model: decision_tree()
-- Preprocessor ------------------------------------
3 Recipe Steps
* step_corr()
* step_normalize()
* step_dummy()
-- Model --------------------------------------------
Decision Tree Model Specification (classification)
Main Arguments:
  cost_complexity = 0.0000000758
  tree_depth = 14
  min_n = 39
Computational engine: rpart

Model fitting

The finalized workflow object can be trained with last_fit() and the original data split object, leads_split

Behind the scenes

Training and test datasets created
Recipe trained and applied
Tuned decision tree trained with entire training dataset
Predictions and metrics on test data

leads_final_fit <- final_leads_wkfl %>% 
  last_fit(split = leads_split)


leads_final_fit %>% 
  collect_metrics()

# A tibble: 2 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.771
2 roc_auc  binary         0.793
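
The objects used in these slides (dt_tuning, leads_tune_wkfl, leads_split) come from earlier in the course. For a version that runs on its own, the sketch below repeats the select, finalize, and fit steps on the two_class_dat dataset from modeldata. The dataset, the formula, the seed, and all object names are assumptions made for illustration, and the metric values it prints will differ from those above.

```r
library(tidymodels)

set.seed(214)  # arbitrary seed, only for reproducibility of the sketch

# Example data from modeldata: predictors A and B, two-level outcome Class
data(two_class_dat, package = "modeldata")

# Train/test split and cross-validation folds on the training portion
toy_split <- initial_split(two_class_dat, strata = Class)
toy_folds <- vfold_cv(training(toy_split), v = 5, strata = Class)

# Workflow with a decision tree whose hyperparameters are marked for tuning
toy_tune_wkfl <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(
    decision_tree(cost_complexity = tune(),
                  tree_depth = tune(),
                  min_n = tune()) %>%
      set_engine("rpart") %>%
      set_mode("classification")
  )

# Try 5 hyperparameter combinations on each fold
toy_tuning <- tune_grid(toy_tune_wkfl,
                        resamples = toy_folds,
                        grid = 5,
                        metrics = metric_set(roc_auc))

# Pick the best combination, plug it into the workflow, fit on the split
toy_best       <- select_best(toy_tuning, metric = "roc_auc")
toy_final_wkfl <- finalize_workflow(toy_tune_wkfl, toy_best)
toy_final_fit  <- last_fit(toy_final_wkfl, split = toy_split)

collect_metrics(toy_final_fit)      # test-set metrics
collect_predictions(toy_final_fit)  # test-set predictions
```

collect_predictions() returns the test-set predictions from the same last_fit() object, which is handy for a confusion matrix or ROC curve.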

Let's practice!
