Building and tuning a random forest model

Machine Learning in the Tidyverse

Dmitriy (Dima) Gorenshteyn

Lead Data Scientist, Memorial Sloan Kettering Cancer Center

Cross Validation Performance

Linear Regression Model

Validation mean absolute error: 1.5 years
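As a reminder of what this metric measures, here is a minimal base R sketch of mean absolute error. The vectors `actual` and `predicted` are made-up illustration values, not the course data.

```r
# Hypothetical values for illustration only (not from the course data)
actual    <- c(70, 65, 80, 75)
predicted <- c(72, 64, 77, 73)

# MAE: the average absolute difference between actual and predicted values
mae <- mean(abs(actual - predicted))
mae
```

The result is in the same units as the outcome, which is why the slide can report the error in years.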

Another Model

Random Forest Benefits

  • Can handle non-linear relationships
  • Can handle interactions
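To make the first benefit concrete, the sketch below compares a linear model and a random forest on simulated data with a purely quadratic relationship. The data frame `sim`, the split sizes, and the helper `mae()` are all assumptions made for this illustration; they are not part of the course material.

```r
library(ranger)

# Simulated data (an assumption for illustration): y depends on x quadratically
set.seed(42)
sim <- data.frame(x = runif(300, -3, 3))
sim$y <- sim$x^2 + rnorm(300, sd = 0.2)

train_df    <- sim[1:200, ]
validate_df <- sim[201:300, ]

# Fit both models on the same training rows
lm_model <- lm(y ~ x, data = train_df)
rf_model <- ranger(formula = y ~ x, data = train_df, seed = 42)

# Hypothetical helper: mean absolute error
mae <- function(actual, predicted) mean(abs(actual - predicted))

mae_lm <- mae(validate_df$y, predict(lm_model, validate_df))
mae_rf <- mae(validate_df$y, predict(rf_model, validate_df)$predictions)
c(linear = mae_lm, random_forest = mae_rf)
```

A straight line cannot follow the curve, so one would expect the forest to show the lower validation error on data like this.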

Basic Random Forest Tools

Model
rf_model <- ranger(formula = ___, data = ___, seed = ___)

Prediction
prediction <- predict(rf_model, new_data)$predictions
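The two templates above can be filled in as follows. This is only a sketch: it uses the built-in `mtcars` data and an arbitrary row split as stand-ins, since the course data is not reproduced here.

```r
library(ranger)

# Stand-in data (an assumption): hold out the last 8 rows of mtcars as "new" data
train_df <- mtcars[1:24, ]
new_df   <- mtcars[25:32, ]

# Fit a random forest predicting mpg from all other columns
rf_model <- ranger(formula = mpg ~ ., data = train_df, seed = 42)

# predict() returns a prediction object; the numeric values are in $predictions
prediction <- predict(rf_model, new_df)$predictions
length(prediction)
```

Setting `seed` makes the fitted forest reproducible across runs.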

Build Basic Random Forest Models

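library(dplyr)   # mutate() and %>%
library(purrr)   # map() and map2()
library(ranger)

# Fit one random forest per cross-validation fold, using that fold's training data
cv_models_rf <- cv_data %>% 
  mutate(model = map(train, ~ranger(formula = life_expectancy ~ ., 
                                    data = .x, seed = 42)))

# Predict on each fold's validation set with the model trained on that fold
cv_prep_rf <- cv_models_rf %>% 
  mutate(validate_predicted = map2(model, validate, 
                                   ~predict(.x, .y)$predictions))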
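A natural next step is to turn the per-fold predictions into a per-fold validation MAE. The sketch below assumes a small stand-in for `cv_data` built from two manual folds of `mtcars`; the column names `validate_actual` and `validate_mae` are illustrative choices rather than names confirmed by the slides.

```r
library(dplyr)
library(purrr)
library(tibble)
library(ranger)

# Hypothetical stand-in for cv_data: two manual folds of mtcars
fold <- rep(1:2, length.out = nrow(mtcars))
cv_data <- tibble(
  id       = c("Fold1", "Fold2"),
  train    = list(mtcars[fold != 1, ], mtcars[fold != 2, ]),
  validate = list(mtcars[fold == 1, ], mtcars[fold == 2, ])
)

cv_eval_rf <- cv_data %>%
  mutate(
    model              = map(train, ~ranger(formula = mpg ~ ., data = .x, seed = 42)),
    validate_actual    = map(validate, ~.x$mpg),
    validate_predicted = map2(model, validate, ~predict(.x, .y)$predictions),
    # One MAE value per fold
    validate_mae       = map2_dbl(validate_actual, validate_predicted,
                                  ~mean(abs(.x - .y)))
  )

# Cross-validated estimate: the average MAE over the folds
mean(cv_eval_rf$validate_mae)
```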

ranger Hyper-Parameters

Model
rf_model <- ranger(formula = ___, data = ___, seed = ___, mtry = ___, num.trees = ___)
Hyper-Parameters
name       range                      default
mtry       1 to number of features    floor(sqrt(number of features))
num.trees  1 to infinity              500
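For a sense of scale, ranger's documented default for `mtry` is the square root of the number of features, rounded down. The sketch below assumes 6 features, which is what a 7-column training tibble with one response column would give; the variable names are hypothetical.

```r
# Assumed feature count: 7 columns minus the response column
n_features <- 6

# Default mtry: square root of the feature count, rounded down
default_mtry <- floor(sqrt(n_features))
default_mtry

# Full range of valid mtry values for this feature count
mtry_candidates <- seq_len(n_features)
mtry_candidates
```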

Tune The Hyper-Parameters

# crossing() (from tidyr) pairs every fold with each candidate mtry value
cv_tune <- cv_data %>% 
  crossing(mtry = 1:5)

cv_tune
# A tibble: 25 x 5
   splits       id    train                validate            mtry
   <list>       <chr> <list>               <list>             <int>
 1 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     1
 2 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     2
 3 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     3
 4 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     4
 5 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     5
 6 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [601 × 7]>     1
 7 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [601 × 7]>     2
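The row count follows from how `crossing()` works: it returns every combination of its inputs, so 5 folds times 5 `mtry` values gives 25 rows. A smaller sketch with a hypothetical two-fold tibble shows the same pattern.

```r
library(tibble)
library(tidyr)

# Hypothetical two-fold table standing in for cv_data
folds <- tibble(id = c("Fold1", "Fold2"))

# Every fold is paired with every candidate mtry value: 2 x 3 = 6 rows
tune_grid <- crossing(folds, mtry = 1:3)
nrow(tune_grid)
```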

Tune The Hyper-Parameters

# Fit one model per fold and mtry combination
cv_model_tunerf <- cv_tune %>% 
  mutate(model = map2(train, mtry, ~ranger(formula = life_expectancy ~ ., 
                                           data = .x, mtry = .y)))

cv_model_tunerf
# A tibble: 25 x 6
   splits       id    train                validate      mtry  model       
 * <list>       <chr> <list>               <list>        <int> <list>      
 1 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   1    <S3: ranger>
 2 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   2    <S3: ranger>
 3 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   3    <S3: ranger>
 4 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   4    <S3: ranger>
 5 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   5    <S3: ranger>
 6 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [60…   1    <S3: ranger>
 7 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [60…   2    <S3: ranger>
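Once a model exists for every fold and `mtry` combination, one plausible way to choose among them is to average the validation MAE per `mtry` value and take the smallest. The sketch below assumes a stand-in `cv_data` made of two manual folds of `mtcars`, a reduced grid of `mtry = 1:3`, and a fixed `seed` for reproducibility; the names `validate_mae`, `mae_by_mtry`, and `best_mtry` are illustrative.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(ranger)

# Hypothetical stand-in for cv_data: two manual folds of mtcars
fold <- rep(1:2, length.out = nrow(mtcars))
cv_data <- tibble(
  id       = c("Fold1", "Fold2"),
  train    = list(mtcars[fold != 1, ], mtcars[fold != 2, ]),
  validate = list(mtcars[fold == 1, ], mtcars[fold == 2, ])
)

cv_eval_tunerf <- cv_data %>%
  crossing(mtry = 1:3) %>%
  mutate(
    model = map2(train, mtry,
                 ~ranger(formula = mpg ~ ., data = .x, mtry = .y, seed = 42)),
    validate_predicted = map2(model, validate, ~predict(.x, .y)$predictions),
    validate_mae = map2_dbl(validate, validate_predicted,
                            ~mean(abs(.x$mpg - .y)))
  )

# Average the validation MAE across folds for each mtry value
mae_by_mtry <- cv_eval_tunerf %>%
  group_by(mtry) %>%
  summarise(mean_mae = mean(validate_mae))

# The mtry value with the lowest average validation error
best_mtry <- mae_by_mtry$mtry[which.min(mae_by_mtry$mean_mae)]
best_mtry
```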

Let's practice!
