Building and tuning a random forest model

Machine Learning in the Tidyverse

Dmitriy (Dima) Gorenshteyn

Lead Data Scientist, Memorial Sloan Kettering Cancer Center

Cross Validation Performance

Linear Regression Model

Validate Mean Absolute Error: 1.5 Years
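
As a reminder of what the metric measures, here is a minimal sketch of a mean absolute error calculation. The `actual` and `predicted` vectors are hypothetical values made up for illustration; they are not the course data.

```r
# Hypothetical life expectancies in years (illustration only)
actual    <- c(70, 65, 80, 75)
predicted <- c(71, 63, 80, 76)

# Mean absolute error: the average size of the prediction errors,
# expressed in the same units as the outcome (years)
mae <- mean(abs(actual - predicted))
mae
```

An MAE of 1.5 years therefore means the predictions are off by about a year and a half on average.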

Another Model

Random Forest Benefits

  • Can handle non-linear relationships
  • Can handle interactions
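
The two benefits listed above can be illustrated with a small simulation. This is only a sketch: the data are simulated (not the course data), and the names `sim`, `mae_lm`, and `mae_rf` are introduced here for illustration. The outcome depends on `x1` through a squared term and on an `x1` by `x2` interaction, neither of which a main-effects linear model can represent.

```r
library(ranger)

set.seed(42)
n  <- 500
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)

# Simulated outcome: non-linear in x1, plus an x1:x2 interaction and noise
sim <- data.frame(x1 = x1, x2 = x2,
                  y  = x1^2 + x1 * x2 + rnorm(n, sd = 0.2))

train    <- sim[1:400, ]
validate <- sim[401:500, ]

# Main-effects linear model versus a default random forest
lm_model <- lm(y ~ ., data = train)
rf_model <- ranger(formula = y ~ ., data = train, seed = 42)

# Validation MAE for each model
mae_lm <- mean(abs(validate$y - predict(lm_model, validate)))
mae_rf <- mean(abs(validate$y - predict(rf_model, validate)$predictions))
c(linear = mae_lm, forest = mae_rf)
```

On data like this the forest should show the lower validation error, because it can approximate the curvature and the interaction without being told about them.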

Basic Random Forest Tools

Model
rf_model <- ranger(formula = ___, data = ___, seed = ___)

Prediction
prediction <- predict(rf_model, new_data)$predictions
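
A filled-in version of the two templates above might look like the following sketch. It assumes the built-in `mtcars` data with `mpg` as the outcome, standing in for the course data; the row split into `train_data` and `new_data` is arbitrary and only for illustration.

```r
library(ranger)

# Assumption: mtcars replaces the course data; mpg is the outcome
train_rows <- 1:24
train_data <- mtcars[train_rows, ]
new_data   <- mtcars[-train_rows, ]

# Fit a random forest with default hyper-parameters;
# seed makes the fit reproducible
rf_model <- ranger(formula = mpg ~ ., data = train_data, seed = 42)

# predict() returns an object; the numeric predictions
# are stored in its $predictions element
prediction <- predict(rf_model, new_data)$predictions
length(prediction)
```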

Build Basic Random Forest Models

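library(ranger)
library(dplyr)   # mutate() and %>%
library(purrr)   # map() and map2()

# Fit one random forest per fold, using that fold's train set
cv_models_rf <- cv_data %>% 
  mutate(model = map(train, ~ranger(formula = life_expectancy ~ ., 
                                    data = .x, seed = 42)))

# Predict life_expectancy for each fold's validate set
cv_prep_rf <- cv_models_rf %>% 
  mutate(validate_predicted = map2(model, validate, 
                                   ~predict(.x, .y)$predictions))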
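
To see the whole workflow run end to end, here is a self-contained sketch that also computes the validation MAE per fold. It makes several assumptions: `cv_data` is rebuilt from `mtcars` with `rsample::vfold_cv()` instead of the course data, `mpg` is the outcome, and the columns `validate_actual` and `validate_mae` are names chosen here for illustration.

```r
library(rsample)
library(dplyr)
library(purrr)
library(ranger)

set.seed(42)
# Assumption: 5-fold cross validation on mtcars stands in for cv_data
cv_data <- vfold_cv(mtcars, v = 5) %>%
  mutate(train    = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))

cv_eval_rf <- cv_data %>%
  mutate(
    # One model per fold
    model = map(train, ~ranger(formula = mpg ~ ., data = .x, seed = 42)),
    # Actual and predicted outcome for each validate set
    validate_actual    = map(validate, ~.x$mpg),
    validate_predicted = map2(model, validate, ~predict(.x, .y)$predictions),
    # Mean absolute error per fold
    validate_mae = map2_dbl(validate_actual, validate_predicted,
                            ~mean(abs(.x - .y)))
  )

# Average validation MAE across the folds
mean(cv_eval_rf$validate_mae)
```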

ranger Hyper-Parameters

Model
rf_model <- ranger(formula = ___, data = ___, seed = ___, mtry = ___, num.trees = ___)
Hyper-Parameters
name        range                       default
mtry        $1:number\ of\ features$    $\lfloor\sqrt{number\ of\ features}\rfloor$
num.trees   $1:\infty$                  $500$
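
As a quick check of how these two hyper-parameters are set and recorded, the sketch below fits one forest with defaults and one with explicit values. It assumes `mtcars` (outcome `mpg`, 10 predictors) in place of the course data; the values `mtry = 4` and `num.trees = 200` are arbitrary.

```r
library(ranger)

# Assumption: mtcars replaces the course data; mpg is the outcome
rf_default <- ranger(formula = mpg ~ ., data = mtcars, seed = 42)
rf_tuned   <- ranger(formula = mpg ~ ., data = mtcars, seed = 42,
                     mtry = 4, num.trees = 200)

# The fitted object records the hyper-parameter values that were used
c(default = rf_default$mtry,      tuned = rf_tuned$mtry)
c(default = rf_default$num.trees, tuned = rf_tuned$num.trees)
```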

Tune The Hyper-Parameters

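library(tidyr)   # crossing()

# Pair every fold with every candidate mtry value (5 folds x 5 values = 25 rows)
cv_tune <- cv_data %>% 
  crossing(mtry = 1:5)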

cv_tune
# A tibble: 25 x 5
   splits       id    train                validate            mtry
   <list>       <chr> <list>               <list>             <int>
 1 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     1
 2 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     2
 3 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     3
 4 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     4
 5 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>     5
 6 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [601 × 7]>     1
 7 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [601 × 7]>     2
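
The row expansion performed by `crossing()` is easier to see on a tiny input. In this sketch `folds` is a hypothetical two-row stand-in for `cv_data`, holding only the fold id.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for cv_data: two folds, id column only
folds <- tibble(id = c("Fold1", "Fold2"))

# Every fold is paired with every mtry value
grid <- crossing(folds, mtry = 1:5)
nrow(grid)   # 2 folds x 5 mtry values = 10 rows
```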

Tune The Hyper-Parameters

# Fit one model for each fold and mtry combination
cv_model_tunerf <- cv_tune %>% 
  mutate(model = map2(train, mtry, ~ranger(formula = life_expectancy ~ ., 
                                           data = .x, mtry = .y)))

cv_model_tunerf
# A tibble: 25 x 6
   splits       id    train                validate      mtry  model       
 * <list>       <chr> <list>               <list>        <int> <list>      
 1 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   1    <S3: ranger>
 2 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   2    <S3: ranger>
 3 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   3    <S3: ranger>
 4 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   4    <S3: ranger>
 5 <S3: rsplit> Fold1 <tibble [2,402 × 7]> <tibble [60…   5    <S3: ranger>
 6 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [60…   1    <S3: ranger>
 7 <S3: rsplit> Fold2 <tibble [2,402 × 7]> <tibble [60…   2    <S3: ranger>
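
A natural follow-up is to compare the candidate `mtry` values by their average validation error. The sketch below is one way to do that, under several assumptions: `mtcars` (outcome `mpg`) replaces the course data, only `mtry = 1:3` is tried to keep it quick, and the names `cv_eval_tunerf`, `validate_mae`, and `mtry_summary` are introduced here for illustration.

```r
library(rsample)
library(dplyr)
library(purrr)
library(tidyr)
library(ranger)

set.seed(42)
# Assumption: 5-fold cross validation on mtcars stands in for cv_data
cv_data <- vfold_cv(mtcars, v = 5) %>%
  mutate(train    = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))

cv_eval_tunerf <- cv_data %>%
  crossing(mtry = 1:3) %>%
  mutate(
    # One model per fold and mtry combination
    model = map2(train, mtry, ~ranger(formula = mpg ~ ., data = .x,
                                      mtry = .y, seed = 42)),
    validate_predicted = map2(model, validate,
                              ~predict(.x, .y)$predictions),
    # Validation MAE for each fitted model
    validate_mae = map2_dbl(validate, validate_predicted,
                            ~mean(abs(.x$mpg - .y)))
  )

# Average the validation MAE across folds for each mtry value
mtry_summary <- cv_eval_tunerf %>%
  group_by(mtry) %>%
  summarise(mean_mae = mean(validate_mae))
mtry_summary
```

The `mtry` value with the smallest `mean_mae` would be the candidate to carry forward.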

Let's practice!
