How to grow your tree

Machine Learning with Tree-Based Models in R

Sandro Raabe

Data Scientist

Diabetes dataset

head(diabetes)
# A tibble: 6 x 9
  outcome pregnancies glucose blood_pressure skin_thickness insulin   bmi    age
  <fct>         <int>   <int>          <int>          <int>   <int> <dbl>  <int>
1 yes               6     148             72             35       0  33.6     50
2 no                1      85             66             29       0  26.6     31
3 yes               8     183             64              0       0  23.3     32
# ... with 3 more rows, and 1 more variable

Using the whole dataset

  • If you use all your data for training, no data is left to test the model

[Figure: model flow]


Data split

[Figure: data split; model and evaluation]


Splitting methods

[Figure: splitting methods]


The initial_split() function

  • Splits the data randomly into a single training set and a single test set
library(rsample)

# Split data proportionally (default: 0.75)
diabetes_split <- initial_split(diabetes, prop = 0.9)
diabetes_split
<Analysis/Assess/Total>
<692/76/768>
Note: initial_split() is from the rsample package
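
Because the split is random, repeated calls generally produce different partitions. The following is a minimal sketch, assuming a hypothetical data frame `toy_data` standing in for `diabetes` and an arbitrary seed value, of how calling `set.seed()` first makes the split repeatable:

```r
library(rsample)

# Hypothetical stand-in for the diabetes data: 200 rows, one predictor
toy_data <- data.frame(
  id      = 1:200,
  glucose = seq(80, 179.5, by = 0.5)
)

# Fix the random seed (the value 123 is arbitrary) so the split is repeatable
set.seed(123)
split_a <- initial_split(toy_data, prop = 0.9)

set.seed(123)
split_b <- initial_split(toy_data, prop = 0.9)

# With the same seed, both splits select the same training rows
identical(training(split_a)$id, training(split_b)$id)
```

Without the second `set.seed()` call, `split_b` would most likely contain a different set of rows.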

Functions training() and testing()

  • Extract training and test sets from a data split
diabetes_train <- training(diabetes_split)

diabetes_test <- testing(diabetes_split)
  • Verify the training proportion:
nrow(diabetes_train)/nrow(diabetes)
[1] 0.9007812
Note: training() and testing() are from the rsample package
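
Beyond checking the proportion, it can be useful to confirm that the two sets do not overlap and together cover the whole dataset. A small sketch, assuming a hypothetical data frame `toy_data` with an `id` column added only so rows can be traced:

```r
library(rsample)

# Hypothetical data with a row identifier so rows can be traced
toy_data <- data.frame(id = 1:100, x = rnorm(100))

toy_split <- initial_split(toy_data, prop = 0.9)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# No row should appear in both sets
no_overlap <- length(intersect(toy_train$id, toy_test$id)) == 0

# Together the two sets should contain every original row
full_cover <- setequal(c(toy_train$id, toy_test$id), toy_data$id)

c(no_overlap = no_overlap, full_cover = full_cover)
```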

Avoid class imbalances

# Training count of 'yes' and 'no' outcomes
counts_train <- table(diabetes_train$outcome)
counts_train
 no yes 
490  86
# Training proportion of 'yes' outcome
prop_yes_train <- counts_train["yes"]/
                  sum(counts_train)
prop_yes_train
0.15
# Test data count of 'yes' and 'no' outcomes
counts_test <- table(diabetes_test$outcome)
counts_test
 no yes 
 28  48
# Test data proportion of 'yes' outcome
prop_yes_test <- counts_test["yes"]/
                  sum(counts_test)
prop_yes_test
0.63
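
The same comparison can be written more compactly with base R's `prop.table()`. This sketch reuses the counts shown above and retypes them as named vectors (an assumption made so that the example runs without the `diabetes` data):

```r
# Outcome counts copied from the training and test tables above
counts_train <- c(no = 490, yes = 86)
counts_test  <- c(no = 28,  yes = 48)

# Share of each class within each set
prop_train <- prop.table(counts_train)
prop_test  <- prop.table(counts_test)

# Compare the 'yes' share in the two sets side by side
round(c(train = prop_train[["yes"]], test = prop_test[["yes"]]), 2)
```

The gap between the two shares (about 0.15 versus 0.63) is the imbalance this slide warns about.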

Solution - enforce similar distributions

initial_split(diabetes, 
              prop = 0.9, 
              strata = outcome)
  • Ensures a random split with a similar distribution of the outcome variable in the training and test sets
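
To see the effect of `strata`, one can compare the share of 'yes' outcomes in both sets after a stratified split. A minimal sketch, assuming a hypothetical imbalanced data frame `toy_data` (not the diabetes data), an arbitrary seed, and a hypothetical helper `prop_yes()`:

```r
library(rsample)

# Hypothetical imbalanced data: 150 'no' and 50 'yes' outcomes
toy_data <- data.frame(
  outcome = factor(rep(c("no", "yes"), times = c(150, 50))),
  x       = seq_len(200)
)

set.seed(42)  # arbitrary seed, only for repeatability
toy_split <- initial_split(toy_data, prop = 0.9, strata = outcome)

# Hypothetical helper: proportion of 'yes' outcomes in a data frame
prop_yes <- function(d) mean(d$outcome == "yes")

prop_yes_train <- prop_yes(training(toy_split))
prop_yes_test  <- prop_yes(testing(toy_split))

# The two proportions should be close to each other
c(train = prop_yes_train, test = prop_yes_test)
```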

Let's split!
