Modeling with tidymodels in R
David Svancer
Data Scientist
Decision trees segment the predictor space into rectangular regions
Recursive binary splitting
Produces distinct rectangular regions


In the plot, interior nodes are shown as dashed lines and terminal nodes as highlighted rectangular regions
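As an illustration, a small classification tree can be fit on the built-in iris data to see the splitting rules that carve out these regions; a minimal sketch, assuming the rpart engine is installed:

library(tidymodels)

# Fit a classification tree on two predictors from the built-in iris data
toy_tree <- decision_tree() %>% 
  set_engine('rpart') %>% 
  set_mode('classification') %>% 
  fit(Species ~ Petal.Length + Petal.Width, data = iris)

# Each printed split, e.g. Petal.Length < 2.45, is a recursive binary
# split that partitions the two-predictor plane into rectangular regions
toy_tree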

Model specification in parsnip
Decision tree models are specified with decision_tree()
The parsnip engine is 'rpart'
The mode is either 'classification' or 'regression'; here it is set to 'classification'

dt_model <- decision_tree() %>% 
  set_engine('rpart') %>% 
  set_mode('classification')
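This specification can also be fit on its own, outside of a workflow; a minimal sketch, assuming the leads_training data used below:

dt_fit <- dt_model %>% 
  fit(purchased ~ ., data = leads_training)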
Data transformations for lead scoring data
Data transformations are captured in a recipe object
Two R objects to manage: the parsnip model and the recipe specification

leads_recipe <- recipe(purchased ~ ., data = leads_training) %>% 
  step_corr(all_numeric(), threshold = 0.9) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes())
leads_recipe
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()
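To inspect what these steps produce, the recipe can be trained and applied manually with prep() and bake(); a minimal sketch, assuming leads_training is available:

leads_recipe %>% 
  prep(training = leads_training) %>% 
  bake(new_data = NULL)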
The workflows package is designed to streamline the modeling process
It combines the parsnip model and recipe object into a single workflow object
A workflow is initialized with the workflow() function
The model is added with add_model() and the recipe object with add_recipe()

leads_wkfl <- workflow() %>% 
  add_model(dt_model) %>% 
  add_recipe(leads_recipe)

leads_wkfl
== Workflow =====================
Preprocessor: Recipe
Model: decision_tree()
-- Preprocessor -----------------
3 Recipe Steps
* step_corr()
* step_normalize()
* step_dummy()
-- Model --------------------------
Decision Tree Model Specification (classification)
Computational engine: rpart
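A workflow can also be trained directly on the training data with fit(); a minimal sketch, assuming leads_training:

leads_wkfl %>% 
  fit(data = leads_training)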
Training a workflow object
Pass the workflow to last_fit() and provide the data split object
Calculate performance metrics with collect_metrics()
Behind the scenes, the recipe is trained and applied before the model is fit and evaluated

leads_wkfl_fit <- leads_wkfl %>% 
  last_fit(split = leads_split)

leads_wkfl_fit %>% 
  collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.771
2 roc_auc binary 0.775
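The leads_split object here is an rsample data split; one possible way to create it, assuming a full leads_df tibble (a hypothetical name for the complete lead scoring data):

leads_split <- initial_split(leads_df, prop = 0.75, strata = purchased)
leads_training <- training(leads_split)
leads_test <- testing(leads_split)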
A workflow trained with last_fit() can be passed to collect_predictions()
Use yardstick functions on the predictions to explore performance with custom metrics

leads_wkfl_preds <- leads_wkfl_fit %>% 
  collect_predictions()

leads_wkfl_preds
# A tibble: 332 x 6
id .pred_yes .pred_no .row .pred_class purchased
<chr> <dbl> <dbl> <int> <fct> <fct>
train/test split 0.120 0.880 2 no no
train/test split 0.755 0.245 17 yes yes
train/test split 0.120 0.880 21 no no
train/test split 0.120 0.880 22 no no
train/test split 0.755 0.245 24 yes yes
# ... with 327 more rows
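Any yardstick function accepts these prediction columns; for example, a confusion matrix (a quick sketch):

leads_wkfl_preds %>% 
  conf_mat(truth = purchased, estimate = .pred_class)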
Create a custom metric set with metric_set()
Pass the predictions data to leads_metrics() to calculate the metrics

leads_metrics <- metric_set(roc_auc, sens, spec)

leads_wkfl_preds %>% 
  leads_metrics(truth = purchased, 
                estimate = .pred_class, 
                .pred_yes)
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.75
2 spec binary 0.783
3 roc_auc binary 0.775
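The same predictions can feed an ROC curve; a minimal sketch using yardstick's roc_curve() and its autoplot() method:

leads_wkfl_preds %>% 
  roc_curve(truth = purchased, .pred_yes) %>% 
  autoplot()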
Financial data for consumer loans at a bank
Outcome variable: loan_default

loans_df
# A tibble: 872 x 8
loan_default loan_purpose missed_payment_2_yr loan_amount interest_rate installment annual_income debt_to_income
<fct> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
no debt_consolidation no 25000 5.47 855. 62823 39.4
yes medical no 10000 10.2 364. 40000 24.1
no small_business no 13000 6.22 442. 65000 14.0
no small_business no 36000 5.97 1152. 125000 8.09
yes small_business yes 12000 11.8 308. 65000 20.1
# ... with 867 more rows
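The same model-plus-recipe workflow pattern can be applied to this data; a rough sketch, assuming loans_df is loaded (the split and recipe steps here are illustrative, not the course's exact choices):

loans_split <- initial_split(loans_df, strata = loan_default)

loans_recipe <- recipe(loan_default ~ ., data = training(loans_split)) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

loans_wkfl <- workflow() %>% 
  add_model(dt_model) %>% 
  add_recipe(loans_recipe)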