Modeling with tidymodels in R
David Svancer
Data Scientist
Decision trees segment the predictor space into rectangular regions
Recursive binary splitting
Recursive binary splitting produces distinct, non-overlapping rectangular regions
In the plots, interior nodes appear as dashed lines and terminal nodes as highlighted rectangular regions
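As a sketch of recursive binary splitting, the rpart package (the engine the course uses) can fit a small tree directly; the built-in iris data stands in here for the course's lead scoring data:

```r
# Illustrative sketch only: iris stands in for the lead scoring data
library(rpart)

# Each split thresholds a single predictor, so the terminal nodes
# partition the (Petal.Length, Petal.Width) plane into rectangles
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# Interior nodes show the split rules; starred rows are terminal nodes
print(fit)
```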
Model specification in parsnip
A decision tree model is specified with the decision_tree() function from parsnip
The set_engine() function selects the 'rpart' engine
The set_mode() function takes either 'classification' or 'regression'; for the lead scoring data we use 'classification'

dt_model <- decision_tree() %>%
  set_engine('rpart') %>%
  set_mode('classification')
Data transformations for lead scoring data
We now have two R objects to manage: the parsnip model specification and the recipe object

leads_recipe <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())
leads_recipe
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()
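Although a workflow handles this automatically, the recipe steps can also be trained and applied manually, which helps when inspecting the transformed predictors; a sketch using the objects defined above:

```r
# Assumes leads_recipe, leads_training, and leads_test from above
library(recipes)

# prep() estimates the step parameters (correlations, means, standard
# deviations, dummy variable levels) from the training data
leads_recipe_prep <- prep(leads_recipe, training = leads_training)

# bake() applies the trained steps to new data
bake(leads_recipe_prep, new_data = leads_test)
```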
The workflows package
The workflows package is designed to streamline the modeling process by combining the parsnip model and recipe object into a single workflow object
A workflow is initialized with the workflow() function
The model is added with add_model() and the recipe object with add_recipe()
leads_wkfl <- workflow() %>%
add_model(dt_model) %>%
add_recipe(leads_recipe)
leads_wkfl
== Workflow =====================
Preprocessor: Recipe
Model: decision_tree()
-- Preprocessor -----------------
3 Recipe Steps
* step_corr()
* step_normalize()
* step_dummy()
-- Model --------------------------
Decision Tree Model Specification (classification)
Computational engine: rpart
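A workflow does not have to be trained with last_fit(); as a sketch, it can also be fit on the training data alone and used for prediction with the standard fit() and predict() functions:

```r
# Assumes leads_wkfl, leads_training, and leads_test from above
library(tidymodels)

# fit() trains the recipe and the model on the training data only
leads_wkfl_trained <- leads_wkfl %>%
  fit(data = leads_training)

# predict() applies the trained recipe to new data before predicting;
# type = 'prob' returns estimated class probabilities
predict(leads_wkfl_trained, new_data = leads_test)
predict(leads_wkfl_trained, new_data = leads_test, type = 'prob')
```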
Training a workflow
To train a workflow object, pass it to last_fit() and provide the data split object
Behind the scenes, the recipe is trained and applied, the model is fit to the training data, and performance is estimated on the test data
Metrics are obtained with collect_metrics()

leads_wkfl_fit <- leads_wkfl %>%
  last_fit(split = leads_split)
leads_wkfl_fit %>% collect_metrics()
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.771
2 roc_auc binary 0.775
Collecting predictions
A workflow trained with last_fit() can be passed to collect_predictions() to obtain the test set predictions
The results work with yardstick functions to explore performance with custom metrics

leads_wkfl_preds <- leads_wkfl_fit %>%
  collect_predictions()

leads_wkfl_preds
# A tibble: 332 x 6
id .pred_yes .pred_no .row .pred_class purchased
<chr> <dbl> <dbl> <int> <fct> <fct>
train/test split 0.120 0.880 2 no no
train/test split 0.755 0.245 17 yes yes
train/test split 0.120 0.880 21 no no
train/test split 0.120 0.880 22 no no
train/test split 0.755 0.245 24 yes yes
# ... with 327 more rows
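Since the predictions tibble contains the estimated class probabilities, it also works with yardstick's curve functions; a sketch of an ROC curve from these predictions:

```r
# Assumes leads_wkfl_preds from above
library(yardstick)
library(ggplot2)

# roc_curve() takes the truth column and the estimated probability of
# the first factor level ('yes' here); autoplot() draws the curve
leads_wkfl_preds %>%
  roc_curve(truth = purchased, .pred_yes) %>%
  autoplot()
```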
Create a custom metric set with metric_set()
Pass the predictions dataset to leads_metrics() to calculate the metrics

leads_metrics <- metric_set(roc_auc, sens, spec)

leads_wkfl_preds %>%
  leads_metrics(truth = purchased,
                estimate = .pred_class,
                .pred_yes)
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.75
2 spec binary 0.783
3 roc_auc binary 0.775
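Another way to get several metrics at once, as a sketch: conf_mat() cross-tabulates the predicted and actual classes, and its summary() method derives many of the same metrics from that table:

```r
# Assumes leads_wkfl_preds from above
library(yardstick)

leads_wkfl_preds %>%
  conf_mat(truth = purchased, estimate = .pred_class) %>%
  summary()
```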
Financial data for consumer loans at a bank
The outcome variable is loan_default

loans_df
# A tibble: 872 x 8
loan_default loan_purpose missed_payment_2_yr loan_amount interest_rate installment annual_income debt_to_income
<fct> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
no debt_consolidation no 25000 5.47 855. 62823 39.4
yes medical no 10000 10.2 364. 40000 24.1
no small_business no 13000 6.22 442. 65000 14.0
no small_business no 36000 5.97 1152. 125000 8.09
yes small_business yes 12000 11.8 308. 65000 20.1
# ... with 867 more rows
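Before modeling the loans data, it would be split like the leads data; a hypothetical setup (the seed and 75% proportion are assumptions, not from the course):

```r
# Assumes loans_df from above; the seed and proportion are illustrative
library(rsample)

set.seed(214)
loans_split <- initial_split(loans_df, prop = 0.75,
                             strata = loan_default)

loans_training <- training(loans_split)
loans_test <- testing(loans_split)
```

Stratifying on loan_default keeps the proportion of defaults similar in the training and test sets.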