Modeling with tidymodels in R
David Svancer
Data Scientist
Creating training and test datasets is the first step in the modeling process
Downside: a single training/test split gives only one estimate of model performance
Resampling technique for exploring model performance

Performing 5-fold cross validation

Five estimates of model performance in total
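The idea of five folds giving five performance estimates can be sketched by hand in base R. This is only an illustration under assumptions that are not part of the slides: it uses the built-in mtcars data, a linear regression, and RMSE as the metric, whereas the course works with a classification problem.

```r
# Manual 5-fold cross-validation in base R on the built-in mtcars data.
set.seed(214)

v <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of the five folds (sizes as equal as possible).
fold_id <- sample(rep(seq_len(v), length.out = n))

# For each fold: train on the other four folds, evaluate on the held-out fold.
fold_rmse <- sapply(seq_len(v), function(k) {
  train    <- mtcars[fold_id != k, ]
  held_out <- mtcars[fold_id == k, ]
  fit  <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = held_out)
  sqrt(mean((held_out$mpg - pred)^2))
})

fold_rmse        # five estimates of model performance, one per fold
mean(fold_rmse)  # their average
```

Every row is held out exactly once, so each observation contributes to exactly one of the five estimates.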

The vfold_cv() function
v: number of folds
strata: stratification variable
Call set.seed() before vfold_cv() for reproducibility
splits: list column of data split objects

set.seed(214)
leads_folds <- vfold_cv(leads_training,
                        v = 10,
                        strata = purchased)
leads_folds
# 10-fold cross-validation using stratification
# A tibble: 10 x 2
   splits            id
   <list>            <chr>
 1 <split [896/100]> Fold01
 2 <split [896/100]> Fold02
 3 <split [896/100]> Fold03
 . ................  ......
 9 <split [897/99]>  Fold09
10 <split [897/99]>  Fold10
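To see what a single entry of the splits column contains, the sketch below pulls the first fold apart with analysis() and assessment() from rsample. The data frame toy_training and its columns are simulated stand-ins invented for this example; they are not the leads data from the course.

```r
library(rsample)

# Simulated stand-in for leads_training (hypothetical data, not the course dataset).
set.seed(214)
toy_training <- data.frame(
  purchased = factor(
    sample(c("yes", "no"), 200, replace = TRUE, prob = c(0.35, 0.65)),
    levels = c("yes", "no")
  ),
  total_visits = rpois(200, lambda = 5)
)

# Set the seed right before vfold_cv() so the fold assignment is reproducible.
set.seed(214)
toy_folds <- vfold_cv(toy_training, v = 5, strata = purchased)

# Each element of the splits column is a data split object.
first_split <- toy_folds$splits[[1]]
fold_train  <- analysis(first_split)    # rows used to fit the model in fold 1
fold_test   <- assessment(first_split)  # rows held out for evaluation in fold 1

nrow(fold_train) + nrow(fold_test)      # equals nrow(toy_training)
```

The two counts correspond to the pair of numbers printed inside each split entry, such as [896/100] in the output above.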
The fit_resamples() function
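Takes a parsnip model or workflow object
resamples: cross validation folds
metrics: optional custom metric set
Each metric is estimated 10 times, once per fold
The mean column holds the average estimate across folds

leads_rs_fit <- leads_wkfl %>%
  fit_resamples(resamples = leads_folds,
                metrics = leads_metrics)

leads_rs_fit %>%
  collect_metrics()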
# A tibble: 3 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 roc_auc binary     0.823    10  0.0147
2 sens    binary     0.786    10  0.0203
3 spec    binary     0.855    10  0.0159
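The slides do not show how leads_wkfl and leads_metrics were built, so the following is a minimal end-to-end sketch under assumptions: the data are simulated, the column names are made up, and a plain logistic regression workflow stands in for whatever model the course used. It uses 5 folds rather than 10 to keep the run short.

```r
library(tidymodels)

# Simulated stand-in for the leads data (hypothetical; not the course dataset).
set.seed(214)
n <- 300
toy_training <- tibble(
  total_visits = rpois(n, lambda = 5),
  total_time   = rnorm(n, mean = 600, sd = 150)
) %>%
  mutate(
    prob      = plogis(-2 + 0.3 * total_visits + 0.001 * total_time),
    purchased = factor(ifelse(runif(n) < prob, "yes", "no"),
                       levels = c("yes", "no"))
  ) %>%
  select(-prob)

# Stratified cross-validation folds.
set.seed(214)
toy_folds <- vfold_cv(toy_training, v = 5, strata = purchased)

# Workflow: model specification plus a formula (assumed, for illustration).
toy_wkfl <- workflow() %>%
  add_model(logistic_reg()) %>%
  add_formula(purchased ~ total_visits + total_time)

# Custom metric set matching the metrics shown in the slides.
toy_metrics <- metric_set(roc_auc, sens, spec)

# Fit and evaluate the workflow once per fold.
toy_rs_fit <- toy_wkfl %>%
  fit_resamples(resamples = toy_folds, metrics = toy_metrics)

# One row per metric; n counts the folds that contributed to each mean.
toy_rs_fit %>%
  collect_metrics()
```

Depending on the installed tidymodels version, the summary may carry extra columns beyond those printed in the slides.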
The collect_metrics() function
summarize = FALSE returns all metric estimates for every cross validation fold
.metric column identifies the metric
.estimate column gives the estimated value for each fold

rs_metrics <- leads_rs_fit %>%
  collect_metrics(summarize = FALSE)

rs_metrics
# A tibble: 30 x 4
  id     .metric .estimator .estimate
  <chr>  <chr>   <chr>          <dbl>
1 Fold01 sens    binary         0.861
2 Fold01 spec    binary         0.891
3 Fold01 roc_auc binary         0.885
4 Fold02 sens    binary         0.778
5 Fold02 spec    binary         0.969
6 Fold02 roc_auc binary         0.885
# ... with 24 more rows
The collect_metrics() function returns a tibble
Results can be summarized with dplyr
Start with rs_metrics
Form groups by .metric values
Calculate summary statistics with summarize()

rs_metrics %>%
  group_by(.metric) %>%
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate),
            mean = mean(.estimate),
            sd = sd(.estimate))
# A tibble: 3 x 6
  .metric   min median   max  mean     sd
  <chr>   <dbl>  <dbl> <dbl> <dbl>  <dbl>
1 roc_auc 0.758  0.806 0.885 0.823 0.0466
2 sens    0.667  0.792 0.861 0.786 0.0642
3 spec    0.810  0.843 0.969 0.855 0.0502
Models trained with fit_resamples() are not able to provide predictions on new data sources
predict() does not accept resample objects

predict(leads_rs_fit,
        new_data = leads_test)

Error in UseMethod("predict") :
  no applicable method for 'predict' applied to an object of class
  "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

Purpose of fit_resamples(): explore and compare model performance, not produce a model for predicting on new data
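For contrast, one common way to obtain predictions is to fit the workflow itself on the training data and call predict() on that fitted workflow. The sketch below assumes simulated data and a simple logistic regression workflow; all object and column names are hypothetical, and the slides do not say this is the approach the course takes next.

```r
library(tidymodels)

# Simulated stand-in for the leads data (hypothetical; not the course dataset).
set.seed(214)
n <- 300
toy_data <- tibble(total_visits = rpois(n, lambda = 5)) %>%
  mutate(
    purchased = factor(
      ifelse(runif(n) < plogis(-1.5 + 0.3 * total_visits), "yes", "no"),
      levels = c("yes", "no")
    )
  )

# Training/test split, stratified by the outcome.
toy_split    <- initial_split(toy_data, prop = 0.75, strata = purchased)
toy_training <- training(toy_split)
toy_test     <- testing(toy_split)

toy_wkfl <- workflow() %>%
  add_model(logistic_reg()) %>%
  add_formula(purchased ~ total_visits)

# fit() returns a trained workflow, which predict() does accept.
toy_fit   <- toy_wkfl %>% fit(data = toy_training)
toy_preds <- predict(toy_fit, new_data = toy_test)

head(toy_preds)
```

The resampling results remain useful alongside this: they indicate how the model type is likely to perform, while the fitted workflow is what actually produces predictions.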