Modeling with tidymodels in R
David Svancer
Data Scientist
Creating training and test datasets is the first step in the modeling process
Downside
Resampling technique for exploring model performance
Performing 5-fold cross validation
Five estimates of model performance in total
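The vfold_cv() function creates cross-validation folds from the training data
v: number of folds
strata: stratification variable
Call set.seed() before vfold_cv() for reproducibility
splits: list column of data split objects, one per fold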
set.seed(214)
leads_folds <- vfold_cv(leads_training,
                        v = 10,
                        strata = purchased)
leads_folds
# 10-fold cross-validation using stratification
# A tibble: 10 x 2
   splits            id
   <list>            <chr>
 1 <split [896/100]> Fold01
 2 <split [896/100]> Fold02
 3 <split [896/100]> Fold03
 . ................  ......
 9 <split [897/99]>  Fold09
10 <split [897/99]>  Fold10
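In a printed split such as `<split [896/100]>`, the first number is the count of rows used to train the model in that fold and the second is the count of rows held out to evaluate it. The sketch below illustrates this on the built-in `mtcars` data rather than the course's `leads_training` data; the object names `example_folds` and `first_split` are hypothetical, and `analysis()` and `assessment()` are the rsample accessors for the two parts of a split.

```r
library(rsample)

# Assumption: mtcars (32 rows) stands in for the training data
set.seed(214)
example_folds <- vfold_cv(mtcars, v = 4)

# Each element of the splits list column is one data split object
first_split <- example_folds$splits[[1]]

# Rows used for model fitting in this fold (3 of the 4 folds)
n_analysis <- nrow(analysis(first_split))

# Rows held out for performance estimation in this fold (the remaining fold)
n_assessment <- nrow(assessment(first_split))

c(analysis = n_analysis, assessment = n_assessment)
```

With 32 rows and 4 folds, each fold holds out 8 rows and trains on the other 24.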
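The fit_resamples() function trains and evaluates a model on each fold
Takes a parsnip model or workflow object
resamples: cross-validation folds
metrics: metric set to compute on each fold
Each metric is estimated 10 times, once per fold
The mean column gives the average across folds
leads_rs_fit <- leads_wkfl %>%
  fit_resamples(resamples = leads_folds,
                metrics = leads_metrics)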
leads_rs_fit %>% collect_metrics()
# A tibble: 3 x 5
  .metric .estimator  mean     n std_err
  <chr>   <chr>      <dbl> <int>   <dbl>
1 roc_auc binary     0.823    10  0.0147
2 sens    binary     0.786    10  0.0203
3 spec    binary     0.855    10  0.0159
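The slides do not show how `leads_wkfl` and `leads_metrics` were built, so the following is only a self-contained sketch of the same pattern under stated assumptions: it uses the `two_class_dat` data from the modeldata package in place of the leads data, a logistic regression workflow, and 5 folds. All `example_*` names are hypothetical.

```r
library(tidymodels)

# Assumption: two_class_dat (outcome Class, predictors A and B) stands in
# for the leads data used in the slides
data(two_class_dat, package = "modeldata")

set.seed(214)
example_folds <- vfold_cv(two_class_dat, v = 5, strata = Class)

# A workflow combining a parsnip model with a formula
example_wkfl <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_formula(Class ~ A + B)

# The metric set passed to the metrics argument
example_metrics <- metric_set(roc_auc, sens, spec)

# Fit and evaluate the workflow once per fold
example_rs_fit <- example_wkfl %>%
  fit_resamples(resamples = example_folds,
                metrics = example_metrics)

# One row per metric, averaged over the 5 folds
cv_metrics <- example_rs_fit %>% collect_metrics()
cv_metrics
```

Because there are 5 folds here, the `n` column of the result is 5 for every metric, just as it is 10 in the slide's output.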
The collect_metrics() function
Setting summarize = FALSE returns every metric estimate for every cross-validation fold
The .metric column identifies the metric
The .estimate column gives the estimated value for each fold
rs_metrics <- leads_rs_fit %>%
  collect_metrics(summarize = FALSE)
rs_metrics
# A tibble: 30 x 4
  id     .metric .estimator .estimate
  <chr>  <chr>   <chr>          <dbl>
1 Fold01 sens    binary         0.861
2 Fold01 spec    binary         0.891
3 Fold01 roc_auc binary         0.885
4 Fold02 sens    binary         0.778
5 Fold02 spec    binary         0.969
6 Fold02 roc_auc binary         0.885
# ... with 24 more rows
The collect_metrics() function returns a tibble
Results can be summarized with dplyr
Start with rs_metrics
Group by .metric values
Calculate summary statistics with summarize()
rs_metrics %>%
  group_by(.metric) %>%
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate),
            mean = mean(.estimate),
            sd = sd(.estimate))
# A tibble: 3 x 6
  .metric   min median   max  mean     sd
  <chr>   <dbl>  <dbl> <dbl> <dbl>  <dbl>
1 roc_auc 0.758  0.806 0.885 0.823 0.0466
2 sens    0.667  0.792 0.861 0.786 0.0642
3 spec    0.810  0.843 0.969 0.855 0.0502
Models trained with fit_resamples() cannot provide predictions on new data sources
The predict() function does not accept resample objects
Purpose of fit_resamples(): exploring model performance, not producing a model for prediction
predict(leads_rs_fit,
        new_data = leads_test)
Error in UseMethod("predict") :
  no applicable method for 'predict' applied to an object of class "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"
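For contrast, a workflow trained with `fit()` on a single training set does return an object that `predict()` accepts. The sketch below is not taken from the slides: it assumes the `two_class_dat` data from the modeldata package and a logistic regression workflow, and all `example_*` names are hypothetical.

```r
library(tidymodels)

# Assumption: two_class_dat stands in for the leads data
data(two_class_dat, package = "modeldata")

set.seed(214)
example_split <- initial_split(two_class_dat, prop = 0.75, strata = Class)
example_training <- training(example_split)
example_test <- testing(example_split)

example_wkfl <- workflow() %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  add_formula(Class ~ A + B)

# fit() trains one model on the training data and returns a fitted workflow
example_fit <- example_wkfl %>%
  fit(data = example_training)

# Unlike resample results, a fitted workflow works with predict()
example_preds <- predict(example_fit, new_data = example_test)
example_preds
```

The result is a tibble with one predicted class per row of the new data, in a column named `.pred_class`.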