Modeling with tidymodels in R
David Svancer
Data Scientist
Outcome variable with two levels
In tidymodels, the outcome variable needs to be a factor. For the purchased variable, the first level ("yes") is the positive class and the second level ("no") is the negative class. Use levels() to check:

leads_df
# A tibble: 1,328 x 7
purchased total_visits ... us_location
<fct> <dbl> ... <fct>
1 yes 7 ... west
2 no 8 ... west
3 no 5 ... southeast
# ... with 1,325 more rows
levels(leads_df[['purchased']])
[1] "yes" "no"
The confusion matrix
A confusion matrix is a matrix with counts of all combinations of actual and predicted outcome values. The diagonal cells contain correct predictions; the off-diagonal cells contain classification errors.
Creating confusion matrices and other model fit metrics with yardstick
The leads_results tibble contains the true outcome (purchased), the predicted class (.pred_class), and the estimated class probabilities (.pred_yes and .pred_no).

leads_results
# A tibble: 332 x 4
purchased .pred_class .pred_yes .pred_no
<fct> <fct> <dbl> <dbl>
1 no no 0.134 0.866
2 yes yes 0.729 0.271
3 no no 0.133 0.867
4 no no 0.0916 0.908
5 yes yes 0.598 0.402
6 no no 0.128 0.872
7 yes no 0.112 0.888
8 no no 0.169 0.831
9 no no 0.158 0.842
10 yes yes 0.520 0.480
# ... with 322 more rows
The conf_mat() function
truth - column with true outcomes
estimate - column with predicted outcomes
Confusion matrix for the logistic regression results on leads_df:
conf_mat(leads_results,
         truth = purchased,
         estimate = .pred_class)
Truth
Prediction yes no
yes 74 34
no 46 178
The accuracy() function
Takes the same arguments as conf_mat(). Accuracy is the proportion of all predictions that were correct:
$$\frac{TP + TN}{TP + TN + FP + FN}$$
yardstick functions always return a tibble with a .metric column (the type of metric) and an .estimate column (the calculated value).

accuracy(leads_results,
         truth = purchased,
         estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
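The accuracy reported above can be checked by hand from the confusion matrix counts (TP = 74, TN = 178, FP = 34, FN = 46):

```r
# Cell counts taken from the confusion matrix above
TP <- 74   # predicted yes, actually yes
TN <- 178  # predicted no,  actually no
FP <- 34   # predicted yes, actually no
FN <- 46   # predicted no,  actually yes

accuracy <- (TP + TN) / (TP + TN + FP + FN)
round(accuracy, 3)
# [1] 0.759
```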
In many cases, accuracy is not the best metric. When one outcome class is much more common than the other, as in the leads_df data, a model can achieve high accuracy by simply predicting the majority class.
Sensitivity
Proportion of all positive cases that were correctly classified
The sens() function
Takes the same arguments as conf_mat() and accuracy(); the result appears in the .estimate column.

sens(leads_results,
     truth = purchased,
     estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.617
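Sensitivity can also be verified from the confusion matrix counts: it is the true positives divided by all actual positives (TP + FN):

```r
# Counts from the confusion matrix above
TP <- 74   # correctly predicted "yes"
FN <- 46   # actual "yes" predicted as "no"

sens <- TP / (TP + FN)
round(sens, 3)
# [1] 0.617
```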
Specificity is the proportion of all negative cases that were correctly classified. The false positive rate equals 1 - specificity.
The spec() function
Takes the same arguments as sens(); the result appears in the .estimate column.

spec(leads_results,
     truth = purchased,
     estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 spec binary 0.840
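Likewise, specificity is the true negatives divided by all actual negatives (TN + FP), which matches the value above:

```r
# Counts from the confusion matrix above
TN <- 178  # correctly predicted "no"
FP <- 34   # actual "no" predicted as "yes"

spec <- TN / (TN + FP)
round(spec, 3)
# [1] 0.84
```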
User-defined metric sets
The metric_set() function combines multiple yardstick metrics into a single function. Pass yardstick metric function names into metric_set():

custom_metrics <- metric_set(accuracy, sens, spec)
custom_metrics(leads_results,
truth = purchased,
estimate = .pred_class)
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
2 sens binary 0.617
3 spec binary 0.840
Binary classification metrics
Wide variety of binary classification metrics
accuracy(), kap(), sens(), spec(), ppv(), npv(), mcc(), j_index(), bal_accuracy(), detection_prevalence(), precision(), recall(), f_meas()
Pass the results of conf_mat() to summary() to calculate all of them at once:
conf_mat(leads_results, truth = purchased,
estimate = .pred_class) %>%
summary()
# A tibble: 13 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
2 kap binary 0.466
3 sens binary 0.617
4 spec binary 0.840
5 ppv binary 0.685
6 npv binary 0.795
7 mcc binary 0.468
8 j_index binary 0.456
9 bal_accuracy binary 0.728
10 detection_prevalence binary 0.325
11 precision binary 0.685
12 recall binary 0.617
13 f_meas binary 0.649
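Several of the metrics in the summary() output are simple combinations of the earlier ones. As a hand check from the confusion matrix counts, balanced accuracy is the mean of sensitivity and specificity, the J-index is their sum minus one, and the F-measure is the harmonic mean of precision and recall:

```r
# Counts from the confusion matrix above
TP <- 74; TN <- 178; FP <- 34; FN <- 46

sens      <- TP / (TP + FN)   # recall
spec      <- TN / (TN + FP)
precision <- TP / (TP + FP)   # ppv

round(c(
  bal_accuracy = (sens + spec) / 2,
  j_index      = sens + spec - 1,
  f_meas       = 2 * precision * sens / (precision + sens)
), 3)
# bal_accuracy: 0.728, j_index: 0.456, f_meas: 0.649
```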