Modeling with tidymodels in R
David Svancer
Data Scientist
Outcome variable with two levels

The outcome variable, purchased, has two levels: yes, the positive class, and no, the negative class. In tidymodels, the outcome variable needs to be a factor; check the level order with levels().
leads_df
# A tibble: 1,328 x 7
purchased total_visits ... us_location
<fct> <dbl> ... <fct>
1 yes 7 ... west
2 no 8 ... west
3 no 5 ... southeast
# ... with 1,325 more rows
levels(leads_df[['purchased']])
[1] "yes" "no"
A confusion matrix is a matrix with counts of all combinations of actual and predicted outcome values. The diagonal cells count correct predictions; the off-diagonal cells count classification errors.

Creating confusion matrices and other model fit metrics with yardstick
The model results are stored in leads_results, which contains the true outcome (purchased), the predicted outcome (.pred_class), and the estimated probability of each class (.pred_yes and .pred_no).

leads_results
# A tibble: 332 x 4
purchased .pred_class .pred_yes .pred_no
<fct> <fct> <dbl> <dbl>
1 no no 0.134 0.866
2 yes yes 0.729 0.271
3 no no 0.133 0.867
4 no no 0.0916 0.908
5 yes yes 0.598 0.402
6 no no 0.128 0.872
7 yes no 0.112 0.888
8 no no 0.169 0.831
9 no no 0.158 0.842
10 yes yes 0.520 0.480
# ... with 322 more rows
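A results tibble like this is typically built by binding class and probability predictions onto the true outcomes. The sketch below assumes a fitted parsnip model, logistic_fit, and a test split, leads_test (both names are hypothetical here):

# Bind class predictions (.pred_class) and class probabilities
# (.pred_yes, .pred_no) to the true outcome column
leads_results <- leads_test %>%
  select(purchased) %>%
  bind_cols(predict(logistic_fit, new_data = leads_test, type = "class")) %>%
  bind_cols(predict(logistic_fit, new_data = leads_test, type = "prob"))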
The conf_mat() function

truth - column with true outcomes
estimate - column with predicted outcomes

Confusion matrix for the logistic regression model fit on leads_df:

conf_mat(leads_results,
         truth = purchased,
         estimate = .pred_class)
Truth
Prediction yes no
yes 74 34
no 46 178
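With yes as the positive class, the cells map to TP = 74, FP = 34, FN = 46, and TN = 178. A quick hand-check from these counts reproduces the yardstick values computed below:

(74 + 178) / 332    # accuracy:    0.759
74 / (74 + 46)      # sensitivity: 0.617
178 / (178 + 34)    # specificity: 0.840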
The accuracy() function

Takes the same arguments as conf_mat().

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

yardstick metric functions always return a tibble:
.metric - type of metric
.estimate - calculated value

accuracy(leads_results,
         truth = purchased,
         estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
In many cases, accuracy is not the best metric. The leads_df data is imbalanced: of the 332 rows in leads_results, 212 were not purchased, so a model that predicted no every time would still reach about 64% accuracy.
Sensitivity

Sensitivity is the proportion of all positive cases that were correctly classified.
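In the same notation as the accuracy formula above:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$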
The sens() function

Takes the same arguments as conf_mat() and accuracy() and returns the calculated value in the .estimate column.

sens(leads_results,
     truth = purchased,
     estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.617
Specificity

Specificity is the proportion of all negative cases that were correctly classified. The quantity 1 - Specificity is the false positive rate.
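Equivalently, in the same notation:

$$\text{Specificity} = \frac{TN}{TN + FP}$$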
The spec() function

Takes the same arguments as sens() and returns the calculated value in the .estimate column.

spec(leads_results,
     truth = purchased,
     estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 spec binary 0.840
User-defined metric sets

The metric_set() function combines multiple yardstick metrics into a single function. Pass yardstick metric function names into metric_set():

custom_metrics <- metric_set(accuracy, sens, spec)

custom_metrics(leads_results,
               truth = purchased,
               estimate = .pred_class)
# A tibble: 3 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
2 sens binary 0.617
3 spec binary 0.840
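Note that metric_set() returns a function that is called exactly like a single yardstick metric. The same pattern works for any combination of the metric functions listed below, for example a precision/recall-focused set (a sketch, not from the original code):

pr_metrics <- metric_set(precision, recall, f_meas)

pr_metrics(leads_results,
           truth = purchased,
           estimate = .pred_class)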
Binary classification metrics

yardstick provides a wide variety of binary classification metrics: accuracy(), kap(), sens(), spec(), ppv(), npv(), mcc(), j_index(), bal_accuracy(), detection_prevalence(), precision(), recall(), and f_meas().

Pass the results of conf_mat() to summary() to calculate all of them at once:

conf_mat(leads_results, truth = purchased,
         estimate = .pred_class) %>%
  summary()
# A tibble: 13 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.759
2 kap binary 0.466
3 sens binary 0.617
4 spec binary 0.840
5 ppv binary 0.685
6 npv binary 0.795
7 mcc binary 0.468
8 j_index binary 0.456
9 bal_accuracy binary 0.728
10 detection_prevalence binary 0.325
11 precision binary 0.685
12 recall binary 0.617
13 f_meas binary 0.649
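Because summary() also returns a tibble, standard dplyr verbs can narrow it down to the metrics of interest; a small sketch, assuming dplyr is loaded:

conf_mat(leads_results, truth = purchased,
         estimate = .pred_class) %>%
  summary() %>%
  filter(.metric %in% c("sens", "spec", "f_meas"))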