Modeling with tidymodels in R
David Svancer
Data Scientist
Creating training and test datasets
initial_split()
training()
testing()
leads_split <- initial_split(leads_df, strata = purchased)
leads_training <- leads_split %>% training()
leads_test <- leads_split %>% testing()
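The effect of the strata argument can be checked on a small example. The sketch below is illustrative only: toy_leads is a simulated stand-in for leads_df (which is not shown on these slides), and the 35% purchase rate is an assumption made for the demonstration.

```r
library(rsample)

set.seed(42)  # make the random split reproducible

# Hypothetical stand-in for leads_df: 140 "yes" and 260 "no" outcomes
toy_leads <- data.frame(
  total_visits = rpois(400, lambda = 5),
  purchased = factor(rep(c("yes", "no"), times = c(140, 260)),
                     levels = c("yes", "no"))
)

# Stratified split; prop = 0.75 is the initial_split() default
toy_split <- initial_split(toy_leads, prop = 0.75, strata = purchased)
toy_training <- training(toy_split)
toy_test <- testing(toy_split)

# The share of "yes" outcomes should be close to 0.35 in both pieces
prop.table(table(toy_training$purchased))
prop.table(table(toy_test$purchased))
```

Stratifying on the outcome keeps the class balance similar in the training and test sets, which matters most when one class is rare.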
Specify model with parsnip
logistic_reg()
set_engine()
set_mode()
purchased is a nominal outcome variable
logistic_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')
Logistic Regression Model
Specification (classification)
Computational engine: glm
Specify feature engineering steps with recipes
recipe()
step_*() functions
leads_recipe <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())
leads_recipe
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()
Train feature engineering steps on the training data
prep()
Pass recipe object to prep()
Use leads_training for training data
leads_recipe_prep <- leads_recipe %>%
  prep(training = leads_training)
leads_recipe_prep
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Training data contained 996 data points
and no missing data.
Operations:
Correlation filter removed pages_per_visit [trained]
Centering and scaling for total_visits ... [trained]
Dummy variables from lead_source, us_location [trained]
Apply trained recipe to the training data and save the results for model fitting
leads_training_prep <- leads_recipe_prep %>% bake(new_data = NULL)
leads_training_prep
# A tibble: 996 x 11
total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.611 0.958 ... 0 0 ... 1
2 0.103 -0.747 ... 1 0 ... 0
3 0.611 -0.278 ... 0 1 ... 1
4 -0.151 -0.842 ... 0 0 ... 1
5 -0.659 1.19 ... 1 0 ... 0
# ... with 991 more rows
Apply trained recipe to the test data and save the results for model evaluation
leads_test_prep <- leads_recipe_prep %>%
  bake(new_data = leads_test)
leads_test_prep
# A tibble: 332 x 11
total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.864 -0.984 ... 0 0 ... 1
2 -0.151 1.33 ... 0 0 ... 0
3 -0.405 -0.843 ... 0 1 ... 1
4 -0.659 -1.14 ... 1 0 ... 0
5 1.12 0.725 ... 0 0 ... 1
# ... with 327 more rows
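The prep() and bake() steps above estimate statistics from the training data only and then reuse them on new data. A minimal sketch of that behavior, assuming hypothetical toy_training and toy_test data frames and a single step_normalize() step:

```r
library(recipes)

# Hypothetical toy data, not the leads data from the slides
toy_training <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
toy_test     <- data.frame(y = c(5, 6),       x = c(50, 60))

# prep() estimates the mean and standard deviation of x from toy_training
toy_prep <- recipe(y ~ x, data = toy_training) %>%
  step_normalize(all_predictors()) %>%
  prep(training = toy_training)

# new_data = NULL returns the processed training data
baked_train <- bake(toy_prep, new_data = NULL)

# The test data is scaled with the training mean and sd, not its own
baked_test <- bake(toy_prep, new_data = toy_test)

baked_train$x
baked_test$x
```

The baked training column has mean 0 and standard deviation 1, while the baked test values do not, because they are transformed with the training statistics. This keeps information from the test set out of the feature engineering.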
Train logistic regression model with fit() using leads_training_prep
Obtain model predictions with predict() using leads_test_prep
logistic_fit <- logistic_model %>%
  fit(purchased ~ .,
      data = leads_training_prep)
class_preds <- predict(logistic_fit,
                       new_data = leads_test_prep,
                       type = 'class')
prob_preds <- predict(logistic_fit,
                      new_data = leads_test_prep,
                      type = 'prob')
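The shape of the two prediction types can be seen on a small self-contained example. This is a sketch under assumptions: toy_leads is simulated data with one predictor, and the names toy_fit, toy_class_preds and toy_prob_preds are hypothetical.

```r
library(parsnip)

set.seed(1)  # reproducible simulated data

# Hypothetical data: one numeric predictor and a yes/no outcome
toy_leads <- data.frame(x = rnorm(100))
toy_leads$purchased <- factor(
  ifelse(toy_leads$x + rnorm(100) > 0, "yes", "no"),
  levels = c("yes", "no")
)

toy_fit <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification') %>%
  fit(purchased ~ x, data = toy_leads)

# type = 'class' returns a single factor column named .pred_class
toy_class_preds <- predict(toy_fit, new_data = toy_leads, type = 'class')

# type = 'prob' returns one probability column per outcome level
toy_prob_preds <- predict(toy_fit, new_data = toy_leads, type = 'prob')

names(toy_class_preds)
names(toy_prob_preds)
```

Both calls return tibbles with one row per row of new_data, which is why they can be attached column-wise to the true outcome values.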
Combine predictions into a results dataset for yardstick metric functions
Select purchased from the test dataset
Add the predictions with bind_cols()
leads_results <- leads_test %>% select(purchased) %>%
bind_cols(class_preds, prob_preds)
leads_results
# A tibble: 332 x 4
purchased .pred_class .pred_yes .pred_no
<fct> <fct> <dbl> <dbl>
1 no no 0.257 0.743
2 yes yes 0.896 0.104
3 no no 0.0852 0.915
4 no no 0.183 0.817
5 yes yes 0.776 0.224
# ... with 327 more rows
Evaluate model performance with yardstick
yardstick metric functions for model evaluation
leads_results %>%
  conf_mat(truth = purchased,
           estimate = .pred_class)
          Truth
Prediction yes  no
       yes  77  34
       no   43 178
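The counts in the confusion matrix above are enough to derive common summary metrics. The sketch below rebuilds a results table with the same four cell counts (toy_results is a hypothetical reconstruction, not the actual leads_results) and passes it to yardstick metric functions.

```r
library(yardstick)

lv <- c("yes", "no")

# Rebuild truth/prediction pairs from the four cell counts:
# 77 (yes, yes), 34 (truth no, predicted yes),
# 43 (truth yes, predicted no), 178 (no, no)
toy_results <- data.frame(
  purchased   = factor(rep(c("yes", "no", "yes", "no"),
                           times = c(77, 34, 43, 178)), levels = lv),
  .pred_class = factor(rep(c("yes", "yes", "no", "no"),
                           times = c(77, 34, 43, 178)), levels = lv)
)

# By default yardstick treats the first factor level ("yes") as the event
acc       <- accuracy(toy_results, truth = purchased, estimate = .pred_class)
sens_est  <- sens(toy_results, truth = purchased, estimate = .pred_class)
spec_est  <- spec(toy_results, truth = purchased, estimate = .pred_class)

acc$.estimate       # (77 + 178) / 332
sens_est$.estimate  # 77 / (77 + 43)
spec_est$.estimate  # 178 / (178 + 34)
```

With these counts, accuracy is 255/332 (about 0.77), sensitivity is 77/120 (about 0.64) and specificity is 178/212 (about 0.84).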