Complete modeling workflow

Modeling with tidymodels in R

David Svancer

Data Scientist

Data resampling

Creating training and test datasets

initial_split()
- Create data split object
training()
- Build training dataset
testing()
- Build test dataset

leads_split <- initial_split(leads_df, 
                             strata = purchased)


leads_training <- leads_split %>% 
  training()


leads_test <- leads_split %>% 
  testing()
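
The split can be sanity-checked on simulated data. The sketch below does not use the course data: toy_df and its columns are hypothetical stand-ins for leads_df, and the seed is an assumption added for reproducibility. It writes out prop = 0.75, which is the initial_split() default used implicitly above.

```r
library(rsample)

set.seed(123)  # assumption: any seed works, it only makes the split repeatable

# Hypothetical stand-in for leads_df
toy_df <- data.frame(
  id = seq_len(200),
  total_visits = rpois(200, lambda = 5),
  purchased = factor(
    sample(c("yes", "no"), 200, replace = TRUE, prob = c(0.35, 0.65)),
    levels = c("yes", "no")
  )
)

# Stratify on the outcome so both sets keep a similar yes/no balance
toy_split <- initial_split(toy_df, prop = 0.75, strata = purchased)
toy_training <- training(toy_split)
toy_test <- testing(toy_split)

# Every row lands in exactly one of the two datasets
nrow(toy_training) + nrow(toy_test) == nrow(toy_df)
length(intersect(toy_training$id, toy_test$id)) == 0

# Class proportions should be close in both sets
prop.table(table(toy_training$purchased))
prop.table(table(toy_test$purchased))
```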

Model specification

Specify model with parsnip

logistic_reg()
- General interface to logistic regression models
set_engine()
- 'glm' engine
set_mode()
- purchased is a nominal outcome variable
- Mode should be 'classification'

logistic_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')

Logistic Regression Model Specification (classification)

Computational engine: glm

Feature engineering

Specify feature engineering steps with recipes

recipe()
- Model formula and training data
step_*() functions
- Sequential preprocessing steps

leads_recipe <- recipe(purchased ~ .,
                       data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

leads_recipe

Data Recipe
Inputs:
      role #variables
   outcome          1
 predictor          6

Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()

Recipe training

Train feature engineering steps on the training data

prep()
- Pass recipe object to prep()
- Add leads_training for training data

leads_recipe_prep <- leads_recipe %>% 
  prep(training = leads_training)

leads_recipe_prep

Data Recipe
Inputs:
      role #variables
   outcome          1
 predictor          6
Training data contained 996 data points and no missing data.

Operations:
Correlation filter removed pages_per_visit [trained]
Centering and scaling for total_visits ... [trained]
Dummy variables from lead_source, us_location [trained]

Preprocess training data

Apply the trained recipe to the training data and save the results for model fitting

leads_training_prep <- leads_recipe_prep %>% 
  bake(new_data = NULL)

leads_training_prep

# A tibble: 996 x 11
   total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
          <dbl>      <dbl>                 <dbl>                      <dbl>                <dbl>
 1        0.611      0.958 ...                 0                          0 ...                1
 2        0.103     -0.747 ...                 1                          0 ...                0
 3        0.611     -0.278 ...                 0                          1 ...                1
 4       -0.151     -0.842 ...                 0                          0 ...                1
 5       -0.659      1.19  ...                 1                          0 ...                0
# ... with 991 more rows

Preprocess test data

Apply the trained recipe to the test data and save the results for model evaluation

leads_test_prep <- leads_recipe_prep %>% 
  bake(new_data = leads_test)

leads_test_prep

# A tibble: 332 x 11
   total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
          <dbl>      <dbl>                 <dbl>                      <dbl>                <dbl>
 1        0.864     -0.984 ...                 0                          0 ...                1
 2       -0.151      1.33  ...                 0                          0 ...                0
 3       -0.405     -0.843 ...                 0                          1 ...                1
 4       -0.659     -1.14  ...                 1                          0 ...                0
 5        1.12       0.725 ...                 0                          0 ...                1
# ... with 327 more rows
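
Baking the test data reuses what prep() learned from the training data. The sketch below illustrates this for step_normalize() only: the test column is scaled with the training mean and standard deviation, not its own. The toy_* objects are hypothetical simulated data, not the leads data.

```r
library(recipes)

set.seed(42)  # assumption: seed added for reproducibility

# Hypothetical training and test sets with one numeric predictor
make_toy <- function(n) {
  data.frame(
    total_time = rnorm(n, mean = 600, sd = 150),
    purchased = factor(sample(c("yes", "no"), n, replace = TRUE),
                       levels = c("yes", "no"))
  )
}
toy_training <- make_toy(100)
toy_test <- make_toy(30)

# Train the recipe on the training data only
toy_recipe_prep <- recipe(purchased ~ ., data = toy_training) %>%
  step_normalize(all_numeric()) %>%
  prep(training = toy_training)

# Apply the trained recipe to the test data
toy_test_prep <- toy_recipe_prep %>%
  bake(new_data = toy_test)

# Manual normalization with the TRAINING mean and standard deviation
manual <- (toy_test$total_time - mean(toy_training$total_time)) /
  sd(toy_training$total_time)

isTRUE(all.equal(toy_test_prep$total_time, manual, check.attributes = FALSE))
```

Because the statistics come from the training set, the baked test column will generally not have mean exactly 0 or standard deviation exactly 1.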

Model fitting and predictions

Train logistic regression model with fit()
- Use the preprocessed training dataset, leads_training_prep

Obtain model predictions with predict()
- Predict outcome values and estimated probabilities
- Use the preprocessed test dataset, leads_test_prep

logistic_fit <- logistic_model %>% 
  fit(purchased ~ .,
      data = leads_training_prep)

class_preds <- predict(logistic_fit, 
                       new_data = leads_test_prep,
                       type = 'class')

prob_preds <- predict(logistic_fit, 
                      new_data = leads_test_prep,
                      type = 'prob')
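
To see what the two prediction types return, the same calls can be run on simulated data. This is a minimal sketch: toy_df, toy_train, and toy_test are hypothetical, and the outcome is generated from a made-up logistic relationship. With outcome levels 'yes' and 'no', type = 'class' should return a .pred_class column and type = 'prob' one probability column per level.

```r
library(tidymodels)

set.seed(2024)  # assumption: seed added for reproducibility

# Hypothetical data: purchase probability increases with total_time
n <- 200
toy_x <- rnorm(n)
toy_df <- data.frame(
  total_time = toy_x,
  purchased = factor(ifelse(runif(n) < plogis(1.5 * toy_x), "yes", "no"),
                     levels = c("yes", "no"))
)
toy_train <- toy_df[1:150, ]
toy_test <- toy_df[151:200, ]

toy_fit <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification') %>%
  fit(purchased ~ ., data = toy_train)

toy_class <- predict(toy_fit, new_data = toy_test, type = 'class')
toy_prob <- predict(toy_fit, new_data = toy_test, type = 'prob')

names(toy_class)  # one factor column of predicted classes
names(toy_prob)   # one probability column per outcome level

# The two probabilities in each row sum to 1
all(abs(toy_prob$.pred_yes + toy_prob$.pred_no - 1) < 1e-8)
```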

Combining prediction results

Combine predictions into a results dataset for yardstick metric functions

Select the actual outcome variable, purchased, from the test dataset
Bind the predictions with bind_cols()

leads_results <- leads_test %>% 
  select(purchased) %>%
  bind_cols(class_preds, prob_preds)


leads_results

# A tibble: 332 x 4
   purchased .pred_class .pred_yes .pred_no
   <fct>     <fct>           <dbl>    <dbl>
 1 no        no             0.257     0.743
 2 yes       yes            0.896     0.104
 3 no        no             0.0852    0.915
 4 no        no             0.183     0.817
 5 yes       yes            0.776     0.224
# ... with 327 more rows

Model evaluation

Evaluate model performance with yardstick

The results data can be used with all yardstick metric functions for model evaluation
Confusion matrix, sensitivity, specificity, and other metrics

leads_results %>% 
  conf_mat(truth = purchased, 
           estimate = .pred_class)

          Truth
Prediction yes  no
       yes  77  34
       no   43 178
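
Sensitivity, specificity, and accuracy follow directly from these four counts. The sketch below rebuilds a results data frame with the same cell counts (toy_results is a hypothetical reconstruction, not the original leads_results) and compares the yardstick functions sens(), spec(), and accuracy() with the hand calculation. It assumes 'yes' is the first factor level, which yardstick treats as the positive class by default.

```r
library(yardstick)

lv <- c("yes", "no")

# Hypothetical reconstruction of the 332 test rows from the confusion matrix:
# 77 true positives, 34 false positives, 43 false negatives, 178 true negatives
toy_results <- data.frame(
  purchased   = factor(rep(c("yes", "no", "yes", "no"),
                           times = c(77, 34, 43, 178)), levels = lv),
  .pred_class = factor(rep(c("yes", "yes", "no", "no"),
                           times = c(77, 34, 43, 178)), levels = lv)
)

toy_sens <- sens(toy_results, truth = purchased, estimate = .pred_class)
toy_spec <- spec(toy_results, truth = purchased, estimate = .pred_class)
toy_acc  <- accuracy(toy_results, truth = purchased, estimate = .pred_class)

# Hand calculations from the matrix
manual_sens <- 77 / (77 + 43)    # true positives / all actual 'yes'
manual_spec <- 178 / (178 + 34)  # true negatives / all actual 'no'
manual_acc  <- (77 + 178) / 332  # correct predictions / all rows

c(toy_sens$.estimate, manual_sens)
c(toy_spec$.estimate, manual_spec)
c(toy_acc$.estimate, manual_acc)
```

Sensitivity works out to about 0.64, specificity to about 0.84, and accuracy to about 0.77.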

Let's practice!
