Feature engineering

Modeling with tidymodels in R

David Svancer

Data Scientist

Feature engineering with the recipes package

Feature engineering workflow with recipes

Specifying variable types and roles

 

Define column roles

  • Assign outcome or predictor role to all variables

Determine variable data types

  • Numeric data
  • Categorical data

 

Accomplished with the recipe() function

 

Specifying a recipe with the recipe function

Data preprocessing steps

Add required data preprocessing steps

  • Imputation of missing data
  • Data transformations
    • Centering and scaling numeric variables
  • Creating new variables
    • Calculating ratios of variables
  • And many more...

Each step is added with a unique step_*() function

 

Adding data transformations with step functions

Reference for all step functions: https://recipes.tidymodels.org/reference/index.html
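
As a minimal sketch of how the step types listed above chain together, the example below adds one imputation step, one ratio variable, and one centering and scaling step. The data frame `toy` and its column names are hypothetical and exist only for illustration. It assumes a recent version of recipes in which `step_impute_mean()` and `all_numeric_predictors()` are available.

```r
library(recipes)  # also attaches dplyr and provides %>%

# Hypothetical toy data; names are illustrative only
toy <- data.frame(
  y      = c(1.2, 3.4, 2.2, 5.1, 4.0),
  clicks = c(10, 20, NA, 40, 30),
  visits = c(2, 4, 5, 8, 5)
)

toy_rec <- recipe(y ~ ., data = toy) %>%
  step_impute_mean(clicks) %>%                         # imputation of missing data
  step_mutate(clicks_per_visit = clicks / visits) %>%  # new variable: a ratio
  step_normalize(all_numeric_predictors())             # centering and scaling

toy_rec
```

Steps run in the order they are added, so here the missing value is imputed before the ratio is calculated.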

Training preprocessing steps

Recipe objects are trained on a data source, typically the training dataset

  • Data transformations are estimated
    • Mean and standard deviation of numeric columns for centering and scaling
    • Formulas for creating new columns are stored for applying to new data

 

Recipes are trained with the prep() function

 

The prep function will train feature engineering steps using the training dataset
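
To make "estimated" concrete, the sketch below trains a single centering and scaling step and then inspects what was learned. The data frame `train_toy` is hypothetical. The example assumes that `tidy()` on a trained `step_normalize()` returns one row per estimated statistic, which is the behavior in recent versions of recipes.

```r
library(recipes)

# Hypothetical training data
train_toy <- data.frame(y = c(1, 0, 1, 0), x = c(2, 4, 6, 8))

norm_rec <- recipe(y ~ x, data = train_toy) %>%
  step_normalize(x)

# prep() estimates the mean and standard deviation of x from train_toy
norm_rec_prep <- prep(norm_rec, training = train_toy)

# Show the statistics stored by the first (and only) step
tidy(norm_rec_prep, number = 1)
```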

Applying recipes to new data

Apply all trained data preprocessing transformations

  • To the training and test datasets for modeling
  • To new sources of data for future predictions
    • Models need new data in the same format as the training data in order to make predictions

 

Recipes are applied with the bake() function

 

The bake function will apply a trained recipe to new data sources
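
The following sketch illustrates that a trained recipe applies the statistics learned from the training data to new rows, rather than re-estimating them. Both data frames are hypothetical, and `all_predictors()` is passed to `bake()` here so that only predictor columns are returned for data that has no outcome.

```r
library(recipes)

# Hypothetical training data and unseen rows without an outcome column
train_toy <- data.frame(y = c(1, 0, 1, 0), x = c(2, 4, 6, 8))
new_rows  <- data.frame(x = c(5, 10))

center_rec_prep <- recipe(y ~ x, data = train_toy) %>%
  step_center(x) %>%
  prep(training = train_toy)

# x becomes 0 and 5: each value minus the training mean of 5
new_baked <- bake(center_rec_prep, new_data = new_rows, all_predictors())
new_baked
```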

Simple feature engineering pipeline

Log transform total_time in lead scoring data

  • Common transformation for large data values
  • Compresses the range of data values and reduces variability
leads_training
# A tibble: 996 x 7
  purchased total_visits total_time pages_per_visit total_clicks lead_source    us_location
  <fct>            <dbl>      <dbl>           <dbl>        <dbl> <fct>          <fct>
1 yes                  7       1148            7              59 direct_traffic west
2 no                   5        228            2.5            25 email          southeast
3 no                   7        481            2.33           21 organic_search west
4 no                   4        177            4              37 direct_traffic west
5 no                   2       1273            2              26 email          midwest
# ... with 991 more rows
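
A quick base R check shows the compression on the five total_time values printed above. This is only an illustration on those five rows, not on the full dataset.

```r
# First five total_time values from leads_training, copied from the printout
total_time <- c(1148, 228, 481, 177, 1273)

diff(range(total_time))         # spread on the original scale
diff(range(log10(total_time)))  # spread after a base-10 log transform

round(log10(total_time), 2)
# 3.06 2.36 2.68 2.25 3.10
```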

Building a recipe object

The recipe() function

  • Model formula
    • Assigns variable roles
  • data argument
    • Determines variable data types

 

Pass recipe object to step_log() to add logarithm transformation step

  • Select variable for transformation, total_time, and specify logarithm base
leads_log_rec <- recipe(purchased ~ ., 
                        data = leads_training) %>% 
  step_log(total_time, base = 10)
leads_log_rec
Data Recipe
Inputs:
      role #variables
   outcome          1
 predictor          6

Operations:

Log transformation on total_time

Explore variable roles and types

Passing a recipe object to the summary() function

  • Creates a tibble with variable information
  • type column
    • Captures data type of variable
    • 'nominal' represents categorical variables
  • role column
    • Captures variable roles for modeling
    • Assigned based on input model formula
leads_log_rec %>% 
  summary()
# A tibble: 7 x 4
  variable        type     role       source  
  <chr>           <chr>    <chr>      <chr>   
1 total_visits    numeric  predictor  original
2 total_time      numeric  predictor  original
3 pages_per_visit numeric  predictor  original
4 total_clicks    numeric  predictor  original
5 lead_source     nominal  predictor  original
6 us_location     nominal  predictor  original
7 purchased       nominal  outcome    original
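
Because summary() returns an ordinary tibble, its columns can be used programmatically. The sketch below pulls the predictor names from the role column. The data frame `leads_toy` is a hypothetical three-row stand-in for leads_training.

```r
library(recipes)

# Hypothetical stand-in for leads_training
leads_toy <- data.frame(
  purchased   = factor(c("yes", "no", "no")),
  total_time  = c(1148, 228, 481),
  lead_source = factor(c("direct_traffic", "email", "organic_search"))
)

toy_info <- recipe(purchased ~ ., data = leads_toy) %>%
  summary()

# Names of all variables assigned the predictor role by the formula
predictor_names <- toy_info$variable[toy_info$role == "predictor"]
predictor_names
```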

Training a recipe object

The prep() function

  • Takes a recipe object as the first argument
  • training argument
    • Specifies the data on which to train data preprocessing steps

 

Printing a trained recipe object

  • Trained steps are indicated by [trained]
leads_log_rec_prep <- leads_log_rec %>% 
  prep(training = leads_training)
leads_log_rec_prep
Data Recipe
Inputs:
      role #variables
   outcome          1
 predictor          6

Training data contained 996 data points and 
no missing data.

Operations:
Log transformation on total_time [trained]

Transforming the training data

The bake() function

  • First argument is a trained recipe object
  • new_data argument
    • Data on which to apply trained recipe
  • Training data
    • leads_training was used to train the recipe
    • By default, prep() retains the transformed training data (retain = TRUE)
    • Pass new_data = NULL to bake() to extract it
  • Returns a tibble with transformed data
leads_log_rec_prep %>% 
  bake(new_data = NULL)
# A tibble: 996 x 7
  total_visits total_time ... us_location  purchased
    <dbl>       <dbl>     ...   <fct>       <fct>
 1     7        3.06      ...   west         yes
 2     5        2.36      ...   southeast    no
 3     7        2.68      ...   west         no
 4     4        2.25      ...   west         no
 5     2        3.10      ...   midwest      no
# ... with 991 more rows
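
As a sanity check on what bake(new_data = NULL) returns, the sketch below runs the same pipeline end to end on a hypothetical five-row data frame, `leads_mini`, and compares the result with applying log10() directly.

```r
library(recipes)

# Hypothetical miniature version of the training data
leads_mini <- data.frame(
  purchased  = factor(c("yes", "no", "no", "no", "no")),
  total_time = c(1148, 228, 481, 177, 1273)
)

leads_mini_baked <- recipe(purchased ~ ., data = leads_mini) %>%
  step_log(total_time, base = 10) %>%
  prep(training = leads_mini) %>%
  bake(new_data = NULL)  # extract the retained, transformed training data

# The baked column matches a direct base-10 log of the original values
all.equal(leads_mini_baked$total_time, log10(leads_mini$total_time))
```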

Transforming new data

Transforming datasets not used during recipe training

  • Pass dataset to new_data argument
  • Trained recipe will apply all steps to new data sources
leads_log_rec_prep %>% 
  bake(new_data = leads_test)
# A tibble: 332 x 7
 total_visits  total_time ... us_location  purchased
     <dbl>       <dbl>    ...  <fct>       <fct>
 1     8          2       ...  west         no
 2     4          3.13    ...  northeast    yes
 3     3          2.25    ...  west         no
 4     2          1.20    ...  midwest      no
 5     9          3.01    ...  west         yes
# ... with 327 more rows

Let's get baking!
