Modeling with tidymodels in R
David Svancer
Data Scientist
Define column roles
Determine variable data types
Accomplished with the recipe() function

Add required data preprocessing steps
Each step is added with a unique step_*() function

recipe objects are trained on a data source, typically the training dataset
Recipes are trained with the prep() function

Apply all trained data preprocessing transformations
Recipes are applied with the bake() function
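The four stages above can be sketched end to end on a small made-up data frame. This is a minimal illustration, not the course's lead scoring example: `toy_training` and its columns are hypothetical names invented here, and the values are chosen so the base-10 logarithms are easy to read.

```r
library(recipes)

# Hypothetical toy data standing in for a real training set (an assumption)
toy_training <- data.frame(
  outcome  = factor(c("yes", "no", "no", "yes")),
  duration = c(10, 100, 1000, 10000)
)

# Define roles and types with recipe(), then add one preprocessing step
toy_rec <- recipe(outcome ~ ., data = toy_training) %>%
  step_log(duration, base = 10)

# Train the recipe on the training data
toy_rec_prep <- prep(toy_rec, training = toy_training)

# Apply the trained transformations to the training data
toy_baked <- bake(toy_rec_prep, new_data = NULL)
toy_baked$duration
```

The `duration` column of `toy_baked` holds the base-10 logarithms of the original values.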
Log transform total_time in lead scoring data

leads_training

# A tibble: 996 x 7
  purchased total_visits total_time pages_per_visit total_clicks lead_source    us_location
  <fct>            <dbl>      <dbl>           <dbl>        <dbl> <fct>          <fct>
1 yes                  7       1148            7              59 direct_traffic west
2 no                   5        228            2.5            25 email          southeast
3 no                   7        481            2.33           21 organic_search west
4 no                   4        177            4              37 direct_traffic west
5 no                   2       1273            2              26 email          midwest
# ... with 991 more rows
The recipe() function takes a model formula and a data argument
Pass the recipe object to step_log() to add a logarithm transformation step
Select the variable to transform, total_time, and specify the logarithm base

leads_log_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_log(total_time, base = 10)
leads_log_rec

Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Operations:

Log transformation on total_time
Passing a recipe object to the summary() function returns one row per variable
The type column gives each variable's data type
The role column gives each variable's role

leads_log_rec %>%
  summary()
# A tibble: 7 x 4
  variable        type    role      source
  <chr>           <chr>   <chr>     <chr>
1 total_visits    numeric predictor original
2 total_time      numeric predictor original
3 pages_per_visit numeric predictor original
4 total_clicks    numeric predictor original
5 lead_source     nominal predictor original
6 us_location     nominal predictor original
7 purchased       nominal outcome   original
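Because summary() returns a tibble, its result can be filtered like any other data frame. The sketch below pulls out the names of the predictor variables. The data frame `toy_data` and its columns are hypothetical, and the filter uses the role column because the exact format of the type column may differ between versions of recipes.

```r
library(recipes)
library(dplyr)

# Hypothetical toy data (an assumption, not the lead scoring data)
toy_data <- data.frame(
  outcome  = factor(c("yes", "no", "no", "yes")),
  duration = c(10, 100, 1000, 10000),
  channel  = factor(c("email", "web", "web", "email"))
)

toy_rec <- recipe(outcome ~ ., data = toy_data)

# Keep only the rows whose role is "predictor" and extract the variable names
predictor_names <- summary(toy_rec) %>%
  filter(role == "predictor") %>%
  pull(variable)

predictor_names
```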
The prep() function takes a recipe object as the first argument and the training data in the training argument
Printing a trained recipe object marks each trained operation with [trained]

leads_log_rec_prep <- leads_log_rec %>%
  prep(training = leads_training)

leads_log_rec_prep
Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          6

Training data contained 996 data points and no missing data.

Operations:

Log transformation on total_time [trained]
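As a side note on the training argument: in prep() it defaults to NULL, in which case the data supplied to recipe() is used for training. The sketch below compares the two calls on a hypothetical data frame `toy_training` (an invented name); it assumes the recipe was built from the same data that would otherwise be passed as training.

```r
library(recipes)

# Hypothetical toy data (an assumption)
toy_training <- data.frame(
  outcome  = factor(c("yes", "no", "no", "yes")),
  duration = c(10, 100, 1000, 10000)
)

toy_rec <- recipe(outcome ~ ., data = toy_training) %>%
  step_log(duration, base = 10)

# Training data passed explicitly
prep_explicit <- prep(toy_rec, training = toy_training)

# training left at its default of NULL: the data given to recipe() is used
prep_default <- prep(toy_rec)

# Compare the transformed column produced by each trained recipe
explicit_values <- bake(prep_explicit, new_data = NULL)$duration
default_values  <- bake(prep_default, new_data = NULL)$duration
all.equal(explicit_values, default_values)
```

Passing training explicitly, as the slides do, keeps the data source visible at the point where the recipe is trained.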
The bake() function takes a trained recipe object and a new_data argument
The recipe was already trained on leads_training by the prep() function
Pass NULL to new_data to extract the transformed training data

leads_log_rec_prep %>%
  bake(new_data = NULL)
# A tibble: 996 x 7
  total_visits total_time ... us_location purchased
         <dbl>      <dbl> ... <fct>       <fct>
1            7       3.06 ... west        yes
2            5       2.36 ... southeast   no
3            7       2.68 ... west        no
4            4       2.25 ... west        no
5            2       3.10 ... midwest     no
# ... with 991 more rows
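The transformed total_time values can be checked by hand in base R. The snippet below takes the five raw total_time values printed for leads_training earlier and applies a base-10 logarithm; only the variable names `total_time` and `log_total_time` are introduced here for illustration.

```r
# Raw total_time values from the first five rows of leads_training
total_time <- c(1148, 228, 481, 177, 1273)

# Base-10 logarithm, rounded to two decimals for comparison with the output above
log_total_time <- round(log(total_time, base = 10), 2)
log_total_time
# [1] 3.06 2.36 2.68 2.25 3.10
```

These match the total_time column returned by bake().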
Transforming datasets not used during recipe training
Pass the dataset to the new_data argument
The trained recipe will apply all steps to new data sources

leads_log_rec_prep %>%
  bake(new_data = leads_test)
# A tibble: 332 x 7
  total_visits total_time ... us_location purchased
         <dbl>      <dbl> ... <fct>       <fct>
1            8       2    ... west        no
2            4       3.13 ... northeast   yes
3            3       2.25 ... west        no
4            2       1.20 ... midwest     no
5            9       3.01 ... west        yes
# ... with 327 more rows
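A self-contained sketch of the same pattern, for readers without the lead scoring data: a recipe is trained on one data frame and then applied to a second one it has never seen. The names `toy_training` and `toy_test` and their values are hypothetical.

```r
library(recipes)

# Hypothetical training and test sets (assumptions, not the course data)
toy_training <- data.frame(
  outcome  = factor(c("yes", "no", "no", "yes")),
  duration = c(10, 100, 1000, 10000)
)
toy_test <- data.frame(
  outcome  = factor(c("no", "yes"), levels = c("no", "yes")),
  duration = c(50, 500)
)

# Build and train the recipe on the training set only
toy_rec_prep <- recipe(outcome ~ ., data = toy_training) %>%
  step_log(duration, base = 10) %>%
  prep(training = toy_training)

# Apply the trained steps to the unseen test set
toy_test_baked <- bake(toy_rec_prep, new_data = toy_test)

# The baked column agrees with a direct base-10 log of the raw test values
all.equal(toy_test_baked$duration, log10(toy_test$duration))
```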