Modeling with tidymodels in R
David Svancer
Data Scientist
Data that encodes characteristics or groups
Examples
Department within a company
Native language
Car type
Nominal data must be transformed to numeric data for modeling
One-Hot Encoding
Dummy Variable Encoding (the default in the recipes package)
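To make the difference between the two encodings concrete, here is a minimal base R sketch. The `lead_source` vector is made-up toy data (only the level names echo the slides), and `model.matrix()` is used as a stand-in for an encoding step; it is not how recipes implements it internally.

```r
# Hypothetical toy data; only the level names come from the slides
lead_source <- factor(c("direct_traffic", "email", "organic_search", "email"))

# One-hot encoding: one 0/1 indicator column per level (intercept removed)
one_hot <- model.matrix(~ lead_source - 1)

# Dummy variable encoding: the first level acts as the reference
# and gets no column of its own (intercept column dropped afterwards)
dummy <- model.matrix(~ lead_source)[, -1, drop = FALSE]

ncol(one_hot)  # 3 columns, one per level
ncol(dummy)    # 2 columns, reference level left out
```

With dummy variable encoding, a row of all zeros identifies the reference level, so no information is lost by dropping its column.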
Nominal predictor variables - lead_source and us_location
leads_training
# A tibble: 996 x 7
purchased total_visits total_time pages_per_visit total_clicks lead_source us_location
<fct> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 yes 7 1148 7 59 direct_traffic west
2 no 5 228 2.5 25 email southeast
3 no 7 481 2.33 21 organic_search west
4 no 4 177 4 37 direct_traffic west
5 no 2 1273 2 26 email midwest
# ... with 991 more rows
The step_dummy() function
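# Encode the named nominal predictors, train the recipe on the training data,
# then apply the trained recipe to the test data
recipe(purchased ~ ., data = leads_training) %>%
  step_dummy(lead_source, us_location) %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)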
# A tibble: 332 x 12
total_visits ... lead_source_email lead_source_organic_search lead_source_direct_traffic us_location_southeast ... us_location_west
<dbl> ... <dbl> <dbl> <dbl> <dbl> <dbl>
1 8 ... 0 0 1 0 1
2 4 ... 0 0 1 0 0
3 3 ... 0 1 0 0 1
4 2 ... 1 0 0 0 0
5 9 ... 0 0 1 0 1
# ... with 327 more rows
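The split between prep() and bake() can be tried on a small scale. The sketch below assumes the recipes package is installed; `toy_train` and `toy_test` are hypothetical stand-ins for `leads_training` and `leads_test`, not the course data.

```r
library(recipes)

# Hypothetical toy data standing in for leads_training and leads_test
toy_train <- data.frame(
  purchased   = factor(c("yes", "no", "no", "yes")),
  lead_source = factor(c("direct_traffic", "email", "organic_search", "email"))
)
toy_test <- data.frame(
  purchased   = factor(c("no", "yes"), levels = c("no", "yes")),
  lead_source = factor(c("email", "direct_traffic"),
                       levels = levels(toy_train$lead_source))
)

# prep() learns the factor levels of lead_source from the training data
toy_recipe <- recipe(purchased ~ ., data = toy_train) %>%
  step_dummy(lead_source) %>%
  prep(training = toy_train)

# bake() applies the trained encoding to new data
toy_baked <- bake(toy_recipe, new_data = toy_test)
names(toy_baked)
```

In this toy data, `direct_traffic` is the first factor level, so it becomes the reference level and appears in the result as zeros in both indicator columns.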
Selecting by column type using all_nominal() and all_outcomes() selectors
-all_outcomes() excludes the nominal outcome variable, purchased
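# Encode every nominal column except the outcome, train the recipe,
# then apply it to the test data
recipe(purchased ~ ., data = leads_training) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)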
# A tibble: 332 x 12
total_visits ... lead_source_email lead_source_organic_search lead_source_direct_traffic ... us_location_west
<dbl> ... <dbl> <dbl> <dbl> <dbl>
1 8 ... 0 0 1 1
2 4 ... 0 0 1 0
3 3 ... 0 1 0 1
4 2 ... 1 0 0 0
5 9 ... 0 0 1 1
# ... with 327 more rows
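A small check of what the selectors pick up, again assuming the recipes package is installed. `toy_leads` is a hypothetical data frame that only borrows column names from the slides.

```r
library(recipes)

# Hypothetical toy data with one nominal outcome and two nominal predictors
toy_leads <- data.frame(
  purchased    = factor(c("yes", "no", "no", "no")),
  total_visits = c(7, 5, 7, 2),
  lead_source  = factor(c("direct_traffic", "email", "organic_search", "email")),
  us_location  = factor(c("west", "southeast", "west", "midwest"))
)

# all_nominal() matches purchased, lead_source and us_location;
# -all_outcomes() removes purchased from that selection
selected <- recipe(purchased ~ ., data = toy_leads) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(training = toy_leads) %>%
  bake(new_data = NULL)  # new_data = NULL returns the processed training data

is.factor(selected$purchased)
names(selected)
```

The outcome stays a factor, while both nominal predictors are replaced by indicator columns, without having to name them one by one.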
Modeling engines in R
step_dummy()
The recipes package provides a standardized way to prepare nominal predictors for modeling