Nominal predictors

Modeling with tidymodels in R

David Svancer

Data Scientist

Nominal data

Data that encodes characteristics or groups

  • No meaningful order

Examples

  • Department within a company

    • Marketing, Finance, Technology
  • Native language

    • English, Czech, Spanish ...
  • Car type

    • SUV, sedan, compact ...

Transforming nominal predictors

Nominal data must be transformed to numeric data for modeling

One-Hot Encoding

  • Maps categorical values to a set of 0/1 indicator variables
  • One indicator variable for each unique value in the original data
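
As a minimal sketch of one-hot encoding, the base R function model.matrix() builds one indicator column per level when the intercept is dropped. The department vector below is invented for illustration and is not part of the course data.

```r
# Hypothetical example values (not from the lead scoring data)
department <- factor(c("Marketing", "Finance", "Technology", "Finance"))

# Dropping the intercept (- 1) keeps one 0/1 column per factor level
one_hot <- model.matrix(~ department - 1)

colnames(one_hot)
rowSums(one_hot)  # each row has exactly one indicator set to 1
```

Three distinct values yield three indicator columns, which is the defining property of one-hot encoding.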

 


Transforming nominal predictors

Dummy Variable Encoding

  • Excludes one value (the reference level) from the original set of data values
    • n distinct values produce (n - 1) indicator variables
  • Preferred method for modeling
    • Default in the recipes package
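
The n - 1 rule can be sketched with base R alone. This example assumes R's default treatment contrasts for unordered factors and reuses an invented department vector; it is an illustration, not how recipes implements the step internally.

```r
# Hypothetical example values (not from the lead scoring data)
department <- factor(c("Marketing", "Finance", "Technology", "Finance"))

# With an intercept, model.matrix() uses treatment contrasts by default:
# the first level (Finance) becomes the reference and gets no column
design <- model.matrix(~ department)

# Drop the intercept column to leave only the n - 1 indicator columns
dummy <- design[, -1, drop = FALSE]

nlevels(department)  # number of distinct values, n
ncol(dummy)          # number of indicator variables, n - 1
```

Rows for the reference level contain zeros in every indicator column, so no information is lost by leaving that level out.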

 


Lead scoring data

Nominal predictor variables: lead_source and us_location

leads_training
# A tibble: 996 x 7
  purchased total_visits total_time pages_per_visit total_clicks lead_source    us_location
  <fct>            <dbl>      <dbl>           <dbl>        <dbl> <fct>          <fct>
1 yes                  7       1148            7              59 direct_traffic west
2 no                   5        228            2.5            25 email          southeast
3 no                   7        481            2.33           21 organic_search west
4 no                   4        177            4              37 direct_traffic west
5 no                   2       1273            2              26 email          midwest
# ... with 991 more rows

Creating dummy variables

The step_dummy() function

  • Creates dummy variables from nominal predictor variables
recipe(purchased ~ ., data = leads_training) %>%
  step_dummy(lead_source, us_location) %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)
# A tibble: 332 x 12
   total_visits ... lead_source_email  lead_source_organic_search  lead_source_direct_traffic  us_location_southeast ... us_location_west
       <dbl>    ...      <dbl>                 <dbl>                      <dbl>                       <dbl>                     <dbl>
1        8      ...         0                    0                          1                          0                        1
2        4      ...         0                    0                          1                          0                        0
3        3      ...         0                    1                          0                          0                        1
4        2      ...         1                    0                          0                          0                        0
5        9      ...         0                    0                          1                          0                        1

# ... with 327 more rows
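
Because leads_training and leads_test are not defined on the slide, the following self-contained sketch runs the same pipeline on a small data frame. The name toy_leads and all of its values are invented stand-ins, and the sketch assumes the recipes package is installed.

```r
library(recipes)

# Hypothetical stand-in for leads_training (values invented for illustration)
toy_leads <- data.frame(
  purchased    = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_visits = c(7, 5, 7, 4, 2, 8),
  lead_source  = factor(c("direct_traffic", "email", "organic_search",
                          "direct_traffic", "email", "organic_search")),
  us_location  = factor(c("west", "southeast", "west",
                          "west", "midwest", "southeast"))
)

# Same steps as on the slide: specify, add step_dummy(), train, then apply
toy_baked <- recipe(purchased ~ ., data = toy_leads) %>%
  step_dummy(lead_source, us_location) %>%
  prep(training = toy_leads) %>%
  bake(new_data = toy_leads)

names(toy_baked)
```

Each three-level factor is replaced by two indicator columns named variable_level, and the first level of each factor serves as the reference.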

Selecting columns by type

Selecting by column type using all_nominal() and all_outcomes() selectors

  • -all_outcomes() excludes the nominal outcome variable, purchased
recipe(purchased ~ ., data = leads_training) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)
# A tibble: 332 x 12
   total_visits ... lead_source_email  lead_source_organic_search  lead_source_direct_traffic ... us_location_west
       <dbl>    ...      <dbl>                 <dbl>                      <dbl>                           <dbl>
1        8      ...         0                    0                          1                                 1
2        4      ...         0                    0                          1                                 0
3        3      ...         0                    1                          0                                 1
4        2      ...         1                    0                          0                                 0
5        9      ...         0                    0                          1                                 1
# ... with 327 more rows
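
The selector approach can be tried on a small invented data frame, as sketched here. toy_leads is a hypothetical stand-in for the course data. The second recipe uses all_nominal_predictors(), a shorthand that is assumed to be available in the installed recipes version (it was added in later releases and is not shown on the slide).

```r
library(recipes)

# Hypothetical stand-in for leads_training (values invented for illustration)
toy_leads <- data.frame(
  purchased    = factor(c("yes", "no", "no", "yes", "no", "yes")),
  total_visits = c(7, 5, 7, 4, 2, 8),
  lead_source  = factor(c("direct_traffic", "email", "organic_search",
                          "direct_traffic", "email", "organic_search")),
  us_location  = factor(c("west", "southeast", "west",
                          "west", "midwest", "southeast"))
)

# Select every nominal column, then exclude the outcome
by_type <- recipe(purchased ~ ., data = toy_leads) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep(training = toy_leads) %>%
  bake(new_data = toy_leads)

# Assumption: this shorthand selector exists in the installed recipes version
shorthand <- recipe(purchased ~ ., data = toy_leads) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep(training = toy_leads) %>%
  bake(new_data = toy_leads)

is.factor(by_type$purchased)  # the outcome is left as a factor
```

Selecting by type means the recipe does not need to be edited when nominal predictors are added to or removed from the data.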

Preprocessing nominal predictor variables

Modeling engines in R

  • Many include automatic dummy variable creation
    • Possible to use nominal predictors without preprocessing with step_dummy()
  • Behavior is not consistent across all engines
    • One-hot vs dummy variables
    • Naming of new variables

 

The recipes package provides a standardized way to prepare nominal predictors for modeling
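
As one illustration of the naming inconsistency, base R's lm() creates indicator variables for factor predictors automatically, but it joins the variable name and level with no separator, unlike the lead_source_email style produced by step_dummy(). The data frame below is invented for this sketch.

```r
# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  total_time  = c(1148, 228, 481, 177, 1273, 640),
  lead_source = factor(c("direct_traffic", "email", "organic_search",
                         "direct_traffic", "email", "organic_search"))
)

# lm() builds the dummy variables itself; no preprocessing step is required
fit <- lm(total_time ~ lead_source, data = toy)

# Coefficient names show the engine's own naming convention
names(coef(fit))
```

Here direct_traffic is absorbed into the intercept as the reference level, and the remaining levels appear as lead_sourceemail and lead_sourceorganic_search.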


Let's practice!
