Numeric predictors

Modeling with tidymodels in R

David Svancer

Data Scientist

Correlated predictor variables

Correlation measures the strength of a linear relationship between two numeric variables

  • Ranges from -1 to 1
  • Highly correlated predictors have a correlation near -1 or 1

    • Provide redundant information
    • Can cause model fitting problems (multicollinearity)
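
As a minimal sketch of the definition above, the snippet below computes a Pearson correlation by hand and compares it with base R's cor(). The vectors x and y are made-up illustration values, not columns from the lead scoring data.

```r
# Hypothetical example vectors (not from leads_training)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Pearson correlation: co-variation of x and y, scaled by the
# variation of each variable so the result lies between -1 and 1
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y)

# The two values agree up to floating point error
all.equal(r_manual, r_builtin)
```

Because y rises almost linearly with x, the result is close to 1.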

     

ggplot(leads_training,
       aes(x = pages_per_visit, y = total_clicks)) + 
  geom_point()  + 
  labs(title = 'Total Clicks vs Average Page Visits',
       y = 'Total Clicks', x = 'Average Pages per Visit')

Scatter plot of total clicks versus pages per visit in the lead scoring data


Finding correlated predictor variables

Calculate a correlation matrix

  • Pass dataset to select_if() function
    • Provide is.numeric as argument
  • Pass to cor() function
leads_training %>%
  select_if(is.numeric) %>%
  cor()
              total_visits total_time pages_per_visit total_clicks
total_visits        1.00       0.01            0.43         0.42
total_time          0.01       1.00            0.02         0.01
pages_per_visit     0.43       0.02            1.00         0.96
total_clicks        0.42       0.01            0.96         1.00
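
A correlation matrix gets harder to scan as columns are added. As a rough sketch, the base R snippet below re-enters the matrix printed above and lists the pairs whose absolute correlation is at least 0.9. The names cor_mat, high, and high_pairs are illustrative and not part of the course code.

```r
# Correlation matrix copied from the printed output above
vars <- c("total_visits", "total_time", "pages_per_visit", "total_clicks")
cor_mat <- matrix(
  c(1.00, 0.01, 0.43, 0.42,
    0.01, 1.00, 0.02, 0.01,
    0.43, 0.02, 1.00, 0.96,
    0.42, 0.01, 0.96, 1.00),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Look only at the upper triangle so each pair appears once,
# then keep positions where |r| is at least 0.9
high <- which(abs(cor_mat) >= 0.9 & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
high_pairs
```

Only the pages_per_visit and total_clicks pair is flagged, matching the 0.96 entry in the matrix.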

Processing correlated predictors

Removing multicollinearity with recipes

  • Specify recipe object with recipe() function
  • Pass to step_corr()
    • Add all numeric columns
      • Column names separated by commas
    • Provide correlation threshold
      • Absolute value
      • A threshold of 0.9 removes predictors until no remaining pair has a correlation of 0.9 or more, or -0.9 or less
leads_cor_rec <- recipe(purchased ~ .,
                        data = leads_training) %>%
  step_corr(total_visits, total_time, pages_per_visit, total_clicks,
            threshold = 0.9)
leads_cor_rec
Data Recipe
Inputs:
      role #variables
   outcome          1
 predictor          6

Operations:
Correlation filter on total_visits,..., total_clicks

Selecting predictors by type

  • all_outcomes()
    • Selects the outcome variable
  • all_numeric()
    • Selects all numeric variables
      • Will include the outcome variable if it is numeric

To select numeric predictors for recipe steps

  • Pass all_numeric() to step_*() functions
  • If outcome variable is numeric, also pass -all_outcomes()
leads_cor_rec <- recipe(purchased ~ .,
                        data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9)
leads_cor_rec
Data Recipe
Inputs:
      role #variables
   outcome          1
 predictor          6

Operations:
Correlation filter on all_numeric()
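
The lead scoring outcome is a factor, so the slides never need -all_outcomes(). The sketch below shows that case with a numeric outcome. The data frame toy and its columns are made up for illustration, and the example assumes the recipes package is installed.

```r
library(recipes)

# Hypothetical data with a numeric outcome y; x2 is nearly a copy of x1
set.seed(123)
toy <- data.frame(x1 = rnorm(100))
toy$x2 <- toy$x1 + rnorm(100, sd = 0.05)
toy$x3 <- rnorm(100)
toy$y  <- 2 * toy$x1 + rnorm(100)

# -all_outcomes() keeps the numeric outcome out of the correlation filter
toy_rec <- recipe(y ~ ., data = toy) %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9)

# new_data = NULL returns the processed training data
baked <- toy_rec %>%
  prep(training = toy) %>%
  bake(new_data = NULL)

names(baked)
```

One of x1 or x2 is dropped, while x3 and the outcome y are kept. Recent versions of recipes also offer all_numeric_predictors(), which should select the same columns here; whether it is available depends on the installed version.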

Training and applying the recipe

  • Train with prep()
    • Provide leads_training for training
  • Apply with bake()
    • pages_per_visit removed from leads_test
    • pages_per_visit will be removed from all future data as well
leads_cor_rec %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)
# A tibble: 332 x 6
total_visits total_time total_clicks ... purchased
    <dbl>       <dbl>        <dbl>   ...   <fct>
 1   8          100           24     ...    no
 2   4          1346          22     ...    yes
 3   3          176           27     ...    no
 4   2          16            12     ...    no
 5   9          1022          12     ...    yes
# ... with 327 more rows
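
To see which columns a trained recipe will drop, one option is tidy() on the prepped recipe. The sketch below uses made-up train and test data frames and the hypothetical helper make_data(); it assumes the recipes package is installed.

```r
library(recipes)

# Hypothetical training and test sets; x_b is nearly a copy of x_a
set.seed(42)
make_data <- function(n) {
  x_a <- rnorm(n)
  data.frame(
    x_a   = x_a,
    x_b   = x_a + rnorm(n, sd = 0.05),
    x_c   = rnorm(n),
    label = factor(sample(c("no", "yes"), n, replace = TRUE))
  )
}
train <- make_data(200)
test  <- make_data(50)

# prep() decides which columns to drop using the training data only
trained_rec <- recipe(label ~ ., data = train) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  prep(training = train)

# tidy() on step number 1 lists the columns the trained step removes
removed <- tidy(trained_rec, number = 1)$terms

# bake() drops those same columns from any new data
test_baked <- bake(trained_rec, new_data = test)
names(test_baked)
```

The decision is made once, at prep() time, so the test data never influences which predictor is removed.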

Normalization

Centering and scaling numeric variables

  • Subtract the mean
  • Divide by the standard deviation
  • Transforms data to standard deviation units
    • Transformed variable will have a mean of 0 and standard deviation of 1

The total_time variable in leads_training

  • Spending 1,273 seconds on the website is 1.19 standard deviations greater than the average time spent by customers

 

Example of normalizing the total time variable from leads training
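
The centering and scaling steps above can be sketched in base R. The total_time values below are made up for illustration and are not taken from leads_training.

```r
# Hypothetical time-on-site values in seconds (not from leads_training)
total_time <- c(120, 450, 980, 1273, 60, 700)

# Subtract the mean, then divide by the standard deviation
z <- (total_time - mean(total_time)) / sd(total_time)

# The normalized values have mean 0 and standard deviation 1
# (up to floating point error)
mean(z)
sd(z)

# Base R's scale() performs the same centering and scaling
all.equal(z, as.numeric(scale(total_time)))
```

Each value of z reads as the number of standard deviations an observation sits above or below the mean.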


Combining data preprocessing steps

Normalizing numeric predictors with recipes

  • step_normalize()
    • Column names or all_numeric() selector
    • Means and standard deviations are estimated from the training data columns and then applied to new data sources

Multiple step_*() functions can be added to a recipe

  • Order matters
leads_norm_rec <- recipe(purchased ~ .,
                         data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  step_normalize(all_numeric())
leads_norm_rec
Data Recipe
Inputs:
      role #variables
   outcome          1
 predictor          6

Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()

Transforming the test data

pages_per_visit is removed and numeric predictors are normalized

leads_norm_rec %>% 
  prep(training = leads_training) %>% 
  bake(new_data = leads_test)
# A tibble: 332 x 6
 total_visits  total_time  total_clicks  lead_source   us_location  purchased
      <dbl>      <dbl>        <dbl>        <fct>          <fct>      <fct>    
 1    0.864     -0.984     -0.360        direct_traffic   west        no       
 2   -0.151      1.33      -0.506        direct_traffic   northeast   yes      
 3   -0.405     -0.843     -0.140        organic_search   west        no       
 4   -0.659     -1.14      -1.24         email            midwest     no       
 5    1.12       0.725     -1.24         direct_traffic   west        yes           
# ... with 327 more rows
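
The normalized test values come from the training means and standard deviations, not from the test data itself. The sketch below checks this on made-up data; train, test, and by_hand are illustrative names, and the example assumes the recipes package is installed.

```r
library(recipes)

# Hypothetical training and test sets with one numeric predictor
set.seed(1)
train <- data.frame(x = rnorm(100, mean = 50, sd = 10),
                    label = factor(rep(c("no", "yes"), 50)))
test  <- data.frame(x = rnorm(20, mean = 50, sd = 10),
                    label = factor(rep(c("no", "yes"), 10)))

# prep() estimates the mean and standard deviation from train
norm_rec <- recipe(label ~ ., data = train) %>%
  step_normalize(all_numeric()) %>%
  prep(training = train)

# bake() applies those training statistics to the test data
test_baked <- bake(norm_rec, new_data = test)

# Same transformation by hand, using the training mean and sd
by_hand <- (test$x - mean(train$x)) / sd(train$x)
all.equal(test_baked$x, by_hand)
```

As a consequence, a baked test column will usually have a mean near 0 and a standard deviation near 1, but not exactly those values.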

Let's practice!
