Modeling with tidymodels in R
David Svancer
Data Scientist
Correlation measures the strength of a linear relationship between two numeric variables
Highly correlated predictors have a correlation near -1 or 1, a condition known as multicollinearity
ggplot(leads_training,
       aes(x = pages_per_visit, y = total_clicks)) +
  geom_point() +
  labs(title = 'Total Clicks vs Average Page Visits',
       y = 'Total Clicks', x = 'Average Pages per Visit')
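As a rough check on what a correlation near 1 means, the Pearson coefficient can be computed by hand and compared with cor(). The two vectors below are made-up illustration values, not rows from leads_training.

```r
# Hypothetical toy values (not taken from leads_training)
pages  <- c(1, 2, 3, 4, 5, 6)
clicks <- c(5, 9, 16, 19, 26, 30)

# Pearson correlation: sum of cross-deviations divided by the
# square root of the product of the summed squared deviations
r_manual <- sum((pages - mean(pages)) * (clicks - mean(clicks))) /
  sqrt(sum((pages - mean(pages))^2) * sum((clicks - mean(clicks))^2))

# Built-in equivalent
r_builtin <- cor(pages, clicks)

# The two results should agree up to floating point error
all.equal(r_manual, r_builtin)
```

Because clicks rises almost in a straight line with pages, the coefficient lands close to 1, which is the pattern the scatter plot is meant to reveal.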
Calculate a correlation matrix
Pass the data to the select_if() function with is.numeric as the argument to keep only numeric columns
Pass the result to the cor() function
leads_training %>%
  select_if(is.numeric) %>%
  cor()
                total_visits total_time pages_per_visit total_clicks
total_visits            1.00       0.01            0.43         0.42
total_time              0.01       1.00            0.02         0.01
pages_per_visit         0.43       0.02            1.00         0.96
total_clicks            0.42       0.01            0.96         1.00
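Reading a larger matrix by eye gets tedious, so here is a minimal sketch of flagging the pairs above a cutoff programmatically. It rebuilds the matrix from the rounded values printed above; the 0.9 cutoff and the object names (cor_mat, flagged) are choices made for this illustration.

```r
# Rebuild the correlation matrix from the rounded values printed above
vars <- c("total_visits", "total_time", "pages_per_visit", "total_clicks")
cor_mat <- matrix(c(1.00, 0.01, 0.43, 0.42,
                    0.01, 1.00, 0.02, 0.01,
                    0.43, 0.02, 1.00, 0.96,
                    0.42, 0.01, 0.96, 1.00),
                  nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Restrict to the upper triangle so each pair is reported once
# and the diagonal (always 1) is ignored
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)

flagged <- data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                      var2 = colnames(cor_mat)[high[, "col"]],
                      correlation = cor_mat[high])
flagged
```

With these values the only flagged pair is pages_per_visit and total_clicks, at 0.96.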
Removing multicollinearity with recipes
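Create a recipe object with the recipe() function
Add the step_corr() function, listing the numeric predictors to check
Set threshold to the absolute correlation above which a predictor is removed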
leads_cor_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(total_visits, total_time, pages_per_visit, total_clicks, threshold = 0.9)
leads_cor_rec
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on total_visits,..., total_clicks
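all_outcomes() selects the outcome variable
all_numeric() selects all numeric variables, which includes the outcome if it is numeric
To select numeric predictors for recipe steps, pass all_numeric() to step_*() functions, followed by -all_outcomes() to exclude the outcome
In leads_training the outcome, purchased, is a factor, so all_numeric() alone already selects only predictors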
leads_cor_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9)
leads_cor_rec
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on all_numeric()
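The -all_outcomes() selector matters most when the outcome itself is numeric. The sketch below uses a simulated data frame (toy, with columns x1, x2, x3, and y, all hypothetical) in which the outcome is strongly correlated with a predictor. It assumes the recipes package is installed and that the pipe is available after loading it.

```r
library(recipes)

# Simulated data for illustration only
set.seed(42)
n  <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.05),     # nearly a copy of x1
  x3 = rnorm(n),                     # unrelated predictor
  y  = 2 * x1 + rnorm(n, sd = 0.05)  # numeric outcome, highly correlated with x1
)

# Exclude the outcome so the correlation filter only considers predictors
toy_rec <- recipe(y ~ ., data = toy) %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9)

toy_baked <- toy_rec %>%
  prep(training = toy) %>%
  bake(new_data = toy)

names(toy_baked)
```

The outcome y stays in the result even though it correlates strongly with x1, while one of the near-duplicate predictors x1 and x2 is filtered out.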
Train the recipe with prep(), using leads_training for training
Apply the trained recipe to new data with bake()
pages_per_visit is removed from leads_test
pages_per_visit will be removed from all future data as well
leads_cor_rec %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)
# A tibble: 332 x 6
  total_visits total_time total_clicks ... purchased
         <dbl>      <dbl>        <dbl> ... <fct>
1            8        100           24 ... no
2            4       1346           22 ... yes
3            3        176           27 ... no
4            2         16           12 ... no
5            9       1022           12 ... yes
# ... with 327 more rows
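To make the train-then-apply idea concrete, here is a conceptual sketch in base R. It uses a simplified rule (drop the second variable of each highly correlated pair), which is an assumption for illustration and not the rule step_corr() itself applies. The data frames and the make_data() helper are hypothetical.

```r
# Simulated training and test sets (hypothetical helper and columns)
set.seed(1)
make_data <- function(n) {
  a <- rnorm(n)
  data.frame(a = a, b = a + rnorm(n, sd = 0.05), c = rnorm(n))
}
train <- make_data(100)
test  <- make_data(40)

# "prep" stage: decide which columns to drop using the training data only
cor_train <- abs(cor(train))
pairs     <- which(cor_train > 0.9 & upper.tri(cor_train), arr.ind = TRUE)
to_drop   <- unique(colnames(cor_train)[pairs[, "col"]])

# "bake" stage: remove the same columns from any new data,
# without recomputing correlations on that data
test_filtered <- test[, setdiff(names(test), to_drop), drop = FALSE]
names(test_filtered)
```

The decision about which columns to drop is fixed once on the training data, which is why the same column disappears from the test set and from any later data.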
Centering and scaling numeric variables
The total_time variable in leads_training
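Centering subtracts the mean and scaling divides by the standard deviation, so each value is expressed in standard deviation units. A minimal sketch with made-up values (not taken from leads_training):

```r
# Hypothetical time values in seconds, for illustration only
total_time <- c(120, 1500, 300, 45, 900)

# Center: subtract the mean
centered <- total_time - mean(total_time)

# Scale: divide by the standard deviation
z_manual <- centered / sd(total_time)

# Base R's scale() performs the same two operations by default
z_scale <- as.numeric(scale(total_time))

all.equal(z_manual, z_scale)
```

After the transformation the values have mean 0 and standard deviation 1, which puts variables measured in different units on a comparable footing.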
Normalizing numeric predictors with recipes
Add the step_normalize() function to center and scale numeric variables
Pass the all_numeric() selector to normalize every numeric column
Multiple step_*() functions can be added to a recipe
leads_norm_rec <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  step_normalize(all_numeric())
leads_norm_rec
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()
pages_per_visit is removed and numeric predictors are normalized
leads_norm_rec %>%
  prep(training = leads_training) %>%
  bake(new_data = leads_test)
# A tibble: 332 x 6
  total_visits total_time total_clicks lead_source    us_location purchased
         <dbl>      <dbl>        <dbl> <fct>          <fct>       <fct>
1        0.864     -0.984       -0.360 direct_traffic west        no
2       -0.151       1.33       -0.506 direct_traffic northeast   yes
3       -0.405     -0.843       -0.140 organic_search west        no
4       -0.659      -1.14        -1.24 email          midwest     no
5         1.12      0.725        -1.24 direct_traffic west        yes
# ... with 327 more rows
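The baked test values are normalized with the means and standard deviations estimated from the training data during prep(), not with statistics from the test set. The base R sketch below imitates that behavior on simulated vectors (train_time and test_time are hypothetical); it is a conceptual illustration rather than the recipes implementation.

```r
# Simulated training and test values, for illustration only
set.seed(123)
train_time <- rnorm(500, mean = 600, sd = 400)
test_time  <- rnorm(100, mean = 600, sd = 400)

# "prep" stage: estimate centering and scaling values from training data only
train_mean <- mean(train_time)
train_sd   <- sd(train_time)

# "bake" stage: apply the stored training statistics to new data
test_norm <- (test_time - train_mean) / train_sd

# Close to, but not exactly, mean 0 and sd 1
c(mean = mean(test_norm), sd = sd(test_norm))
```

Because the statistics come from the training data, the normalized test values will generally not have a mean of exactly 0 or a standard deviation of exactly 1, and that is expected.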