Modeling with tidymodels in R
David Svancer
Data Scientist
Create training and test datasets
initial_split()
training()
testing()
leads_split <- initial_split(leads_df, strata = purchased)
leads_training <- leads_split %>% training()
leads_test <- leads_split %>% testing()
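The split above relies on the defaults of initial_split(). The row counts printed later (996 training rows, 332 test rows) are consistent with the default prop = 0.75. The sketch below makes that argument explicit and sets a seed so the split can be reproduced. The toy_leads data frame and its columns are invented stand-ins for leads_df, which is not shown in these slides.

```r
library(rsample)

# Hypothetical stand-in for leads_df: 200 rows with a binary outcome
set.seed(123)
toy_leads <- data.frame(
  total_visits = rpois(200, lambda = 5),
  purchased = factor(
    sample(c("yes", "no"), 200, replace = TRUE, prob = c(0.35, 0.65)),
    levels = c("yes", "no")
  )
)

# prop = 0.75 is the default; it is written out here for clarity.
# strata keeps the outcome proportions similar in both pieces.
toy_split <- initial_split(toy_leads, prop = 0.75, strata = purchased)
toy_training <- training(toy_split)
toy_test <- testing(toy_split)

# The two pieces together contain every original row exactly once
nrow(toy_training) + nrow(toy_test)

# Outcome proportions in each piece, for comparison
prop.table(table(toy_training$purchased))
prop.table(table(toy_test$purchased))
```

Calling set.seed() before initial_split() is what makes the same rows land in the training set on every run.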
Specify the model with parsnip
logistic_reg()
set_engine()
set_mode()
purchased is a nominal outcome variable
logistic_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')
Logistic Regression Model Specification (classification)
Computational engine: glm
Specify feature engineering steps with recipes
recipe()
step_*()
leads_recipe <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())
leads_recipe
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()
Train the feature engineering steps on the training data
prep()
Pass the recipe to prep()
Use leads_training as the training data
leads_recipe_prep <- leads_recipe %>%
  prep(training = leads_training)
leads_recipe_prep
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Training data contained 996 data points
and no missing data.
Operations:
Correlation filter removed pages_per_visit [trained]
Centering and scaling for total_visits ... [trained]
Dummy variables from lead_source, us_location [trained]
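The trained output above reports that the correlation filter removed one predictor. One way to picture what such a filter looks for is a pairwise check of absolute correlations against the 0.9 threshold. The sketch below uses base R on invented data (the columns visits, pages and time are hypothetical) and only flags the offending pair. It is an illustration of the idea, not the exact selection rule that step_corr() applies.

```r
# Invented data: pages is built to be almost a multiple of visits
set.seed(42)
visits <- rnorm(100, mean = 5)
toy <- data.frame(
  visits = visits,
  pages = visits * 2 + rnorm(100, sd = 0.1),
  time = rnorm(100)
)

# Absolute pairwise correlations between the numeric columns
cor_mat <- abs(cor(toy))

# Keep only the lower triangle so each pair is considered once
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- 0

# Pairs whose absolute correlation exceeds the threshold
high <- which(cor_mat > 0.9, arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]]
)
flagged
```

Dropping one column from each flagged pair leaves predictors that carry less redundant information, which is the purpose of the filter in the recipe.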
Apply the trained recipe to the training data and save the results for model fitting
leads_training_prep <- leads_recipe_prep %>%
  bake(new_data = NULL)
leads_training_prep
# A tibble: 996 x 11
total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.611 0.958 ... 0 0 ... 1
2 0.103 -0.747 ... 1 0 ... 0
3 0.611 -0.278 ... 0 1 ... 1
4 -0.151 -0.842 ... 0 0 ... 1
5 -0.659 1.19 ... 1 0 ... 0
# ... with 991 more rows
Apply the trained recipe to the test data and save the results for model evaluation
leads_test_prep <- leads_recipe_prep %>%
  bake(new_data = leads_test)
leads_test_prep
# A tibble: 332 x 11
total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.864 -0.984 ... 0 0 ... 1
2 -0.151 1.33 ... 0 0 ... 0
3 -0.405 -0.843 ... 0 1 ... 1
4 -0.659 -1.14 ... 1 0 ... 0
5 1.12 0.725 ... 0 0 ... 1
# ... with 327 more rows
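Because the recipe is trained once and then applied to both datasets, the test data are transformed with parameters estimated from the training data only. The following base R sketch illustrates that idea for centering and scaling. The two vectors are hypothetical, and the two stages are only an analogy for prep() and bake(), not their implementation.

```r
# Hypothetical numeric predictor in the training and test sets
train_visits <- c(2, 4, 6, 8, 10)
test_visits <- c(4, 12)

# "prep" stage: estimate the parameters from the training data only
train_mean <- mean(train_visits)
train_sd <- sd(train_visits)

# "bake" stage: apply the stored parameters to any dataset
train_scaled <- (train_visits - train_mean) / train_sd
test_scaled <- (test_visits - train_mean) / train_sd

train_scaled
test_scaled
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, which is expected: recomputing the statistics on the test set would leak information from it into the preprocessing.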
Train the logistic regression model with fit()
Uses leads_training_prep
Obtain predictions with predict()
Uses leads_test_prep
logistic_fit <- logistic_model %>%
  fit(purchased ~ .,
      data = leads_training_prep)
class_preds <- predict(logistic_fit,
                       new_data = leads_test_prep,
                       type = 'class')
prob_preds <- predict(logistic_fit,
                      new_data = leads_test_prep,
                      type = 'prob')
Combine the predictions into a results dataset for yardstick metrics
Select purchased from the test dataset
bind_cols()
leads_results <- leads_test %>%
  select(purchased) %>%
  bind_cols(class_preds, prob_preds)
leads_results
# A tibble: 332 x 4
purchased .pred_class .pred_yes .pred_no
<fct> <fct> <dbl> <dbl>
1 no no 0.257 0.743
2 yes yes 0.896 0.104
3 no no 0.0852 0.915
4 no no 0.183 0.817
5 yes yes 0.776 0.224
# ... with 327 more rows
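In the rows shown above, .pred_class matches the outcome with the larger predicted probability. A minimal base R sketch, assuming the class is obtained by a 0.5 cutoff on .pred_yes, reproduces those five labels from the printed probabilities. The vector names are hypothetical.

```r
# .pred_yes values copied from the first five rows of leads_results
pred_yes <- c(0.257, 0.896, 0.0852, 0.183, 0.776)

# Assumed rule: predict "yes" when its probability exceeds 0.5
pred_class <- factor(
  ifelse(pred_yes > 0.5, "yes", "no"),
  levels = c("yes", "no")
)
pred_class
```

Keeping the probabilities alongside the class predictions makes it possible to try other cutoffs later without refitting the model.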
Evaluate model performance with yardstick
Use yardstick for evaluation
leads_results %>%
  conf_mat(truth = purchased,
           estimate = .pred_class)
          Truth
Prediction yes  no
       yes  77  34
       no   43 178
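The four counts in the confusion matrix are enough to derive the usual summary metrics by hand. The sketch below does the arithmetic in base R, treating "yes" as the positive class, which is an assumption based on the column order of the printed matrix. yardstick also provides accuracy(), sens() and spec(), which take the same truth and estimate arguments as conf_mat().

```r
# Counts copied from the confusion matrix (rows: prediction, columns: truth)
cm <- matrix(
  c(77, 43, 34, 178),
  nrow = 2,
  dimnames = list(Prediction = c("yes", "no"), Truth = c("yes", "no"))
)

# Share of all test rows that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual "yes" rows that were predicted "yes"
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])

# Share of actual "no" rows that were predicted "no"
specificity <- cm["no", "no"] / sum(cm[, "no"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The counts sum to 332, the number of rows in the test set, so every test row appears in exactly one cell.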