Modeling with tidymodels in R
David Svancer
Data Scientist
Creating training and test datasets
Functions used: initial_split(), training(), testing()
leads_split <- initial_split(leads_df, strata = purchased)
leads_training <- leads_split %>% training()
leads_test <- leads_split %>% testing()
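To make the effect of strata = purchased concrete, here is a minimal base R sketch of a stratified split. It does not call rsample, and rsample's own implementation may differ in detail. The data frame toy_df, its values, and the helper name idx_by_class are illustrative assumptions. The 75% proportion matches initial_split()'s default prop of 3/4 and is consistent with the 996 training rows and 332 test rows reported for leads_df.

```r
# Hypothetical toy data standing in for leads_df (not the course dataset)
set.seed(42)
toy_df <- data.frame(
  purchased = factor(rep(c("yes", "no"), times = c(40, 60)),
                     levels = c("yes", "no")),
  total_visits = rpois(100, lambda = 4)
)

# Group row indices by outcome class, then sample 75% within each class
idx_by_class <- split(seq_len(nrow(toy_df)), toy_df$purchased)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

toy_training <- toy_df[train_idx, ]
toy_test <- toy_df[-train_idx, ]

# Class proportions in each piece; both keep the original 40/60 balance
train_props <- prop.table(table(toy_training$purchased))
test_props <- prop.table(table(toy_test$purchased))
```

Sampling within each class is what keeps the share of 'yes' outcomes the same in the training and test pieces, which matters when one class is much rarer than the other.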
Specify the model with parsnip
Functions used: logistic_reg(), set_engine(), set_mode()
purchased is a nominal outcome variable, so the mode is set to 'classification'
logistic_model <- logistic_reg() %>%
  set_engine('glm') %>%
  set_mode('classification')
Logistic Regression Model Specification (classification)
Computational engine: glm
Define feature engineering steps with recipes
Functions used: recipe(), step_*()
leads_recipe <- recipe(purchased ~ ., data = leads_training) %>%
  step_corr(all_numeric(), threshold = 0.9) %>%
  step_normalize(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes())
leads_recipe
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Operations:
Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()
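The idea behind the correlation filter can be illustrated in base R. This sketch only flags predictor pairs whose absolute correlation exceeds the 0.9 threshold used in the recipe; how step_corr() decides which column of a pair to drop is not reproduced here. The data frame toy_predictors and its columns x1, x2, x3 are hypothetical.

```r
# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)
x3 <- rnorm(200)
toy_predictors <- data.frame(x1, x2, x3)

# Absolute pairwise correlations between the numeric predictors
cor_mat <- abs(cor(toy_predictors))

# Look only above the diagonal so each pair is reported once
high_pairs <- which(cor_mat > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)

# Names of the flagged pair(s)
flagged <- data.frame(
  var_a = rownames(cor_mat)[high_pairs[, "row"]],
  var_b = colnames(cor_mat)[high_pairs[, "col"]]
)
flagged
```

Only the x1/x2 pair is flagged, so a filter with this threshold would remove one of those two columns and leave x3 alone.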
Train feature engineering steps on the training data
Pass the recipe to prep(), supplying leads_training as the training data
leads_recipe_prep <- leads_recipe %>%
  prep(training = leads_training)
leads_recipe_prep
Data Recipe
Inputs:
role #variables
outcome 1
predictor 6
Training data contained 996 data points
and no missing data.
Operations:
Correlation filter removed pages_per_visit [trained]
Centering and scaling for total_visits ... [trained]
Dummy variables from lead_source, us_location [trained]
Apply the trained recipe to the training data and save the results for modeling
leads_training_prep <- leads_recipe_prep %>%
  bake(new_data = NULL)
leads_training_prep
# A tibble: 996 x 11
total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.611 0.958 ... 0 0 ... 1
2 0.103 -0.747 ... 1 0 ... 0
3 0.611 -0.278 ... 0 1 ... 1
4 -0.151 -0.842 ... 0 0 ... 1
5 -0.659 1.19 ... 1 0 ... 0
# ... with 991 more rows
Apply the trained recipe to the test data and save the results for model evaluation
leads_test_prep <- leads_recipe_prep %>%
bake(new_data = leads_test)
leads_test_prep
# A tibble: 332 x 11
total_visits total_time ... lead_source_email lead_source_organic_search ... us_location_west
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.864 -0.984 ... 0 0 ... 1
2 -0.151 1.33 ... 0 0 ... 0
3 -0.405 -0.843 ... 0 1 ... 1
4 -0.659 -1.14 ... 1 0 ... 0
5 1.12 0.725 ... 0 0 ... 1
# ... with 327 more rows
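bake() applies the centering and scaling values that prep() estimated from leads_training; it does not re-estimate them from the test data. A minimal base R sketch of that idea follows. The vectors train_visits and test_visits are hypothetical values, not taken from leads_df.

```r
# Hypothetical raw values for one numeric predictor
train_visits <- c(2, 4, 6, 8)
test_visits <- c(4, 10)

# "prep" stage: estimate the statistics from the training data only
center <- mean(train_visits)
scale_sd <- sd(train_visits)

# "bake" stage: apply the same training statistics to both datasets
train_scaled <- (train_visits - center) / scale_sd
test_scaled <- (test_visits - center) / scale_sd

train_scaled
test_scaled
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, which is expected: the test data is treated as unseen and plays no part in estimating the transformation.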
Train the logistic regression model with fit()
Training data: leads_training_prep
Obtain model predictions with predict()
New data: leads_test_prep
logistic_fit <- logistic_model %>%
  fit(purchased ~ .,
      data = leads_training_prep)
class_preds <- predict(logistic_fit,
new_data = leads_test_prep,
type = 'class')
prob_preds <- predict(logistic_fit,
new_data = leads_test_prep,
type = 'prob')
Combine predictions into a results dataset for yardstick metric functions
Select purchased from the test data, then attach the predictions with bind_cols()
leads_results <- leads_test %>%
  select(purchased) %>%
  bind_cols(class_preds, prob_preds)
leads_results
# A tibble: 332 x 4
purchased .pred_class .pred_yes .pred_no
<fct> <fct> <dbl> <dbl>
1 no no 0.257 0.743
2 yes yes 0.896 0.104
3 no no 0.0852 0.915
4 no no 0.183 0.817
5 yes yes 0.776 0.224
# ... with 327 more rows
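In the five rows displayed above, .pred_class agrees with a 0.5 cutoff on .pred_yes, and the two probability columns sum to 1 up to rounding. The base R sketch below checks both using the printed values. The 0.5 cutoff is an assumption about how the class predictions were derived, not something the slides state, and the variable names are illustrative.

```r
# Values copied from the first five rows of leads_results
pred_yes <- c(0.257, 0.896, 0.0852, 0.183, 0.776)
pred_no <- c(0.743, 0.104, 0.915, 0.817, 0.224)
pred_class_shown <- c("no", "yes", "no", "no", "yes")

# Assumed rule: predict 'yes' when the estimated probability is at least 0.5
pred_class_derived <- ifelse(pred_yes >= 0.5, "yes", "no")

# TRUE if the assumed rule reproduces the printed class predictions
classes_match <- identical(pred_class_derived, pred_class_shown)

# TRUE if each row's probabilities sum to 1, allowing for display rounding
probs_sum_to_one <- all(abs(pred_yes + pred_no - 1) < 0.001)
```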
Evaluate model performance with yardstick
Use yardstick functions for model evaluation
leads_results %>%
  conf_mat(truth = purchased,
           estimate = .pred_class)
Truth
Prediction yes no
yes 77 34
no 43 178
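Common classification metrics follow directly from the four cells of this confusion matrix. The base R sketch below works them out by hand, assuming 'yes' is treated as the positive class; the variable names are illustrative.

```r
# Cells of the confusion matrix (rows = Prediction, columns = Truth)
tp <- 77   # predicted yes, truth yes
fp <- 34   # predicted yes, truth no
fn <- 43   # predicted no,  truth yes
tn <- 178  # predicted no,  truth no

total <- tp + fp + fn + tn            # 332 test rows

accuracy <- (tp + tn) / total         # about 0.768
sensitivity <- tp / (tp + fn)         # about 0.642
specificity <- tn / (tn + fp)         # about 0.840

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

In practice these numbers would normally come from yardstick functions such as accuracy(), sens(), and spec() applied to leads_results with truth = purchased and estimate = .pred_class; the hand calculation is only meant to show where they come from.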