Feature Engineering in R
Jorge Zazueta
Research Professor and Head of the Modeling Group at the School of Economics, UASLP
Domain knowledge helps us identify and create features that are relevant to a model or task.
Feature engineering creates new input features from existing ones.
Examples of domain knowledge:
We want to predict hotel cancellations using the following feature vector:
features <-
c("IsCanceled", "LeadTime",
"arrival_date",
"StaysInWeekendNights",
"StaysInWeekNights",
"PreviousCancellations",
"PreviousBookingsNotCanceled",
"ReservedRoomType",
"AssignedRoomType","BookingChanges",
"DepositType","CustomerType",
"ADR","TotalOfSpecialRequests")
Features from raw data
We can generate informative features from arrival_date.
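As a rough illustration of the manual approach, the sketch below derives a few calendar features from a date column using only base R. The `bookings` data frame, its three dates, and the new column names are made up for this example and are not part of the course data.

```r
# Hypothetical toy data; the real cancellations data set is not reproduced here.
bookings <- data.frame(
  arrival_date = as.Date(c("2016-07-01", "2016-12-25", "2017-02-14"))
)

# Day of week as a number (1 = Monday, ..., 7 = Sunday); locale independent.
bookings$arrival_dow <- as.integer(format(bookings$arrival_date, "%u"))

# Month as a number (1-12).
bookings$arrival_month <- as.integer(format(bookings$arrival_date, "%m"))

# Week of the year, counted in 7-day blocks from January 1st.
day_of_year <- as.integer(format(bookings$arrival_date, "%j"))
bookings$arrival_week <- (day_of_year - 1L) %/% 7L + 1L

# A single hand-written holiday flag; every further holiday needs its own rule.
bookings$is_christmas <- format(bookings$arrival_date, "%m-%d") == "12-25"

print(bookings)
```

Each new feature requires another hand-written line, and the same lines would have to be repeated for any new data that needs the same treatment.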

But this quickly becomes tedious. We need to automate!
We will use a workflow based on tidymodels, a collection of packages for modeling and machine learning that follows tidyverse principles (1), with an emphasis on feature engineering.

To learn more: www.tidymodels.org
Let's start by preparing the data.
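library(tidymodels)  # attaches dplyr, rsample, recipes, parsnip, workflows, yardstick
# Convert every character column to a factor.
cancellations <-
  cancellations %>%
  mutate(across(where(is.character), as.factor))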
set.seed(123)
split <- cancellations %>%
initial_split(
strata = "IsCanceled")
train <- training(split)
test <- testing(split)
The prop argument sets the share of the data assigned to the training set (default 3/4).
initial_split(data, prop = 3/4, strata = NULL)
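To make the effect of `prop` concrete, here is a minimal sketch on a made-up data frame. The `toy` data, its column names, and the value 0.8 are assumptions chosen for illustration only.

```r
library(rsample)

set.seed(123)

# Hypothetical data: 100 rows with a balanced two-level outcome.
toy <- data.frame(
  x = 1:100,
  y = factor(rep(c("a", "b"), each = 50))
)

# Put roughly 80% of the rows in the training set, stratified by the outcome.
split_80 <- initial_split(toy, prop = 0.8, strata = y)

n_train <- nrow(training(split_80))
n_test  <- nrow(testing(split_80))

c(train = n_train, test = n_test)
```

With stratification the sizes are approximate, because the proportion is applied within each stratum rather than to the data set as a whole.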
Check that the train and test sets have similar proportions of canceled bookings.
train %>%
select(IsCanceled) %>% table() %>%
prop.table()
IsCanceled
0 1
0.5826946 0.4173054
test %>%
select(IsCanceled) %>% table() %>%
prop.table()
IsCanceled
0 1
0.5827788 0.4172212
Declare the model
lr_model <- logistic_reg()
Create a recipe
lr_recipe <-
recipe(IsCanceled ~., data = train) %>%
update_role(Agent, new_role = "ID" ) %>%
step_date(arrival_date,
features = c("dow", "week", "month")) %>%
step_holiday(arrival_date,
holidays = timeDate::listHolidays("US")) %>%
step_rm(arrival_date) %>%
step_dummy(all_nominal_predictors())
Print lr_recipe
Recipe
Inputs:
role #variables
ID 1
outcome 1
predictor 13
Operations:
Date features from arrival_date
Holiday features from arrival_date
Variables removed arrival_date
Dummy variables from all_nominal_predictors()
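Printing a recipe only lists the planned operations. To see the columns they produce, one option is to estimate the recipe with `prep()` and apply it with `bake()`. The sketch below does this on a small made-up data frame with a reduced set of steps; `toy`, `toy_recipe`, and `baked` are hypothetical names, and the holiday step is left out to keep the example short.

```r
library(tidymodels)

# Hypothetical toy data with an outcome, a date, and one nominal predictor.
toy <- data.frame(
  IsCanceled   = factor(c(0, 1, 0, 1)),
  arrival_date = as.Date(c("2016-07-04", "2016-12-25",
                           "2017-02-14", "2017-03-01")),
  DepositType  = factor(c("No Deposit", "Non Refund",
                          "No Deposit", "Non Refund"))
)

toy_recipe <-
  recipe(IsCanceled ~ ., data = toy) %>%
  step_date(arrival_date, features = c("dow", "month")) %>%  # add calendar features
  step_rm(arrival_date) %>%                                  # drop the raw date
  step_dummy(all_nominal_predictors())                       # encode factors as 0/1

# prep() estimates the steps on the data; bake(new_data = NULL) returns
# the processed version of that same data.
baked <- toy_recipe %>% prep() %>% bake(new_data = NULL)

names(baked)
```

After baking, the raw date column is gone and every predictor is numeric, which is the form the logistic regression model needs.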
Bundle the model and the recipe into a workflow object.
lr_workflow <-
workflow()%>%
add_model(lr_model)%>%
add_recipe(lr_recipe)
Fit the workflow
lr_fit <-
lr_workflow %>%
fit(data = train)
We can summarize the model with tidy(lr_fit).
# A tibble: 65 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.92 0.228 -8.43 3.57e- 17
2 LeadTime 0.00414 0.000268 15.4 1.16e- 53
3 StaysInWeekendNights 0.0860 0.0382 2.25 2.45e- 2
4 StaysInWeekNights 0.0804 0.0185 4.34 1.40e- 5
5 PreviousCancellations 2.39 0.147 16.2 2.45e- 59
6 PreviousBookingsNotCanceled -0.440 0.0450 -9.77 1.45e- 22
7 BookingChanges -0.449 0.0463 -9.69 3.18e- 22
8 ADR 0.0104 0.000782 13.2 4.85e- 40
9 TotalOfSpecialRequests -0.727 0.0316 -23.0 5.29e-117
10 arrival_date_week 0.0245 0.0171 1.43 1.53e- 1
# … with 55 more rows
# ℹ Use `print(n = ...)` to see more rows
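Logistic regression estimates are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below does this for three of the estimates printed above, copied by hand into a vector named `coefs` (a hypothetical name). It assumes the model treats `1` (canceled) as the event, which is how a binomial glm handles the second level of a two-level factor outcome.

```r
# Estimates copied from the tidy() output above (log-odds scale).
coefs <- c(
  LeadTime               = 0.00414,
  PreviousCancellations  = 2.39,
  TotalOfSpecialRequests = -0.727
)

# Odds ratio for a one-unit increase in each predictor.
odds_ratios <- exp(coefs)

round(odds_ratios, 3)
```

An odds ratio above 1 means the odds of cancellation rise as the predictor increases, and a value below 1 means they fall, holding the other predictors fixed.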
Now we can evaluate the model's performance.
lr_aug <- lr_fit %>% augment(test)
bind_rows(
lr_aug %>%
roc_auc(truth = IsCanceled,.pred_0),
lr_aug %>%
accuracy(truth = IsCanceled,.pred_class))
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.842
2 accuracy binary 0.782
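The call above passes `.pred_0` because yardstick, by default, treats the first factor level as the event of interest. The sketch below illustrates this on four made-up predictions; the `preds` data frame and its values are hypothetical.

```r
library(yardstick)

# Hypothetical predictions: high .pred_0 for true 0s, low .pred_0 for true 1s.
preds <- data.frame(
  truth   = factor(c(0, 0, 1, 1), levels = c(0, 1)),
  .pred_0 = c(0.9, 0.6, 0.4, 0.1)
)
preds$.pred_1 <- 1 - preds$.pred_0

# Default: the first level ("0") is the event, so supply its probability.
auc_first <- roc_auc(preds, truth = truth, .pred_0)

# Alternative: declare the second level ("1") as the event and supply .pred_1.
auc_second <- roc_auc(preds, truth = truth, .pred_1, event_level = "second")

auc_first$.estimate
auc_second$.estimate
```

Both calls describe the same ranking of observations, so they report the same area under the curve.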
lr_aug %>%
roc_curve(truth = IsCanceled, .pred_0) %>%
autoplot()
