Feature Engineering in R
Jorge Zazueta
Research Professor and Head of the Modeling Group at the School of Economics, UASLP
Domain knowledge enables us to identify and create relevant and useful features for a particular model or task.
Feature engineering is about creating new input features from existing ones.
Examples of domain knowledge:
We want to predict hotel cancellations based on the following feature vector:
features <-
c("IsCanceled", "LeadTime",
"arrival_date",
"StaysInWeekendNights",
"StaysInWeekNights",
"PreviousCancellations",
"PreviousBookingsNotCanceled",
"ReservedRoomType",
"AssignedRoomType","BookingChanges",
"DepositType","CustomerType",
"ADR","TotalOfSpecialRequests")
Features from raw data
We can generate informative features from arrival_date.
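As a rough illustration of doing this by hand, the sketch below derives a few calendar features from a date column using base R only. The toy bookings data frame and the new column names are assumptions made for the example, not part of the course data.

```r
# Hypothetical toy data; only the column name arrival_date comes from the text.
bookings <- data.frame(
  arrival_date = as.Date(c("2016-07-04", "2016-12-25", "2017-02-11"))
)

# Numeric codes avoid locale-dependent day and month names.
bookings$arrival_month <- as.integer(format(bookings$arrival_date, "%m"))
bookings$arrival_week  <- as.integer(format(bookings$arrival_date, "%V"))  # ISO 8601 week
bookings$arrival_dow   <- as.integer(format(bookings$arrival_date, "%u"))  # 1 = Monday, 7 = Sunday
bookings$is_weekend    <- bookings$arrival_dow >= 6

print(bookings)
```

Holiday indicators would additionally need a lookup table of dates.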
But this becomes tedious quickly. We need to automate it!
We'll use a workflow based on tidymodels, a collection of packages for modeling and machine learning using tidyverse principles, with emphasis on feature engineering.
We can learn more at www.tidymodels.org.
Let's start by getting our data ready.
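library(tidymodels)

# Convert character columns to factors
cancellations <-
  cancellations %>%
  mutate(across(where(is.character), as.factor))

# Stratified train/test split on the outcome
set.seed(123)
split <- cancellations %>%
  initial_split(
    strata = "IsCanceled")
train <- training(split)
test <- testing(split)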
The prop parameter can be used to change the train/test data split (the default is 3/4).
initial_split(data, prop = 3/4, strata = NULL)
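A small sketch of changing the split ratio, assuming the rsample package is installed. The toy data frame and the 80/20 ratio are made up for the example; only initial_split(), training(), and testing() come from the text above.

```r
library(rsample)

set.seed(123)

# Hypothetical toy data: 60 non-canceled and 40 canceled bookings.
toy <- data.frame(
  IsCanceled = factor(rep(c(0, 1), times = c(60, 40))),
  LeadTime   = rpois(100, lambda = 50)
)

# Put roughly 80% of the rows in the training set, stratified by the outcome.
split_80 <- initial_split(toy, prop = 0.8, strata = IsCanceled)

toy_train <- training(split_80)
toy_test  <- testing(split_80)

# Every row lands in exactly one of the two sets.
nrow(toy_train) + nrow(toy_test)
```

Stratifying by the outcome keeps the share of canceled bookings close to equal in both sets, which matters most when one class is rare.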
Verify that train and test sets exhibit similar proportions of canceled reservations.
train %>%
  select(IsCanceled) %>%
  table() %>%
  prop.table()

IsCanceled
        0         1
0.5826946 0.4173054

test %>%
  select(IsCanceled) %>%
  table() %>%
  prop.table()

IsCanceled
        0         1
0.5827788 0.4172212
Declare our model
lr_model <- logistic_reg()
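logistic_reg() with no arguments relies on parsnip's defaults. A more explicit declaration, sketched here on the assumption that the default glm engine is the one we want, spells out the engine and the mode. The name lr_model_explicit is made up for the example.

```r
library(tidymodels)

# Same kind of model as lr_model, with engine and mode written out.
lr_model_explicit <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_model_explicit
```

Writing the engine explicitly makes it easier to swap in another one later, for example a regularized fit, without touching the rest of the workflow.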
Build a recipe
lr_recipe <-
  recipe(IsCanceled ~ ., data = train) %>%
  update_role(Agent, new_role = "ID") %>%
  step_date(arrival_date,
            features = c("dow", "week", "month")) %>%
  step_holiday(arrival_date,
               holidays = timeDate::listHolidays("US")) %>%
  step_rm(arrival_date) %>%
  step_dummy(all_nominal_predictors())
Print lr_recipe
Recipe
Inputs:
role #variables
ID 1
outcome 1
predictor 13
Operations:
Date features from arrival_date
Holiday features from arrival_date
Variables removed arrival_date
Dummy variables from all_nominal_predictors()
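A recipe only describes the steps; nothing is computed until it is prepared. To inspect the columns such a recipe produces, one option is prep() followed by bake(). The sketch below uses a tiny made-up data frame and only two holidays, so it is an illustration of the mechanics rather than the course recipe itself.

```r
library(tidymodels)

# Hypothetical toy data with one date column and one nominal predictor.
toy <- data.frame(
  IsCanceled   = factor(c(0, 1, 0, 1)),
  arrival_date = as.Date(c("2016-07-04", "2016-12-25", "2017-02-11", "2017-03-01")),
  DepositType  = factor(c("No Deposit", "Non Refund", "No Deposit", "Non Refund"))
)

toy_recipe <-
  recipe(IsCanceled ~ ., data = toy) %>%
  step_date(arrival_date, features = c("dow", "month")) %>%
  step_holiday(arrival_date,
               holidays = c("USIndependenceDay", "ChristmasDay")) %>%
  step_rm(arrival_date) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates the steps on the data; bake(new_data = NULL) returns
# the processed version of that same data.
baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

The original arrival_date column is gone, replaced by holiday indicators and dummy columns for day of week, month, and deposit type.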
Bundle the model and the recipe into a workflow object.
lr_workflow <-
  workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)
Fit the workflow
lr_fit <-
  lr_workflow %>%
  fit(data = train)
We can use tidy(lr_fit) to summarize our model.
# A tibble: 65 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1.92 0.228 -8.43 3.57e- 17
2 LeadTime 0.00414 0.000268 15.4 1.16e- 53
3 StaysInWeekendNights 0.0860 0.0382 2.25 2.45e- 2
4 StaysInWeekNights 0.0804 0.0185 4.34 1.40e- 5
5 PreviousCancellations 2.39 0.147 16.2 2.45e- 59
6 PreviousBookingsNotCanceled -0.440 0.0450 -9.77 1.45e- 22
7 BookingChanges -0.449 0.0463 -9.69 3.18e- 22
8 ADR 0.0104 0.000782 13.2 4.85e- 40
9 TotalOfSpecialRequests -0.727 0.0316 -23.0 5.29e-117
10 arrival_date_week 0.0245 0.0171 1.43 1.53e- 1
# … with 55 more rows
# ℹ Use `print(n = ...)` to see more rows
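The estimates are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below copies three rounded estimates from the printed table into a plain vector, so the results are approximate. It also assumes the fit models the odds of the second factor level (1, canceled), which is how glm treats a two-level factor outcome.

```r
# Rounded estimates copied by hand from the tidy() output above.
coefs <- c(
  LeadTime               = 0.00414,
  PreviousCancellations  = 2.39,
  TotalOfSpecialRequests = -0.727
)

# Odds ratio per one-unit increase, holding the other predictors fixed.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

An odds ratio above 1 (PreviousCancellations, about 10.9) points toward cancellation, while one below 1 (TotalOfSpecialRequests) points away from it.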
We can now assess our model's performance.
lr_aug <- lr_fit %>% augment(test)
bind_rows(
  lr_aug %>%
    roc_auc(truth = IsCanceled, .pred_0),
  lr_aug %>%
    accuracy(truth = IsCanceled, .pred_class))
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.842
2 accuracy binary 0.782
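The roc_auc() call uses .pred_0 because yardstick, by default, treats the first factor level as the event of interest. A toy check, with made-up predictions, suggests that passing .pred_1 together with event_level = "second" yields the same area under the curve.

```r
library(yardstick)

# Hypothetical predicted probabilities for six bookings.
toy_preds <- data.frame(
  IsCanceled = factor(c(0, 0, 1, 1, 0, 1)),
  .pred_1    = c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90)
)
toy_preds$.pred_0 <- 1 - toy_preds$.pred_1

# Default: the first level ("0") is the event, so pass its probability.
auc_first <- roc_auc(toy_preds, truth = IsCanceled, .pred_0)

# Alternative: declare the second level ("1") as the event.
auc_second <- roc_auc(toy_preds, truth = IsCanceled, .pred_1,
                      event_level = "second")

c(auc_first$.estimate, auc_second$.estimate)
```

Passing the probability of the wrong class without adjusting event_level would report one minus the intended AUC, so a value well below 0.5 is worth a second look.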
lr_aug %>%
  roc_curve(truth = IsCanceled, .pred_0) %>%
  autoplot()