Feature engineering in R
Jorge Zazueta
Research Professor and Head of the Modeling Group at the School of Economics, UASLP
Omitting irrelevant or weakly informative variables can have benefits.
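Fitting a model with all features
# Recipe with every predictor; Loan_ID is kept as an identifier, not a predictor
lr_recipe_full <-
  recipe(Loan_Status ~ ., data = train) %>%
  update_role(Loan_ID, new_role = "ID")
# Bundle the model and the recipe
lr_workflow_full <-
  workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_full)
# Fit on the training data
lr_fit_full <-
  lr_workflow_full %>%
  fit(data = train)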
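The workflow above uses `lr_model`, which is defined outside this section. A minimal sketch of what such a specification might look like, assuming a plain logistic regression on the `glm` engine (the definition actually used in the course may differ):

```r
library(parsnip)

# Assumed model specification: logistic regression fitted with stats::glm.
# This is a guess at lr_model, which this section does not define.
lr_model <-
  logistic_reg() %>%
  set_engine("glm")

print(lr_model)
```

Any classification specification accepted by `add_model()` could be swapped in here without changing the rest of the workflow code.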
Plotting variable importance with vip
lr_fit_full %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "steelblue"))
Variable importance
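For (generalized) linear models, model-based importance scores of this kind are typically derived from the absolute value of each coefficient's test statistic. The sketch below illustrates that idea in base R on simulated data; the variables `x1`, `x2`, `x3` and the data itself are hypothetical stand-ins, not the loan data, and this is not a reimplementation of `vip()`.

```r
# Simulated data (assumption: stands in for the loan data)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)                       # unrelated to the outcome
y  <- rbinom(n, 1, plogis(2 * x1 + 0.5 * x2))

fit <- glm(y ~ x1 + x2 + x3, family = binomial)

# Absolute z-statistics of the coefficients, intercept dropped
z_stats    <- coef(summary(fit))[-1, "z value"]
importance <- sort(abs(z_stats), decreasing = TRUE)
print(importance)

# Names of the two highest-ranked features
top_features <- names(importance)[1:2]
```

A ranking like this is what motivates keeping only the highest-scoring features in the slides that follow.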

We can add features directly using base R formula syntax.
# Create the recipe
recipe_formula <-
  recipe(Loan_Status ~ Credit_History + Property_Area +
           LoanAmount, data = train)
# Bundle with the model
workflow_formula <-
  workflow() %>% add_model(lr_model) %>%
  add_recipe(recipe_formula)
You can use a features vector to select features before training.
# Features vector (predictors plus the outcome)
features <- c("Credit_History", "Property_Area", "LoanAmount", "Loan_Status")
# Training and test data
train_features <- train %>% select(all_of(features))
test_features <- test %>% select(all_of(features))
# Create the recipe and bundle it with the model
recipe_features <- recipe(Loan_Status ~ ., data = train_features)
workflow_features <- workflow() %>% add_model(lr_model) %>%
  add_recipe(recipe_features)
Augmented objects for both approaches
lr_aug_formula <-
  workflow_formula %>%
  fit(data = train) %>%
  augment(new_data = test)
lr_aug_features <-
  workflow_features %>%
  fit(data = train_features) %>%
  augment(new_data = test_features)
Both yield the same results
# all_equal() ignores row and column order by default
# (deprecated as of dplyr 1.1.0)
all_equal(lr_aug_features,
          lr_aug_formula %>%
            select(all_of(features),
                   starts_with(".pred")))
[1] TRUE
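The equivalence shown above follows from the fact that an explicit formula and a `~ .` formula on pre-selected columns describe the same model. A minimal base R sketch on simulated data, assuming hypothetical columns `x1`, `x2`, `x3` and `noise` rather than the loan data:

```r
# Simulated data (assumption: stands in for the loan data)
set.seed(42)
n <- 300
toy <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  x3    = rnorm(n),
  noise = rnorm(n)
)
toy$y <- factor(ifelse(rbinom(n, 1, plogis(toy$x1 - toy$x2)) == 1, "Y", "N"))

# Approach 1: name the predictors in the formula
fit_formula <- glm(y ~ x1 + x2 + x3, data = toy, family = binomial)

# Approach 2: select the columns first, then use every remaining predictor
toy_features <- c("x1", "x2", "x3", "y")
fit_features <- glm(y ~ ., data = toy[, toy_features], family = binomial)

# Coefficients and fitted probabilities should agree
same_coefs <- isTRUE(all.equal(coef(fit_formula), coef(fit_features)))
same_preds <- isTRUE(all.equal(
  unname(predict(fit_formula, type = "response")),
  unname(predict(fit_features, type = "response"))
))
print(c(same_coefs = same_coefs, same_preds = same_preds))
```

Which approach to prefer is mostly a matter of convenience: the formula keeps the selection next to the model, while the features vector is easier to generate programmatically.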
Using all features
lr_fit_full <-   # Fit the workflow
  lr_workflow_full %>%
  fit(data = train)
lr_aug_full <-   # Augment
  lr_fit_full %>%
  augment(test)
lr_aug_full %>%  # Evaluate
  class_evaluate(truth = Loan_Status,
                 estimate = .pred_class,
                 .pred_Y)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.842
2 roc_auc  binary         0.744
Using the top 3 features
lr_fit_formula <-   # Fit the workflow
  workflow_formula %>%
  fit(train)
lr_aug_formula <-   # Augment
  lr_fit_formula %>%
  augment(new_data = test)
lr_aug_formula %>%  # Evaluate
  class_evaluate(truth = Loan_Status,
                 estimate = .pred_class,
                 .pred_Y)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.842
2 roc_auc  binary         0.733
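The two evaluations above compare a full model with a reduced one on the same test set. The sketch below shows how such a comparison could be computed by hand in base R, with accuracy at a 0.5 threshold and ROC AUC via the rank-sum formula. The data, the `evaluate()` helper, and the variable names are hypothetical; this is not the `class_evaluate` object used in the slides.

```r
# Simulated data (assumption: stands in for the loan data)
set.seed(123)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.2 * d$x1 - 0.8 * d$x2 + 0.5 * d$x3))

train_idx <- sample(n, 450)
train_toy <- d[train_idx, ]
test_toy  <- d[-train_idx, ]

# Hypothetical helper: accuracy and ROC AUC on held-out data
evaluate <- function(fit, new_data) {
  p     <- predict(fit, newdata = new_data, type = "response")
  truth <- new_data$y
  acc   <- mean((p > 0.5) == (truth == 1))
  n1    <- sum(truth == 1)
  n0    <- sum(truth == 0)
  # AUC from the ranks of the predicted probabilities (Mann-Whitney form)
  auc   <- (sum(rank(p)[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  c(accuracy = acc, roc_auc = auc)
}

fit_full <- glm(y ~ ., data = train_toy, family = binomial)
fit_top3 <- glm(y ~ x1 + x2 + x3, data = train_toy, family = binomial)

results <- rbind(full = evaluate(fit_full, test_toy),
                 top3 = evaluate(fit_top3, test_toy))
print(round(results, 3))
```

When the reduced model scores close to the full one, as in the slides, the simpler model is usually the easier one to maintain and explain.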