Increasing the information content of raw data

Feature Engineering in R

Jorge Zazueta

Research Professor and Head of the Modeling Group at the School of Economics, UASLP

Dealing with raw data

A typical dataset with missing values

Table showing an example of missing data

Values as factors

Table showing an example of factor type data

Dealing with raw data

Dataset with imputed values

Table with no missing data after inputation

Factors represented as dummy variables

Factors represented as dummy variables

The loans dataset

# A tibble: 614 × 13
   Loan_ID  Gender Married Dependents Educa…¹ Self_…² Appli…³ Coapp…? LoanA…? Loan_…?
   <fct>    <fct>  <fct>   <fct>      <fct>   <fct>     <dbl>   <dbl>   <dbl>   <dbl>
 1 LP001002 Male   No      0          Gradua… No         5849       0      NA     360
 2 LP001003 Male   Yes     1          Gradua… No         4583    1508     128     360
 3 LP001005 Male   Yes     0          Gradua… Yes        3000       0      66     360
 4 LP001006 Male   Yes     0          Not Gr… No         2583    2358     120     360
 5 LP001008 Male   No      0          Gradua… No         6000       0     141     360
 6 LP001011 Male   Yes     2          Gradua… Yes        5417    4196     267     360
 7 LP001013 Male   Yes     0          Not Gr… No         2333    1516      95     360
 8 LP001014 Male   Yes     3+         Gradua… No         3036    2504     158     360
 9 LP001018 Male   Yes     2          Gradua… No         4006    1526     168     360
10 LP001020 Male   Yes     1          Gradua… No        12841   10968     349     360
# … with 604 more rows, 3 more variables: Credit_History <dbl>, Property_Area <fct>,
#   Loan_Status <fct>, and abbreviated variable names ¹?Education, ²?Self_Employed,
#   ³?ApplicantIncome, ??CoapplicantIncome, ??LoanAmount, ??Loan_Amount_Term
# ? Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Missing values

We can visually identify missing values in loans using vis_miss(loans)from the package naniar.

Graph showing missing values for the whole dataset.

Missing values

We can zoom the table by selecting only the columns with missing values.

loans %>% 
select(Gender, 
       Married, 
       Dependents,
       Self_Employed,
       LoanAmount,
       Loan_Amount_Term, 
       Credit_History) %>%
  vis_miss()

A closer view of missing values

Graph showing missing values for the selected features.

Missing values and dummy variables

We can address missing values and create dummy variables in the same recipe.

lr_recipe <- 
  recipe(Loan_Status ~., 
         data = train) %>%
  update_role(Loan_ID, 
              new_role = "ID" ) %>%
  step_impute_knn(all_predictors()) %>%
  step_dummy(all_nominal_predictors())

Print the recipe

lr_recipe

Recipe

Inputs:

      role #variables
        ID          1
   outcome          1
 predictor         30

Operations:

K-nearest neighbor imputation for all_predictors()
Dummy variables from all_nominal_predictors()

Finding the right recipe step

We can find other imputation methods and all recipe steps in the tidymodels documentations at www.tidymodels.org/find/recipes

Fitting and assessing our model

# Fit
lr_fit <- 
  lr_workflow %>% fit(data = train)
lr_aug <- 
  lr_fit %>% augment(test)
# Assess
lr_aug %>%
  roc_curve(truth = Loan_Status, .pred_N) %>%
  autoplot()

bind_rows(lr_aug %>%
            roc_auc(truth = Loan_Status,
                    .pred_N),
          lr_aug %>%
            accuracy(truth = Loan_Status,
                     .pred_class))

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 roc_auc  binary         0.738
2 accuracy binary         0.792

Let's practice!

Feature Engineering in R