Metode shrinkage

Rekayasa Fitur di R

Jorge Zazueta

Research Professor and Head of the Modeling Group at the School of Economics, UASLP

Dua teknik regularisasi umum

Lasso

  • Menambah penalti sebanding nilai absolut bobot model
  • Mendorong sebagian bobot menjadi tepat nol
  • Secara efektif menghapus fitur terkait
  • Dapat menjadi metode seleksi fitur otomatis

Ridge

  • Menambah penalti sebanding kuadrat bobot model
  • Tidak mengecilkan koefisien menjadi nol seperti Lasso
  • Namun efektif mengurangi overfitting
Rekayasa Fitur di R

Tinjauan awal Lasso

Mempersiapkan model

recipe <- # Define recipe
recipe(Loan_Status ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>%
  update_role(Loan_ID, new_role = "ID")
# set up model
model_lasso_manual <- logistic_reg() %>%
  set_engine("glmnet") %>%
  set_args(mixture = 1, penalty = .2)
# Bundle in workflow
workflow_lasso_manual <-
  workflow() %>%
  add_model(model_lasso_manual) %>%
  add_recipe(recipe)

Fit dan inspeksi

fit_lasso_manual <- # Fit workflow
  workflow_lasso_manual %>% 
  fit(train)
#Inspect coefficients
tidy(fit_lasso_manual)
# A tibble: 15 × 3
   term                    estimate penalty
   <chr>                      <dbl>   <dbl>
 1 (Intercept)               -0.816     0.2
 2 ApplicantIncome            0         0.2
 3 CoapplicantIncome          0         0.2
 4 LoanAmount                 0         0.2
 5 Loan_Amount_Term           0         0.2
 6 Credit_History            -0.220     0.2
 7 Gender_Female              0         0.2
 ...                          ...        ...
Rekayasa Fitur di R

Regresi logistik sederhana vs. Lasso

Bagan kolom regresi logistik sederhana vs. Lasso.

Rekayasa Fitur di R

Tuning hiperparameter

Menyetel model dengan tuning

model_lasso_tuned <- logistic_reg() %>%
  set_engine("glmnet") %>%
  set_args(mixture = 1, 
  penalty = tune()) 

workflow_lasso_tuned <-
  workflow() %>%
  add_model(model_lasso_tuned) %>%
  add_recipe(recipe)

penalty_grid <- grid_regular(
  penalty(range = c(-3, 1)),
  levels = 30)

Melihat output tuning

tune_output <- tune_grid( 
  workflow_lasso_tuned,
  resamples = vfold_cv(train, v = 5),
  metrics = metric_set(roc_auc),
  grid = penalty_grid)
autoplot(tune_output)

Grafik ROC_AUC vs. tingkat regularisasi.

Rekayasa Fitur di R

Menjelajahi hasil

Fitur terpilih otomatis

best_penalty <- 
select_by_one_std_err(tune_output, 
metric = 'roc_auc', desc(penalty)) 

# Fit model final
final_fit<- 
finalize_workflow(workflow_lasso_tuned, 
best_penalty) %>%
  fit(data = train)
final_fit_se %>% tidy()
# A tibble: 15 × 3
   term                    estimate penalty
   <chr>                      <dbl>   <dbl>
 1 (Intercept)               -0.660  0.0452
 2 ApplicantIncome            0      0.0452
 3 CoapplicantIncome          0      0.0452
 4 LoanAmount                 0      0.0452
 5 Loan_Amount_Term           0      0.0452
 6 Credit_History            -0.948  0.0452
 7 Gender_Female              0      0.0452
 8 Married_Yes               -0.191  0.0452
 9 Dependents_X1              0      0.0452
10 Dependents_X2              0      0.0452
11 Dependents_X3.             0      0.0452
12 Education_Not.Graduate     0      0.0452
13 Self_Employed_Yes          0      0.0452
14 Property_Area_Rural        0      0.0452
15 Property_Area_Semiurban   -0.163  0.0452
Rekayasa Fitur di R

Regresi logistik sederhana vs. Lasso yang dituning

Rekayasa Fitur di R

Regularisasi Ridge

Ridge saat mixture = 0

model_ridge_tuned <- logistic_reg() %>%
  set_engine("glmnet") %>%
  set_args(mixture = 0, penalty = tune()) 

workflow_ridge_tuned <-
  workflow() %>%
  add_model(model_ridge_tuned) %>%
  add_recipe(recipe)

tune_output <- tune_grid( 
  workflow_ridge_tuned,
  resamples = vfold_cv(train, v = 5),
  metrics = metric_set(roc_auc),
  grid = penalty_grid)
tune_output <- tune_grid( 
  workflow_ridge_tuned,
  resamples = vfold_cv(train, v = 5),
  metrics = metric_set(roc_auc),
  grid = penalty_grid)  
 autoplot(tune_output)

Rekayasa Fitur di R

Regularisasi Ridge

best_penalty <- 
select_by_one_std_err(tune_output,
metric = 'roc_auc', desc(penalty)) 
best_penalty

final_fit<- 
finalize_workflow(workflow_ridge_tuned, 
best_penalty) %>%
  fit(data = train)
tidy(final_fit)
# A tibble: 15 × 3
   term                      estimate penalty
   <chr>                        <dbl>   <dbl>
 1 (Intercept)             -0.799          10
 2 ApplicantIncome          0.00232        10
 3 CoapplicantIncome        0.0000537      10
 4 LoanAmount               0.00291        10
 5 Loan_Amount_Term         0.00161        10
 6 Credit_History          -0.0245         10
 7 Gender_Female            0.00850        10
 8 Married_Yes             -0.0140         10
 9 Dependents_X1            0.00497        10
10 Dependents_X2           -0.0100         10
11 Dependents_X3.           0.00259        10
12 Education_Not.Graduate   0.00308        10
13 Self_Employed_Yes        0.00892        10
14 Property_Area_Rural      0.0109         10
15 Property_Area_Semiurban -0.0139         10
Rekayasa Fitur di R

Ridge vs. Lasso

Rekayasa Fitur di R

Ayo berlatih!

Rekayasa Fitur di R

Preparing Video For Download...