Random forest models

Dimensionality Reduction in R

Matt Pickard

Owner, Pickard Predictives, LLC

Random Forest

An ensemble model
- a "wisdom of the crowds" approach
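Aggregates the predictions of many randomized decision trees
Errors of weakly correlated trees tend to cancel out when combined
Less prone to overfitting than a single decision tree
Often accurate with little tuning
Produces variable importance scores that can be used for feature selection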

A diagram showing an ensemble model consisting of several decision trees and how their votes are combined into one final vote.
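The voting idea in the diagram can be sketched in a few lines of base R. The tree votes and class labels below are hypothetical, and this only illustrates majority voting; it is not necessarily how the ranger engine combines its trees internally.

```r
# Hypothetical votes from five trees (rows) for three observations (columns)
votes <- rbind(
  tree1 = c("Good", "Poor", "Standard"),
  tree2 = c("Good", "Standard", "Standard"),
  tree3 = c("Poor", "Poor", "Standard"),
  tree4 = c("Good", "Poor", "Good"),
  tree5 = c("Good", "Standard", "Standard")
)

# Majority vote: pick the most frequent class among the trees
majority_vote <- function(x) names(which.max(table(x)))

# Apply the vote column by column, one result per observation
ensemble_pred <- apply(votes, 2, majority_vote)
ensemble_pred
```

Individual trees disagree on every observation, yet the combined vote settles on a single class for each one.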

Random Forest

This diagram shows how different trees are created using different random subsets of features.
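A rough sketch of the random feature subsets in the diagram, assuming one subset per tree for simplicity. Random forest implementations typically redraw the candidate features at each split rather than once per tree, and the subset size used here (the square root of the feature count) is a common convention for classification, not something stated on the slide.

```r
set.seed(42)  # make the random draws repeatable

# A few feature names taken from the credit data used later in this lesson
features <- c("outstanding_debt", "interest_rate", "delay_from_due_date",
              "changed_credit_limit", "credit_history_months", "num_credit_card")

n_trees <- 3                                  # hypothetical, tiny forest
subset_size <- floor(sqrt(length(features)))  # 2 features per subset

# Draw one random feature subset per tree, without replacement
feature_subsets <- lapply(seq_len(n_trees), function(i) {
  sample(features, subset_size)
})
feature_subsets
```

Because each tree sees different features, the trees make different mistakes, which is what makes aggregating them worthwhile.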

Train a Random Forest

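library(tidymodels)

# Specify a 200-tree classification forest; store impurity-based importance
rf <- rand_forest(mode = "classification",  trees = 200) %>% 
  set_engine("ranger", importance = "impurity")

# Fit the model using every predictor in the training data
rf_fit <- rf %>% 
  fit(credit_score ~ ., data = train)

# Attach the predicted class (.pred_class) to the test data
predict_df <- test %>%  
  bind_cols(predict = predict(rf_fit, test))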

Evaluate the Model

f_meas(predict_df, credit_score, .pred_class)

0.6895
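As a reminder of what the F-measure summarizes, here is a minimal base R sketch for a single class, using hypothetical confusion-matrix counts. For a multiclass outcome such as credit_score, yardstick's f_meas() by default computes this per class and macro-averages the results, so the number above is not reproduced by this snippet.

```r
# Hypothetical counts for one class
tp <- 60  # true positives
fp <- 20  # false positives
fn <- 30  # false negatives

precision <- tp / (tp + fp)  # share of predicted positives that are correct
recall    <- tp / (tp + fn)  # share of actual positives that were found

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
f1
```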

Variable Importance

library(vip)

rf_fit %>% vip()

A variable importance bar chart.

Feature Mask

top_features <- rf_fit %>% 
  vi(rank = TRUE) %>% 
  filter(Importance <= 10) %>% 
  pull(Variable)

top_features

 [1] "outstanding_debt"        "interest_rate"          
 [3] "delay_from_due_date"     "changed_credit_limit"   
 [5] "credit_history_months"   "num_credit_card"        
 [7] "monthly_balance"         "num_of_delayed_payment" 
 [9] "annual_income"           "amount_invested_monthly"
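The rank-then-filter logic above can be mimicked in base R. The importance scores below are made up, and num_bank_accounts and occupation are hypothetical feature names used only for illustration.

```r
# Hypothetical importance scores (higher means more important)
importance <- c(
  outstanding_debt  = 310.2,
  interest_rate     = 295.7,
  annual_income     = 120.4,
  num_bank_accounts = 88.1,
  occupation        = 15.3
)

# Convert scores to ranks, where rank 1 is the most important feature
ranks <- rank(-importance, ties.method = "first")

# Keep the features whose rank is within the top n
top_n <- 3
top <- names(ranks)[ranks <= top_n]
top
```

Filtering on rank rather than on the raw score keeps a fixed number of features regardless of the scale of the importance values.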

Reduce the data

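# Keep the target column so the model formula and f_meas() can still find it
train_reduced <- train[c(top_features, "credit_score")]
test_reduced <- test[c(top_features, "credit_score")]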

Performance

rf_fit <- rf %>% 
  fit(credit_score ~ ., data = train_reduced) 

predict_reduced_df <- test_reduced %>%  
  bind_cols(predict = predict(rf_fit, test_reduced)) 

f_meas(predict_reduced_df, credit_score, .pred_class)

0.6738

F-score of the unreduced model:

0.6895
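To put the two scores side by side, the drop can be computed directly from the values reported above. This is plain arithmetic on the slide's numbers, not a new result.

```r
f_full    <- 0.6895  # F-score with all features
f_reduced <- 0.6738  # F-score with the top 10 features

abs_drop <- f_full - f_reduced  # absolute difference in F-score
rel_drop <- abs_drop / f_full   # difference relative to the full model

round(abs_drop, 4)
round(100 * rel_drop, 1)  # relative drop in percent
```

The reduced model gives up roughly 2% of the original F-score while using only ten predictors.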

Let's practice!
