Pruning the decision tree

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Problems with large decision trees

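  • Too complex: the tree is no longer easy to interpret
  • Overfitting: the tree fits the training set closely but performs worse on the test set
  • Solution: prune the tree, using printcp() and plotcp() to choose the complexity parameter (cp)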
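
The workflow in the last bullet can be sketched as follows. This is a minimal illustration, assuming the rpart package is installed; the course's credit data is not reproduced here, so R's built-in iris data is used as a stand-in, and the name big_tree and the minsplit setting are illustrative choices only.

```r
library(rpart)

set.seed(42)  # the cross-validation inside rpart() is random

# Grow a deliberately large tree by using a small cp and a small minsplit.
# iris is a stand-in data set, not the credit data from the course.
big_tree <- rpart(Species ~ .,
                  data = iris,
                  method = "class",
                  control = rpart.control(cp = 0.001, minsplit = 5))

# Table of candidate cp values with their cross-validated error (xerror)
printcp(big_tree)

# The same information as a plot: cross-validated error against cp
plotcp(big_tree)
```

Both functions read the cp table stored in the fitted object, so they can be called on any rpart tree that was grown with cross-validation enabled (the default).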

printcp() and tree_undersample

printcp(tree_undersample)

Classification tree:
rpart(formula = loan_status ~ ., data = undersampled_training_set,
    method = "class", control = rpart.control(cp = 0.001))

Variables actually used in tree construction:
age  annual_inc  emp_cat  grade  home_ownership  ir_cat  loan_amnt

Root node error: 2190/6570 = 0.33333

n= 6570

          CP  nsplit  rel error   xerror      xstd
1  0.0059361       0    1.00000  1.00000  0.017447
2  0.0044140       4    0.97443  0.99909  0.017443
3  0.0036530       7    0.96119  0.98174  0.017366
4  0.0031963       8    0.95753  0.98904  0.017399
                ...
16 0.0010654      76    0.84247  1.02511  0.017554
17 0.0010000      79    0.83927  1.02511  0.017554
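
As a worked reading of this table: rel error and xerror are scaled so that the root node has error 1, so multiplying by the root node error gives an absolute misclassification rate. The sketch below copies only the rows shown above (the elided rows are left out, so the minimum is over the visible rows only); the names cp_shown, root_error and best are illustrative.

```r
# Rows shown in the printcp() output; the elided rows are not included.
cp_shown <- data.frame(
  CP     = c(0.0059361, 0.0044140, 0.0036530, 0.0031963, 0.0010654, 0.0010000),
  nsplit = c(0, 4, 7, 8, 76, 79),
  xerror = c(1.00000, 0.99909, 0.98174, 0.98904, 1.02511, 1.02511)
)

# xerror is relative to the root node error, so multiply by it
# to get an absolute cross-validated misclassification rate.
root_error <- 2190 / 6570
cp_shown$abs_xerror <- cp_shown$xerror * root_error

# Row with the smallest cross-validated error among the rows shown
best <- cp_shown[which.min(cp_shown$xerror), ]
print(best)
```

Among the visible rows, the smallest xerror belongs to row 3, with 7 splits.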

plotcp() and tree_undersample

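Selected complexity parameter: $CP = 0.003653$ (row 3 of the printcp() table, the lowest xerror among the rows shown).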
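
Rather than reading the value off the plot, the cp with the smallest cross-validated error can be pulled from the fitted object's cptable. A minimal sketch, assuming the rpart package and again using iris as a stand-in for the credit data; fit, cp_table and cp_min are illustrative names.

```r
library(rpart)

set.seed(42)  # xerror comes from random cross-validation folds

# Stand-in tree; the course's tree_undersample is not available here.
fit <- rpart(Species ~ .,
             data = iris,
             method = "class",
             control = rpart.control(cp = 0.001, minsplit = 5))

# cptable is a matrix with columns CP, nsplit, rel error, xerror, xstd
cp_table <- fit$cptable

# Index of the row with the lowest cross-validated error
best_row <- which.min(cp_table[, "xerror"])

# The cp value to pass to prune()
cp_min <- cp_table[best_row, "CP"]
cp_min
```

Because the folds are random, the selected row can change from run to run unless a seed is set.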

Plot the pruned tree

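# Prune the large tree at the selected complexity parameter
ptree_undersample <- prune(tree_undersample,
                           cp = 0.003653)

# Draw the tree with uniform vertical spacing between levels
plot(ptree_undersample,
     uniform = TRUE)

# Add the split labels to the plot
text(ptree_undersample)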

Plot the pruned tree

ptree_undersample <- prune(tree_undersample,
                           cp = 0.003653)

plot(ptree_undersample,
     uniform = TRUE)

# use.n = TRUE also prints the number of observations of each class in every leaf
text(ptree_undersample,
     use.n = TRUE)
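
The prune-and-plot steps can be tried end to end without the course data. A minimal sketch, assuming the rpart package and using iris as a stand-in; big_tree, cp_min and pruned are illustrative names, and margin = 0.1 is added only so the labels are not clipped.

```r
library(rpart)

set.seed(42)

# Large stand-in tree (the course uses tree_undersample instead)
big_tree <- rpart(Species ~ .,
                  data = iris,
                  method = "class",
                  control = rpart.control(cp = 0.001, minsplit = 5))

# cp with the lowest cross-validated error
cp_min <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]

# Prune back to that cp; pruning never adds nodes
pruned <- prune(big_tree, cp = cp_min)
cat("nodes before:", nrow(big_tree$frame),
    " nodes after:", nrow(pruned$frame), "\n")

# Plot the pruned tree with class counts in the leaves
plot(pruned, uniform = TRUE, margin = 0.1)
text(pruned, use.n = TRUE)
```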

prp() in the rpart.plot package

library(rpart.plot)
prp(ptree_undersample)

prp() in the rpart.plot package

library(rpart.plot)
prp(ptree_undersample, extra = 1)
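
To compare the two prp() calls side by side without the course data, here is a minimal sketch. It assumes both rpart and rpart.plot are installed and uses iris as a stand-in; fit is an illustrative name.

```r
library(rpart)
library(rpart.plot)  # assumed to be installed; it is not part of base R

# Small stand-in classification tree (not the course's ptree_undersample)
fit <- rpart(Species ~ ., data = iris, method = "class")

# Default plot: each node shows the predicted class
prp(fit)

# extra = 1 additionally shows the number of observations
# per class that fall in each node
prp(fit, extra = 1)
```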

Let's practice!
