Cross validation for credit models

Credit Risk Modeling in Python

Michael Crabtree

Data Scientist, Ford Motor Company

Cross validation basics

  • Used to train and test the model in a way that simulates using the model on new data
  • Segments training data into different pieces to estimate future performance
  • Uses DMatrix, an internal structure optimized for XGBoost
  • Early stopping ends cross validation once the scoring metric has not improved for a set number of consecutive iterations
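
The early stopping rule can be sketched in plain Python. This is a minimal illustration of the idea, not XGBoost's internal code: the function name `early_stopping_round`, the `patience` argument, and the AUC values are all hypothetical and made up for the example.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, best_score), stopping once `patience`
    rounds in a row fail to beat the best score (higher is better)."""
    best_score = float("-inf")
    best_round = -1
    for i, score in enumerate(scores):
        if score > best_score:
            # New best: remember it and reset the waiting period
            best_score, best_round = score, i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds: stop looking
            break
    return best_round, best_score

# Hypothetical per-round validation AUC values (illustrative only)
auc_by_round = [0.80, 0.84, 0.86, 0.85, 0.86, 0.855, 0.85, 0.84, 0.90]
best_round, best_auc = early_stopping_round(auc_by_round, patience=5)
print(best_round, best_auc)
```

With a patience of 5, the loop stops before it reaches the last value, so the late jump to 0.90 is never seen. A larger patience trades extra training rounds for a smaller chance of stopping too soon.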

How cross validation works

  • Trains on parts of the training data (called folds) and tests against the held-out part
  • Final testing against the actual test set

Diagram of k-folds cross validation

1 https://scikit-learn.org/stable/modules/cross_validation.html
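
The fold mechanics can be sketched without any library. This is a simplified illustration that assumes contiguous, unshuffled folds; the helper name `k_fold_indices` is hypothetical, and in practice scikit-learn's `KFold` provides this kind of splitting.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        # The current fold is held out for validation
        valid_idx = list(range(start, start + size))
        # Every other row is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

# Ten rows split into five folds: each row is validated exactly once
for fold, (train_idx, valid_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, train_idx, valid_idx)
```

Each row lands in exactly one validation fold, so every observation is scored by a model that did not train on it.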

Setting up cross validation within XGBoost

# Set the number of folds
n_folds = 2
# Set early stopping number
early_stop = 5
# Set any specific parameters for cross validation
params = {'objective': 'binary:logistic',
          'seed': 99, 'eval_metric':'auc'}
  • 'objective':'binary:logistic' is used to specify classification for loan_status
  • 'eval_metric':'auc' tells XGBoost to score the model's performance on AUC

Using cross validation within XGBoost

# Import xgboost
import xgboost as xgb
# Restructure the train data for xgboost
DTrain = xgb.DMatrix(X_train, label = y_train)
# Perform cross validation
xgb.cv(params, DTrain, num_boost_round = 5, nfold=n_folds,
       early_stopping_rounds=early_stop)
  • DMatrix() creates a special object for xgboost optimized for training

The results of cross validation

  • xgb.cv() returns a data frame with the mean and standard deviation of the train and test metric for each boosting round

Example of cross validation scores
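
One way to read such a data frame is to look for the round with the best mean test score. The sketch below assumes the column layout that xgb.cv() produces for 'eval_metric':'auc'; the numbers are hypothetical and made up for illustration, not results from the course data.

```python
import pandas as pd

# Hypothetical values in the xgb.cv() column layout for the 'auc' metric
cv_results = pd.DataFrame({
    "train-auc-mean": [0.90, 0.92, 0.93, 0.94],
    "train-auc-std":  [0.002, 0.002, 0.001, 0.001],
    "test-auc-mean":  [0.88, 0.90, 0.91, 0.905],
    "test-auc-std":   [0.004, 0.003, 0.003, 0.004],
})

# The row index is the boosting round; pick the round with the best test AUC
best_round = int(cv_results["test-auc-mean"].idxmax())
best_auc = cv_results.loc[best_round, "test-auc-mean"]
print(best_round, best_auc)
```

In these made-up numbers, train AUC keeps rising while test AUC peaks and then dips, which is the pattern early stopping is meant to catch.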


Cross validation scoring

  • The cross_val_score() function in scikit-learn combines cross validation with a scoring metric
# Import the module
from sklearn.model_selection import cross_val_score
# Create a gbt model
gbt = xgb.XGBClassifier(learning_rate = 0.4, max_depth = 10)
# Use cross validation with 5 folds and accuracy scores
cross_val_score(gbt, X_train, y_train, cv = 5)
array([0.92748092, 0.92575308, 0.93975392, 0.93378608, 0.93336163])
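
cross_val_score() reports accuracy for classifiers by default, and its scoring argument can switch the metric. The sketch below scores on AUC instead. To stay self-contained it makes two assumptions that are not from the course: synthetic data from make_classification stands in for the loan data, and scikit-learn's GradientBoostingClassifier stands in for XGBClassifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the loan data (not the course data set)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=99)

# Stand-in gradient boosted tree model (scikit-learn, not XGBoost)
gbt = GradientBoostingClassifier(n_estimators=50, random_state=99)

# Five-fold cross validation scored on AUC rather than accuracy
auc_scores = cross_val_score(gbt, X, y, cv=5, scoring="roc_auc")
print(auc_scores.mean(), auc_scores.std())
```

Reporting the mean and spread of the fold scores gives a more stable view of performance than any single fold.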

Let's practice!
