Cross validation for credit models

Credit Risk Modeling in Python

Michael Crabtree

Data Scientist, Ford Motor Company

Cross validation basics

Used to train and test the model in a way that simulates using the model on new data
Segments training data into different pieces to estimate future performance
Uses DMatrix, an internal structure optimized for XGBoost
Early stopping tells cross validation to stop once a scoring metric has not improved for a set number of iterations
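
The early stopping rule can be sketched in plain Python. This is a simplified illustration, not XGBoost's internal implementation: the function name, the patience argument, and the AUC values are all hypothetical, and it assumes a metric where higher is better.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_index, stop_index) for a metric where higher is better."""
    best_score = float("-inf")
    best_index = -1
    for i, score in enumerate(scores):
        if score > best_score:
            # New best score: remember it and reset the waiting period
            best_score = score
            best_index = i
        elif i - best_index >= patience:
            # No improvement for `patience` rounds: stop here
            return best_index, i
    # Ran out of rounds without triggering early stopping
    return best_index, len(scores) - 1

# Hypothetical per-round AUC values, for illustration only
auc_by_round = [0.80, 0.84, 0.86, 0.85, 0.86, 0.85, 0.84, 0.83]
best, stopped = best_round_with_early_stopping(auc_by_round, patience=3)
print(best, stopped)
```

Here the score peaks at round index 2, and the loop stops three rounds later because nothing has beaten that peak.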

How cross validation works

Trains on parts of the training data (called folds) and tests against the unused part
Final testing against the actual test set
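
A minimal sketch of how the folds are formed, using only row indices. The helper name k_fold_indices is hypothetical, and the sketch assumes contiguous folds with no shuffling; library implementations usually offer shuffling and stratification as well.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs, holding out one contiguous fold each time."""
    # Spread any remainder rows over the first folds
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the held-out fold is used for training
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop

splits = list(k_fold_indices(n_rows=10, n_folds=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each row is held out exactly once, so every observation is used for both training and testing across the folds.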

Diagram of k-folds cross validation

¹ https://scikit-learn.org/stable/modules/cross_validation.html

Setting up cross validation within XGBoost

# Set the number of folds
n_folds = 2
# Set early stopping number
early_stop = 5
# Set any specific parameters for cross validation
params = {'objective': 'binary:logistic',
          'seed': 99, 'eval_metric':'auc'}

'objective':'binary:logistic' is used to specify classification for loan_status
'eval_metric':'auc' tells XGBoost to score the model's performance on AUC

Using cross validation within XGBoost

# Import xgboost
import xgboost as xgb
# Restructure the train data for xgboost
DTrain = xgb.DMatrix(X_train, label = y_train)
# Perform cross validation
xgb.cv(params, DTrain, num_boost_round = 5, nfold=n_folds,
       early_stopping_rounds=early_stop)

DMatrix() creates a special object for xgboost optimized for training

The results of cross validation

xgb.cv() returns a data frame with the mean and standard deviation of the train and test scores for each boosting round

Example of cross validation scores

Cross validation scoring

Combines cross validation with a scoring metric using the cross_val_score() function in scikit-learn

# Import the function
from sklearn.model_selection import cross_val_score
# Create a gbt model
gbt = xgb.XGBClassifier(learning_rate = 0.4, max_depth = 10)
# Use cross validation to compute accuracy scores on 5 folds
cross_val_score(gbt, X_train, y_train, cv = 5)

array([0.92748092, 0.92575308, 0.93975392, 0.93378608, 0.93336163])
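
The five fold scores are usually summarized as a mean and a spread. A minimal sketch using the values shown above; the variable names are illustrative, and the choice of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Accuracy per fold, copied from the cross_val_score() output above
scores = [0.92748092, 0.92575308, 0.93975392, 0.93378608, 0.93336163]

mean_accuracy = mean(scores)   # average accuracy across the 5 folds
spread = stdev(scores)         # sample standard deviation across folds

print(f"Accuracy: {mean_accuracy:.3f} +/- {spread:.3f}")
```

A small spread relative to the mean suggests the model performs consistently across folds.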

Let's practice!
