Credit Risk Modeling in Python
Michael Crabtree
Data Scientist, Ford Motor Company
The values of loan_status are the classes: 0 (non-default) and 1 (default)
y_train['loan_status'].value_counts()
loan_status | Training Data Count | Percentage of Total |
---|---|---|
0 | 13,798 | 78% |
1 | 3,877 | 22% |
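The percentages in the table can be derived directly from the counts. A minimal sketch, assuming a stand-in y_train rebuilt from the counts in the table (the real y_train would come from the train/test split):

```python
import pandas as pd

# Stand-in for y_train, rebuilt from the counts in the table (assumption)
y_train = pd.DataFrame({'loan_status': [0] * 13798 + [1] * 3877})

# Absolute number of rows per class
counts = y_train['loan_status'].value_counts()

# Share of each class as a fraction of all training rows
shares = y_train['loan_status'].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```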
Models in xgboost use a loss function of log-loss

True loan status | Predicted probability of default | Log Loss |
---|---|---|
1 | 0.1 | 2.3 |
0 | 0.9 | 2.3 |
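Both rows follow from the per-observation log-loss formula, -(y * log(p) + (1 - y) * log(1 - p)), where p is read as the predicted probability of default. A small sketch using only the standard library; the helper name log_loss_single is made up for illustration:

```python
import math

def log_loss_single(y_true, p_default):
    """Log-loss of one prediction; p_default is the predicted probability of class 1."""
    return -(y_true * math.log(p_default) + (1 - y_true) * math.log(1 - p_default))

# Actual default (1) that the model scored at a 10% probability of default
loss_missed_default = log_loss_single(1, 0.1)

# Actual non-default (0) that the model scored at a 90% probability of default
loss_false_alarm = log_loss_single(0, 0.9)

print(round(loss_missed_default, 1), round(loss_false_alarm, 1))
```

The loss function scores both mistakes identically, even though their financial consequences can differ widely.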
Person | Loan Amount | Potential Profit | Predicted Status | Actual Status | Losses |
---|---|---|---|---|---|
A | $1,000 | $10 | Default | Non-Default | -$10 |
B | $1,000 | $10 | Non-Default | Default | -$1,000 |
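The asymmetry in the table can be made explicit by attaching a cost to each type of error. A rough sketch using the table's figures; the function name is hypothetical, and the rule that a missed default loses the full loan amount while a false alarm loses only the potential profit is read off the two rows, not taken from a real pricing model:

```python
def misclassification_loss(predicted, actual, loan_amount, potential_profit):
    """Return the monetary loss of one prediction (0 when the prediction is correct)."""
    if predicted == actual:
        return 0
    if predicted == 'Default' and actual == 'Non-Default':
        # Good borrower flagged as a default: the potential profit is forgone
        return -potential_profit
    # Default predicted as non-default: the amount lent is lost
    return -loan_amount

loss_a = misclassification_loss('Default', 'Non-Default', 1000, 10)
loss_b = misclassification_loss('Non-Default', 'Default', 1000, 10)

print(loss_a, loss_b)
```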
Method | Pros | Cons |
---|---|---|
Gather more data | Increases number of defaults | Percentage of defaults may not change |
Penalize models | Increases recall for defaults | Model requires more tuning and maintenance |
Sample data differently | Least technical adjustment | Fewer defaults in data |
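As one illustration of the "Penalize models" row, errors on defaults can be weighted more heavily in proportion to how rare defaults are. The sketch below only computes such a weight from the training counts given earlier. Passing it to xgboost through its scale_pos_weight parameter is one possible use, offered here as an assumption rather than as the method used in this course.

```python
# Class counts from the training data table
count_nondefault = 13798
count_default = 3877

# Ratio of non-defaults to defaults: how much more each default would need to
# weigh for both classes to contribute equally to the loss
default_weight = count_nondefault / count_default

print(round(default_weight, 2))
```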
Undersample the non-defaults so that both classes of loan_status are equally represented:
import pandas as pd

# Join the training features and labels into one DataFrame
X_y_train = pd.concat([X_train.reset_index(drop = True),
                       y_train.reset_index(drop = True)], axis = 1)
# Get the counts of non-defaults and defaults
# (value_counts() sorts by frequency, so the majority class, non-default, comes first)
count_nondefault, count_default = X_y_train['loan_status'].value_counts()
# Separate non-defaults and defaults
nondefaults = X_y_train[X_y_train['loan_status'] == 0]
defaults = X_y_train[X_y_train['loan_status'] == 1]
# Undersample the non-defaults using sample() in pandas
nondefaults_under = nondefaults.sample(count_default)
# Concat the undersampled non-defaults with the defaults
X_y_train_under = pd.concat([nondefaults_under.reset_index(drop = True),
                             defaults.reset_index(drop = True)], axis = 0)
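After undersampling, it is worth confirming that the two classes are balanced and then splitting the frame back into features and labels for training. A minimal sketch on a tiny made-up data set; the column loan_amnt and its values are placeholders, and random_state is added only to make the sample reproducible:

```python
import pandas as pd

# Toy stand-in for X_y_train: 8 non-defaults and 2 defaults (made-up values)
X_y_train = pd.DataFrame({
    'loan_amnt': [1000, 2000, 1500, 3000, 2500, 1200, 1800, 2200, 5000, 4000],
    'loan_status': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

nondefaults = X_y_train[X_y_train['loan_status'] == 0]
defaults = X_y_train[X_y_train['loan_status'] == 1]

# Keep only as many non-defaults as there are defaults
nondefaults_under = nondefaults.sample(len(defaults), random_state=42)
X_y_train_under = pd.concat([nondefaults_under, defaults], axis=0).reset_index(drop=True)

# Both classes should now have the same count
print(X_y_train_under['loan_status'].value_counts())

# Split back into features and labels for model training
X_train_under = X_y_train_under.drop('loan_status', axis=1)
y_train_under = X_y_train_under['loan_status']
```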