Logistic regression for probability of default

Credit Risk Modeling in Python

Michael Crabtree

Data Scientist, Ford Motor Company

Probability of default

  • The probability of default is the likelihood that someone will default on a loan
  • It is a probability value between 0 and 1, such as 0.86
  • loan_status is 1 for a default and 0 for a non-default
Probability of Default   Interpretation             Predicted loan status
0.4                      Unlikely to default        0
0.90                     Very likely to default     1
0.1                      Very unlikely to default   0
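
The mapping from probability to predicted loan status in the table can be sketched as a simple cutoff. The 0.5 threshold and the function name `predicted_loan_status` are assumptions for illustration; the slides do not state which cutoff is used.

```python
# Assumed cutoff: probabilities above 0.5 are labeled as defaults (1).
THRESHOLD = 0.5


def predicted_loan_status(prob_default, threshold=THRESHOLD):
    """Return 1 (default) if the probability exceeds the threshold, else 0."""
    return 1 if prob_default > threshold else 0


# The three probabilities from the table above.
probabilities = [0.4, 0.90, 0.1]
statuses = [predicted_loan_status(p) for p in probabilities]
print(statuses)  # [0, 1, 0]
```

A different threshold would change which loans are flagged as defaults, so the cutoff is a modeling choice rather than a fixed rule.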

Predicting probabilities

  • Probabilities of default are the output of a machine learning model
    • Models learn from data in columns (features)
  • These are classification models (default, non-default)
  • Two most common models:
    • Logistic regression
    • Decision tree

Example of logistic regression and decision tree
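
As a rough sketch of how both model types yield a probability of default, the example below fits each on a small synthetic data set. The data, the variable names, and the `max_depth=3` setting are assumptions for illustration and are not taken from the course.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two numeric features and a 0/1 default label.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

models = {
    "logistic regression": LogisticRegression(solver="lbfgs"),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

prob_default = {}
for name, model in models.items():
    model.fit(X_toy, y_toy)
    # Column 1 of predict_proba holds the probability of class 1 (default).
    prob_default[name] = model.predict_proba(X_toy)[:, 1]
    print(name, prob_default[name][:3].round(2))
```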


Logistic regression

  • Similar to linear regression, but produces only values between 0 and 1

Formula for linear regression and logistic regression

Example graph of linear regression and logistic regression
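
For reference, the standard textbook forms of the two models are written out below. The notation (coefficients β, features x) is an assumption and may differ from the notation on the original slide.

```latex
% Linear regression: the output is unbounded
y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n

% Logistic regression: the same linear combination passed through
% the logistic (sigmoid) function, so the output lies between 0 and 1
P(\text{loan\_status} = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
```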


Training a logistic regression

  • Logistic regression is available in the scikit-learn package
import numpy as np
from sklearn.linear_model import LogisticRegression
  • The model is created like a function call, with or without parameters
clf_logistic = LogisticRegression(solver='lbfgs')
  • Uses the method .fit() to train
clf_logistic.fit(training_columns, np.ravel(training_labels))
  • Training columns: all of the columns in our data except loan_status
  • Labels: loan_status (0 or 1)
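
The training steps above can be sketched end to end on synthetic data. The data frame contents and the feature names `feature_a` and `feature_b` are hypothetical stand-ins, since the course data set is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the course data; the feature names are made up.
rng = np.random.default_rng(0)
n_rows = 200
training_columns = pd.DataFrame({
    "feature_a": rng.normal(size=n_rows),
    "feature_b": rng.normal(size=n_rows),
})
noise = 0.5 * rng.normal(size=n_rows)
training_labels = pd.DataFrame({
    "loan_status": (training_columns["feature_a"] + noise > 0).astype(int)
})

clf_logistic = LogisticRegression(solver="lbfgs")
# np.ravel flattens the single-column label DataFrame into a 1-D array,
# which is the shape .fit() expects for labels.
clf_logistic.fit(training_columns, np.ravel(training_labels))

# One coefficient per training column, plus an intercept.
print(clf_logistic.coef_, clf_logistic.intercept_)
```

Passing the labels through `np.ravel` avoids the warning scikit-learn raises when labels arrive as a column vector instead of a 1-D array.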

Training and testing

  • The entire data set is usually split into two parts
Data Subset   Usage                                         Portion
Train         Learn from the data to generate predictions   60%
Test          Test learning on new unseen data              40%

Creating the training and test sets

  • Separate the data into training columns and labels
X = cr_loan.drop('loan_status', axis = 1)
y = cr_loan[['loan_status']]
  • Use the train_test_split() function from scikit-learn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=123)
  • test_size: proportion of the data held out for the test set (0.4 means 40%)
  • random_state: a random seed value for reproducibility
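
A minimal runnable sketch of the split is shown below. The `cr_loan` data frame here is a hypothetical 100-row stand-in with made-up values, and its column names other than `loan_status` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for cr_loan with 100 rows; real columns may differ.
rng = np.random.default_rng(123)
cr_loan = pd.DataFrame({
    "loan_amnt": rng.integers(1000, 20000, size=100),
    "loan_int_rate": rng.uniform(5, 20, size=100),
    "loan_status": rng.integers(0, 2, size=100),
})

# Training columns are everything except the label column.
X = cr_loan.drop("loan_status", axis=1)
y = cr_loan[["loan_status"]]

# Hold out 40% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=123
)

print(X_train.shape, X_test.shape)  # (60, 2) (40, 2)
```

Rerunning with the same `random_state` gives the same rows in each subset, which makes results comparable across experiments.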

Let's practice!
