Outliers in Credit Data

Credit Risk Modeling in Python

Michael Crabtree

Data Scientist, Ford Motor Company

Data processing

  • Prepared data allows models to train faster
  • Often positively impacts model performance

ROC Chart of three different models

Outliers and performance

Possible causes of outliers:

  • Problems with data entry systems (human error)
  • Issues with data ingestion tools

Feature             Coefficient with outliers   Coefficient without outliers
Interest Rate       0.2                         0.01
Employment Length   0.5                         0.6
Income              0.6                         0.75
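
The effect of a single bad record on a fitted coefficient can be sketched with a toy example. This is only an illustration under stated assumptions: the data is made up, and a straight-line least-squares fit from NumPy stands in for whatever credit model is actually being trained.

```python
import numpy as np

# Hypothetical clean data: y follows x with a true slope of 2.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Fit a straight line to the clean data.
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Append one bad record, such as a data entry error.
x_out = np.append(x, 9.0)
y_out = np.append(y, 500.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
```

One extreme row is enough to pull the fitted slope far from the value supported by the other ten rows, which is the kind of shift the table above describes.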

Detecting outliers with cross tables

  • Use cross tables with aggregate functions
pd.crosstab(cr_loan['person_home_ownership'], cr_loan['loan_status'],
            values=cr_loan['loan_int_rate'], aggfunc='mean').round(2)
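
Swapping the aggregate function can make extreme values stand out. The sketch below assumes a small made-up frame, `cr_loan_demo`, in place of the real `cr_loan` data, and uses `aggfunc='max'` on employment length rather than the mean interest rate shown above.

```python
import pandas as pd

# Hypothetical stand-in for cr_loan; the values are invented for illustration.
cr_loan_demo = pd.DataFrame({
    "person_home_ownership": ["RENT", "RENT", "OWN", "OWN", "MORTGAGE", "MORTGAGE"],
    "loan_status": [0, 1, 0, 1, 0, 1],
    "person_emp_length": [2.0, 5.0, 10.0, 123.0, 7.0, 3.0],
})

# Largest employment length in each ownership / loan status cell.
max_emp = pd.crosstab(
    cr_loan_demo["person_home_ownership"],
    cr_loan_demo["loan_status"],
    values=cr_loan_demo["person_emp_length"],
    aggfunc="max",
)
print(max_emp)
```

A cell showing an employment length of 123 years is not plausible, so that group is worth inspecting more closely.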

Detecting outliers visually

  • Histograms
  • Scatter plots

Scatter plot of employment length and loan interest rate
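
Both plot types can be drawn with matplotlib. This is a minimal sketch, assuming a made-up frame `cr_loan_demo` with the same column names as `cr_loan`; the bin count, transparency, and axis labels are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for cr_loan; the values are invented for illustration.
cr_loan_demo = pd.DataFrame({
    "person_emp_length": [2.0, 5.0, 10.0, 123.0, 7.0, 3.0],
    "loan_int_rate": [11.5, 13.2, 7.9, 15.1, 9.4, 12.0],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: an isolated bar far to the right suggests an outlier.
ax_hist.hist(cr_loan_demo["person_emp_length"], bins=20)
ax_hist.set_xlabel("Person Employment Length")

# Scatter plot: outliers appear as points far from the main cloud.
ax_scatter.scatter(
    cr_loan_demo["person_emp_length"],
    cr_loan_demo["loan_int_rate"],
    alpha=0.5,
)
ax_scatter.set_xlabel("Person Employment Length")
ax_scatter.set_ylabel("Loan Interest Rate")
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to display the figure.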

Removing outliers

  • Use the pandas .drop() method to remove rows by index
indices = cr_loan[cr_loan['person_emp_length'] >= 60].index
cr_loan.drop(indices, inplace=True)
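
The same steps can be tried on a small made-up frame to confirm which rows are removed. The frame `cr_loan_demo` and its values are assumptions for illustration. This sketch returns a new frame instead of using `inplace=True`, so the original stays available for comparison.

```python
import pandas as pd

# Hypothetical stand-in for cr_loan; the values are invented for illustration.
cr_loan_demo = pd.DataFrame({
    "person_emp_length": [2.0, 5.0, 123.0, 7.0, 60.0],
    "loan_int_rate": [11.5, 13.2, 15.1, 9.4, 12.0],
})

# Index labels of the rows at or above the 60-year threshold.
indices = cr_loan_demo[cr_loan_demo["person_emp_length"] >= 60].index

# drop() without inplace=True leaves cr_loan_demo unchanged.
cr_loan_clean = cr_loan_demo.drop(indices)

print(len(cr_loan_demo), "rows before,", len(cr_loan_clean), "rows after")
```

Because the comparison is `>=`, a row with exactly 60 years of employment is dropped as well.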

Scatter plot of interest rate and employment length without outliers

Let's practice!