Risk with missing data in loan data

Credit Risk Modeling in Python

Michael Crabtree

Data Scientist, Ford Motor Company

What is missing data?

  • NULLs in a row instead of an actual value
  • An empty string ''
  • Not the same as an entirely empty row
  • Can occur in any column in the data
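
A minimal sketch of the two kinds of missing value listed above. It assumes pandas and NumPy, and the three-row sample is invented for illustration. One point worth noting: isnull() flags NaN and None but does not flag the empty string '', so empty strings may need converting first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the slides, values are made up.
sample = pd.DataFrame({
    "person_emp_length": [5.0, np.nan, 2.0],
    "loan_intent": ["EDUCATION", "", None],
})

# isnull() counts NaN/None only; the empty string is not counted.
null_counts_before = sample.isnull().sum()

# Convert empty strings to NaN so both kinds of gap are counted.
cleaned = sample.copy()
cleaned["loan_intent"] = cleaned["loan_intent"].replace("", np.nan)
null_counts_after = cleaned.isnull().sum()

print(null_counts_before)
print(null_counts_after)
```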

Sample of data frame with missing employment length

Similarities with outliers

  • Negatively affect machine learning model performance
  • May bias models in unanticipated ways
  • May cause errors for some machine learning models

Missing Data Type         Possible Result
NULL in numeric column    Error
NULL in string column     Error

How to handle missing data

  • Generally three ways to handle missing data
    • Replace values where the data is missing
    • Remove the rows containing missing data
    • Leave the rows with missing data unchanged
  • Understanding the data determines the course of action

Missing Data           Interpretation                  Action
NULL in loan_status    Loan recently approved          Remove from prediction data
NULL in person_age     Age not recorded or disclosed   Replace with median
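
The two actions in the table could be sketched as follows. This is an illustration under assumptions, not the course's own code: the sample frame and its values are hypothetical, and only the column names come from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are invented for illustration.
cr_loan_sample = pd.DataFrame({
    "person_age": [22.0, np.nan, 35.0, 41.0],
    "loan_status": [0.0, 1.0, np.nan, 0.0],
})

# NULL in loan_status: no outcome is known yet, so remove the row.
with_status = cr_loan_sample.dropna(subset=["loan_status"]).copy()

# NULL in person_age: replace with the median of the recorded ages.
median_age = with_status["person_age"].median()
with_status["person_age"] = with_status["person_age"].fillna(median_age)

print(with_status)
```

Dropping the rows first means the median is computed only from the records that remain in the data.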

Finding missing data

  • The .isnull() method marks each null value as True
  • Chaining .sum() counts the null values in each column
  • Chaining .any() instead flags each column that contains at least one null
null_columns = cr_loan.columns[cr_loan.isnull().any()]
cr_loan[null_columns].isnull().sum()
# Total number of null values per column
person_home_ownership          25
person_emp_length             895
loan_intent                    25
loan_int_rate                3140
cb_person_default_on_file      15
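
The same pattern on a small, self-contained frame, as a sketch. The sample data is hypothetical; the calls are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are invented for illustration.
cr_loan_sample = pd.DataFrame({
    "person_age": [22, 25, 31],
    "person_emp_length": [np.nan, 4.0, np.nan],
    "loan_int_rate": [11.5, np.nan, 7.9],
})

# Boolean Series: True for each column holding at least one null.
has_null = cr_loan_sample.isnull().any()

# Keep only those column names, then count the nulls in each.
null_columns = cr_loan_sample.columns[has_null]
null_counts = cr_loan_sample[null_columns].isnull().sum()

print(null_counts)
```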

Replacing missing data

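  • Replace missing values with .fillna(), passing an aggregate such as the column's .mean() or .median()
# Assign the result back: an inplace fillna on a selected column may not update cr_loan in newer pandas versions
cr_loan['loan_int_rate'] = cr_loan['loan_int_rate'].fillna(cr_loan['loan_int_rate'].mean())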

Example of missing interest rate replaced with average
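
A small worked version of the mean replacement, assuming a hypothetical three-row frame. The mean is taken over the recorded values only, because .mean() skips nulls by default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are invented for illustration.
cr_loan_sample = pd.DataFrame({"loan_int_rate": [10.0, np.nan, 14.0]})

# Mean of the non-null rates: (10.0 + 14.0) / 2
mean_rate = cr_loan_sample["loan_int_rate"].mean()

# Fill the gap and assign the result back to the column.
cr_loan_sample["loan_int_rate"] = cr_loan_sample["loan_int_rate"].fillna(mean_rate)

print(cr_loan_sample)
```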

Dropping missing data

  • Use indices to identify the records, the same way as with outliers
  • Remove those records entirely with the .drop() method
indices = cr_loan[cr_loan['person_emp_length'].isnull()].index
cr_loan.drop(indices, inplace=True)
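
As a sketch, the index-based drop can be compared with .dropna(subset=...), an alternative pandas method that the slides do not use. The sample frame and the loan_amnt column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; loan_amnt and all values are invented for illustration.
cr_loan_sample = pd.DataFrame({
    "person_emp_length": [3.0, np.nan, 8.0, np.nan],
    "loan_amnt": [5000, 12000, 7500, 3000],
})

# Approach from the slides: collect the row labels, then drop them.
indices = cr_loan_sample[cr_loan_sample["person_emp_length"].isnull()].index
dropped_by_index = cr_loan_sample.drop(indices)

# Alternative: drop rows that are null in the named column.
dropped_by_dropna = cr_loan_sample.dropna(subset=["person_emp_length"])

print(dropped_by_index)
```

Both calls return new frames here rather than modifying the original, which keeps the unfiltered data available for comparison.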

Let's practice!
