Missing data and coarse classification

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Outlier deleted

loan_status  loan_amnt  int_rate  grade   emp_length  home_ownership   annual_inc   age
     0         5000      12.73      C         12         MORTGAGE       6000000     144

Missing inputs

 loan_status    loan_amnt  int_rate  grade emp_length  home_ownership annual_inc   age
...      ...          ...        ...   ...        ...            ...          ...  ...
125        0         6000      14.27     C         14       MORTGAGE        94800   23
126        1         2500       7.51     A         NA            OWN        12000   21
127        0        13500       9.91     B          2       MORTGAGE        36000   30
128        0        25000      12.42     B          2           RENT        225000  30
129        0        10000         NA     C          2           RENT        45900   65
130        0         2500      13.49     C          4           RENT        27200   26  
...      ...          ...        ...   ...        ...            ...          ...  ...
2108       0         8000       7.90     A          8           RENT        64000   24
2109       0        12000       8.90     A          0           RENT        38400   26
2110       0         4000         NA     A          7           RENT        48000   30
2111       0         7000       9.91     B         20       MORTGAGE       130000   30
2112       0         7600       6.03     A         41       MORTGAGE        70920   28
2113       0        10000      11.71     B          5           RENT        48132   22
2114       0         8000       6.62     A         17            OWN        42000   24
2115       0         4475         NA     B         NA            OWN        15000   23
2116       0         5750       8.90     A          3           RENT        17000   21
2117       0         4900       6.03     A         12       MORTGAGE        77000   27
…         …          …           …      …          …            …           …      …

Missing inputs

summary(loan_data$emp_length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   2.000   4.000   6.145   8.000  62.000     809
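
To see at a glance which inputs have missing values, the NAs can be counted per column. The sketch below is only an illustration: the toy `loan_data` is a hypothetical stand-in built from six of the rows shown on the earlier slide, not the full data set.

```r
# Toy stand-in for loan_data (hypothetical subset: rows 125-130 from the slide)
loan_data <- data.frame(
  loan_status = c(0, 1, 0, 0, 0, 0),
  int_rate    = c(14.27, 7.51, 9.91, 12.42, NA, 13.49),
  emp_length  = c(14, NA, 2, 2, 2, 4)
)

# Number of missing values in each column
na_counts <- colSums(is.na(loan_data))
print(na_counts)

# Share of missing values in each column
na_share <- colMeans(is.na(loan_data))
print(round(na_share, 3))
```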

Missing inputs: strategies

  • Delete row/column
  • Replace
  • Keep

Delete rows

# Row indices where emp_length is missing
index_NA <- which(is.na(loan_data$emp_length))
# Drop those rows
loan_data_no_NA <- loan_data[-c(index_NA), ]
loan_status   loan_amnt   int_rate grade emp_length  home_ownership  annual_inc  age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
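
One caveat with negative indexing: if a column has no NAs, `which()` returns an empty vector and `loan_data[-index_NA, ]` drops every row. A possible alternative, sketched here on a hypothetical six-row stand-in for `loan_data`, is logical indexing, with `complete.cases()` for dropping rows that have an NA in any column.

```r
# Toy stand-in for loan_data (hypothetical subset: rows 125-130 from the slide)
loan_data <- data.frame(
  loan_amnt  = c(6000, 2500, 13500, 25000, 10000, 2500),
  int_rate   = c(14.27, 7.51, 9.91, 12.42, NA, 13.49),
  emp_length = c(14, NA, 2, 2, 2, 4)
)

# Keep rows where emp_length is observed; this also works when no NAs exist
loan_data_no_NA <- loan_data[!is.na(loan_data$emp_length), ]

# Drop rows that have an NA in any column
loan_data_complete <- loan_data[complete.cases(loan_data), ]

print(nrow(loan_data_no_NA))
print(nrow(loan_data_complete))
```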

Delete column

loan_data_delete_employ <- loan_data
loan_data_delete_employ$emp_length <- NULL
loan_status   loan_amnt   int_rate grade   home_ownership  annual_inc  age
...     ...         ...        ...   ...              ...          ...  ...
125       0        6000      14.27     C         MORTGAGE        94800   23
126       1        2500       7.51     A              OWN        12000   21
127       0       13500       9.91     B         MORTGAGE        36000   30
128       0       25000      12.42     B             RENT        225000  30
129       0       10000         NA     C             RENT        45900   65
130       0        2500      13.49     C             RENT        27200   26  
...     ...         ...        ...   ...              ...          ...  ...
2112      0        7600       6.03     A         MORTGAGE        70920   28
2113      0       10000      11.71     B             RENT        48132   22
2114      0        8000       6.62     A              OWN        42000   24
2115      0        4475         NA     B              OWN        15000   23
2116      0        5750       8.90     A             RENT        17000   21
...     ...         ...        ...   ...              ...          ...  ...

Replace: median imputation

index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Replace: median imputation

index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A          4            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B          4            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
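
A quick way to check the imputation is to confirm that no NAs remain and that the filled-in entries equal the median. The sketch below uses only the eleven `emp_length` values visible on the slide as a hypothetical stand-in for the full column; their median happens to be 4 as well.

```r
# Toy stand-in: the emp_length values shown on the slide (rows 125-130, 2112-2116)
loan_data <- data.frame(
  emp_length = c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)
)

# Median imputation, same steps as on the slide
index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
emp_median <- median(loan_data$emp_length, na.rm = TRUE)
loan_data_replace$emp_length[index_NA] <- emp_median

# Checks: no NAs left, and the imputed entries equal the median
print(emp_median)
print(anyNA(loan_data_replace$emp_length))
print(loan_data_replace$emp_length[index_NA])
```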

Keep

  • Keep NA
  • Problem: will cause row deletions for many models
  • Solution: coarse classification, put variable in "bins"
    • New variable emp_cat
    • Range: 0-62 years → make bins of roughly 15 years
    • Categories: "0-15", "15-30", "30-45", "45+", "Missing"
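
The slides do not show how `emp_cat` is built, so the following is only one possible sketch using `cut()`. The handling of the bin edges (left-closed intervals, so 15 falls in "15-30") is an assumption, and the toy `loan_data` holds just the eleven `emp_length` values visible on the slides.

```r
# Toy stand-in: the emp_length values shown on the slides
loan_data <- data.frame(
  emp_length = c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)
)

# Bin into [0, 15), [15, 30), [30, 45), [45, Inf).
# Assumption: intervals are closed on the left; the slide only gives labels.
emp_cat <- cut(
  loan_data$emp_length,
  breaks = c(0, 15, 30, 45, Inf),
  labels = c("0-15", "15-30", "30-45", "45+"),
  right  = FALSE
)

# Give the NAs their own category instead of dropping them
emp_cat <- as.character(emp_cat)
emp_cat[is.na(loan_data$emp_length)] <- "Missing"
loan_data$emp_cat <- factor(
  emp_cat,
  levels = c("0-15", "15-30", "30-45", "45+", "Missing")
)

print(loan_data)
```

Because "Missing" is an ordinary factor level, models that would otherwise drop rows with NA can keep these observations.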

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade  emp_length  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C         14       MORTGAGE        94800   23
126       1         2500       7.51     A         NA            OWN        12000   21
127       0        13500       9.91     B          2       MORTGAGE        36000   30
128       0        25000      12.42     B          2           RENT        225000  30
129       0        10000         NA     C          2           RENT        45900   65
130       0         2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A         41       MORTGAGE        70920   28
2113      0        10000      11.71     B          5           RENT        48132   22
2114      0         8000       6.62     A         17            OWN        42000   24
2115      0         4475         NA     B         NA            OWN        15000   23
2116      0         5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade     emp_cat  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C      0-15      MORTGAGE        94800   23
126       1         2500       7.51     A   Missing           OWN        12000   21
127       0        13500       9.91     B      0-15      MORTGAGE        36000   30
128       0        25000      12.42     B      0-15          RENT        225000  30
129       0        10000         NA     C      0-15          RENT        45900   65
130       0         2500      13.49     C      0-15          RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A     30-45      MORTGAGE        70920   28
2113      0        10000      11.71     B      0-15          RENT        48132   22
2114      0         8000       6.62     A     15-30           OWN        42000   24
2115      0         4475         NA     B   Missing           OWN        15000   23
2116      0         5750       8.90     A      0-15          RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Bin frequencies

plot(loan_data$emp_cat)

emp_cat
  ...
  0-15
  Missing
  0-15
  0-15
  0-15
  0-15
  ...
  30-45
  0-15
  15-30
  Missing
  0-15
  ...

Bin frequencies

plot(loan_data$emp_cat)

emp_cat
  ...
  8+
  Missing
  0-2
  0-2
  0-2
  3-4
  ...
  8+
  5-8
  8+
  Missing
  3-4
  ...
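
The bin counts behind such a plot can also be inspected directly with `table()`. The sketch below rebuilds the narrower bins from the column above on the eleven visible `emp_length` values; the break points are an assumption (whole years, with 8 placed in "5-8"), since the slide shows only the labels.

```r
# Toy stand-in: the emp_length values shown on the slides
emp_length <- c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)

# Assumed bins for whole years: [0, 3), [3, 5), [5, 9), [9, Inf)
emp_cat <- cut(
  emp_length,
  breaks = c(0, 3, 5, 9, Inf),
  labels = c("0-2", "3-4", "5-8", "8+"),
  right  = FALSE
)
emp_cat <- as.character(emp_cat)
emp_cat[is.na(emp_length)] <- "Missing"
emp_cat <- factor(emp_cat, levels = c("0-2", "3-4", "5-8", "8+", "Missing"))

# Counts and shares per bin; plot(emp_cat) would draw these counts as bars
bin_counts <- table(emp_cat)
print(bin_counts)
print(round(prop.table(bin_counts), 2))
```

Comparing the counts for different break points helps in choosing bins that are neither nearly empty nor dominated by a single category.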

Final remarks

  • Treat outliers as NAs

         CONTINUOUS                            CATEGORICAL
DELETE   Delete rows (observations with NAs)   Delete rows (observations with NAs)
         Delete column (entire variable)       Delete column (entire variable)
REPLACE  Replace using median                  Replace using most frequent category
KEEP     Keep as NA (not always possible)      NA category
         Keep using coarse classification
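
Two of these remarks have no code on the slides: treating an outlier as NA and replacing a missing category with the most frequent one. The sketch below illustrates both on hypothetical data; the age cutoff of 100 and the NA in `home_ownership` are assumptions made purely for the example.

```r
# Hypothetical toy data: one implausible age and one missing category
loan_data <- data.frame(
  age            = c(23, 21, 30, 144, 65, 26),
  home_ownership = factor(c("MORTGAGE", "OWN", NA, "RENT", "RENT", "RENT"))
)

# Treat the outlier as NA (the cutoff of 100 is an assumed threshold)
loan_data$age[which(loan_data$age > 100)] <- NA

# Most frequent category; table() ignores NAs by default
freq     <- table(loan_data$home_ownership)
mode_cat <- names(freq)[which.max(freq)]

# Replace missing categories with the most frequent one
loan_data$home_ownership[is.na(loan_data$home_ownership)] <- mode_cat

print(loan_data)
```

Once the outlier is an NA, any of the strategies in the table (delete, replace, keep) can be applied to it.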

Let's practice!
