Missing data and coarse classification

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Outlier deleted

loan_status  loan_amnt  int_rate  grade   emp_length  home_ownership   annual_inc   age
     0         5000      12.73      C         12         MORTGAGE       6000000     144

Missing inputs

 loan_status    loan_amnt  int_rate  grade emp_length  home_ownership annual_inc   age
...      ...          ...        ...   ...        ...            ...          ...  ...
125        0         6000      14.27     C         14       MORTGAGE        94800   23
126        1         2500       7.51     A         NA            OWN        12000   21
127        0        13500       9.91     B          2       MORTGAGE        36000   30
128        0        25000      12.42     B          2           RENT       225000   30
129        0        10000         NA     C          2           RENT        45900   65
130        0         2500      13.49     C          4           RENT        27200   26  
...      ...          ...        ...   ...        ...            ...          ...  ...
2108       0         8000       7.90     A          8           RENT        64000   24
2109       0        12000       8.90     A          0           RENT        38400   26
2110       0         4000         NA     A          7           RENT        48000   30
2111       0         7000       9.91     B         20       MORTGAGE       130000   30
2112       0         7600       6.03     A         41       MORTGAGE        70920   28
2113       0        10000      11.71     B          5           RENT        48132   22
2114       0         8000       6.62     A         17            OWN        42000   24
2115       0         4475         NA     B         NA            OWN        15000   23
2116       0         5750       8.90     A          3           RENT        17000   21
2117       0         4900       6.03     A         12       MORTGAGE        77000   27
...      ...          ...        ...   ...        ...            ...          ...  ...

Missing inputs

summary(loan_data$emp_length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   2.000   4.000   6.145   8.000  62.000     809
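
The same check can be run on every column at once. The following is a minimal sketch on a toy data frame: `loan_toy` and its values are hypothetical stand-ins, not the course data set.

```r
# Toy stand-in for loan_data (hypothetical values, not the course data set)
loan_toy <- data.frame(
  loan_amnt  = c(6000, 2500, 13500, 10000, 4475),
  int_rate   = c(14.27, 7.51, 9.91, NA, NA),
  emp_length = c(14, NA, 2, 2, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(loan_toy))
print(na_counts)

# Share of rows with at least one missing value
share_incomplete <- mean(!complete.cases(loan_toy))
print(share_incomplete)
```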

Missing inputs: strategies

  • Delete row/column
  • Replace
  • Keep

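Delete rows

# Row indices where emp_length is missing
index_NA <- which(is.na(loan_data$emp_length))
# Drop those rows (assumes at least one NA; -integer(0) would select no rows)
loan_data_no_NA <- loan_data[-c(index_NA), ]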
loan_status   loan_amnt   int_rate grade emp_length  home_ownership  annual_inc  age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT       225000   30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
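
Row deletion can also be written with a logical index. This is a sketch on hypothetical toy data (`loan_toy` and `complete_toy` are illustrative names), showing why the logical form is a little safer when a column turns out to have no NAs at all.

```r
# Toy stand-in for loan_data (hypothetical values)
loan_toy <- data.frame(
  loan_amnt  = c(6000, 2500, 13500, 4475, 5750),
  emp_length = c(14, NA, 2, NA, 3)
)

# Keep only the rows whose emp_length is observed
loan_toy_no_NA <- loan_toy[!is.na(loan_toy$emp_length), ]
print(nrow(loan_toy))        # rows before
print(nrow(loan_toy_no_NA))  # rows after

# Pitfall check: with no NAs, which() returns integer(0) and the
# negative-index form drops every row, while logical indexing keeps all
complete_toy <- data.frame(emp_length = c(1, 2, 3))
idx <- which(is.na(complete_toy$emp_length))
rows_negative <- nrow(complete_toy[-idx, , drop = FALSE])
rows_logical  <- nrow(complete_toy[!is.na(complete_toy$emp_length), , drop = FALSE])
```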

Delete column

loan_data_delete_employ <- loan_data
loan_data_delete_employ$emp_length <- NULL
loan_status   loan_amnt   int_rate grade   home_ownership  annual_inc  age
...     ...         ...        ...   ...              ...          ...  ...
125       0        6000      14.27     C         MORTGAGE        94800   23
126       1        2500       7.51     A              OWN        12000   21
127       0       13500       9.91     B         MORTGAGE        36000   30
128       0       25000      12.42     B             RENT       225000   30
129       0       10000         NA     C             RENT        45900   65
130       0        2500      13.49     C             RENT        27200   26  
...     ...         ...        ...   ...              ...          ...  ...
2112      0        7600       6.03     A         MORTGAGE        70920   28
2113      0       10000      11.71     B             RENT        48132   22
2114      0        8000       6.62     A              OWN        42000   24
2115      0        4475         NA     B              OWN        15000   23
2116      0        5750       8.90     A             RENT        17000   21
...     ...         ...        ...   ...              ...          ...  ...

Replace: median imputation

# Row indices where emp_length is missing
index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
# Fill the missing values with the median of the observed ones
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT       225000   30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Replace: median imputation

# Row indices where emp_length is missing
index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
# Fill the missing values with the median of the observed ones
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A          4            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT       225000   30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B          4            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
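
If several numeric columns need the same treatment, the imputation step can be wrapped in a small helper. This is only a sketch: `impute_median` is a hypothetical name, and the vector holds just the emp_length values visible in the rows shown here, not the full data set.

```r
# Hypothetical helper: replace the NAs in a numeric vector by its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# emp_length values copied from the rows shown on the slide
emp_length <- c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)

emp_median <- median(emp_length, na.rm = TRUE)
emp_length_replace <- impute_median(emp_length)

print(emp_median)
print(emp_length_replace)
```

On this excerpt the median of the observed values is 4, so both missing entries are filled with 4, as in the table.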

Keep

  • Keep the NAs
  • Problem: many models drop rows that contain NAs
  • Solution: coarse classification, i.e. put the variable in "bins"
    • New variable emp_cat
    • Range: 0-62 years → create bins of about 15 years
    • Categories: "0-15", "15-30", "30-45", "45+", "Missing"
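
The binning described above can be sketched with base R's `cut()`. This is one possible construction, not necessarily the one used in the course. The right-open bin boundaries are an assumption, since the slide does not say which bin a value such as 15 belongs to, and the vector holds only the emp_length values from the example rows.

```r
# emp_length values copied from the example rows
emp_length <- c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)

# Assumption: bins are right-open, so 15 falls in "15-30" and 45 in "45+"
emp_cat <- cut(
  emp_length,
  breaks = c(0, 15, 30, 45, Inf),
  labels = c("0-15", "15-30", "30-45", "45+"),
  right = FALSE
)

# Turn NA into an explicit "Missing" level so the rows are not dropped
emp_cat <- addNA(emp_cat)
levels(emp_cat)[is.na(levels(emp_cat))] <- "Missing"

print(emp_cat)
```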

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade  emp_length  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C         14       MORTGAGE        94800   23
126       1         2500       7.51     A         NA            OWN        12000   21
127       0        13500       9.91     B          2       MORTGAGE        36000   30
128       0        25000      12.42     B          2           RENT       225000   30
129       0        10000         NA     C          2           RENT        45900   65
130       0         2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A         41       MORTGAGE        70920   28
2113      0        10000      11.71     B          5           RENT        48132   22
2114      0         8000       6.62     A         17            OWN        42000   24
2115      0         4475         NA     B         NA            OWN        15000   23
2116      0         5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade     emp_cat  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C      0-15      MORTGAGE        94800   23
126       1         2500       7.51     A   Missing           OWN        12000   21
127       0        13500       9.91     B      0-15      MORTGAGE        36000   30
128       0        25000      12.42     B      0-15          RENT       225000   30
129       0        10000         NA     C      0-15          RENT        45900   65
130       0         2500      13.49     C      0-15          RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A     30-45      MORTGAGE        70920   28
2113      0        10000      11.71     B      0-15          RENT        48132   22
2114      0         8000       6.62     A     15-30           OWN        42000   24
2115      0         4475         NA     B   Missing           OWN        15000   23
2116      0         5750       8.90     A      0-15          RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Bin frequencies

plot(loan_data$emp_cat)

[Figure: bar chart of emp_cat frequencies]

emp_cat
  ...
  0-15
  Missing
  0-15
  0-15
  0-15
  0-15
  ...
  30-45
  0-15
  15-30
  Missing
  0-15
  ...

Bin frequencies

plot(loan_data$emp_cat)

[Figure: bar chart of emp_cat frequencies with the finer bins]

emp_cat
  ...
  8+
  Missing
  0-2
  0-2
  0-2
  3-4
  ...
  8+
  5-8
  8+
  Missing
  3-4
  ...
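
The finer bins and their frequencies can be reproduced numerically as well as plotted. The sketch below makes an assumption about the boundaries: "5-8" is taken to include 8 and "8+" to mean more than 8 years, which the slide does not state. The vector again holds only the emp_length values from the example rows.

```r
# emp_length values copied from the example rows
emp_length <- c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)

# Assumed boundaries: 0-2, 3-4, 5-8, and "8+" meaning more than 8 years
emp_cat <- cut(
  emp_length,
  breaks = c(-Inf, 2, 4, 8, Inf),
  labels = c("0-2", "3-4", "5-8", "8+")
)
emp_cat <- addNA(emp_cat)
levels(emp_cat)[is.na(levels(emp_cat))] <- "Missing"

# Counts and shares per bin; plot(emp_cat) would draw the same counts as bars
bin_counts <- table(emp_cat)
print(bin_counts)
print(round(prop.table(bin_counts), 2))
```

On this small excerpt the 15-year bins would put 7 of the 9 observed values in "0-15", while the finer bins spread them more evenly.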

Final remarks

  • Treat outliers as NAs

           CONTINUOUS                             CATEGORICAL
DELETE     Delete rows (observations with NAs)    Delete rows (observations with NAs)
           Delete column (entire variable)        Delete column (entire variable)
REPLACE    Replace with the median                Replace with the most frequent category
KEEP       Keep as NA (not always possible)       NA category
           Keep using coarse classification
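
The categorical column of this overview, together with the remark on outliers, can be sketched as follows. The toy data frame is hypothetical, the age cutoff of 100 is an assumption chosen only to catch the value 144 from the outlier example, and "Missing" is just one possible label for the NA category.

```r
# Toy stand-in for loan_data (hypothetical values)
loan_toy <- data.frame(
  age            = c(23, 21, 144, 30, 65),
  home_ownership = factor(c("RENT", NA, "MORTGAGE", "RENT", "OWN"))
)

# Treat an implausible age as missing (the cutoff of 100 is an assumption)
loan_toy$age[which(loan_toy$age > 100)] <- NA

# REPLACE: fill a categorical NA with the most frequent observed category
ownership_counts <- table(loan_toy$home_ownership)
most_frequent <- names(ownership_counts)[which.max(ownership_counts)]
ownership_replace <- loan_toy$home_ownership
ownership_replace[is.na(ownership_replace)] <- most_frequent

# KEEP: give the missing values their own category instead
ownership_keep <- addNA(loan_toy$home_ownership)
levels(ownership_keep)[is.na(levels(ownership_keep))] <- "Missing"

print(loan_toy$age)
print(ownership_replace)
print(ownership_keep)
```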

Let's practice!
