Missing data and coarse classification

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Outlier deleted

loan_status  loan_amnt  int_rate  grade   emp_length  home_ownership   annual_inc   age
     0         5000      12.73      C         12         MORTGAGE       6000000     144
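
The row above is the kind of observation that gets removed: an age of 144 is not plausible. A minimal sketch of such a deletion follows, assuming a toy data frame and a cutoff of 100 years. Both are illustrative assumptions, not values taken from the slides.

```r
# Toy data (hypothetical); the last row mimics the implausible record above.
loan_toy <- data.frame(
  loan_amnt = c(6000, 2500, 5000),
  age       = c(23, 21, 144)
)

# Assumed cutoff: ages above 100 are treated as data errors.
age_cutoff <- 100

# Logical indexing keeps every row at or below the cutoff.
loan_toy_clean <- loan_toy[loan_toy$age <= age_cutoff, ]
nrow(loan_toy_clean)
```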

Missing inputs

 loan_status    loan_amnt  int_rate  grade emp_length  home_ownership annual_inc   age
...      ...          ...        ...   ...        ...            ...          ...  ...
125        0         6000      14.27     C         14       MORTGAGE        94800   23
126        1         2500       7.51     A         NA            OWN        12000   21
127        0        13500       9.91     B          2       MORTGAGE        36000   30
128        0        25000      12.42     B          2           RENT        225000  30
129        0        10000         NA     C          2           RENT        45900   65
130        0         2500      13.49     C          4           RENT        27200   26  
...      ...          ...        ...   ...        ...            ...          ...  ...
2108       0         8000       7.90     A          8           RENT        64000   24
2109       0        12000       8.90     A          0           RENT        38400   26
2110       0         4000         NA     A          7           RENT        48000   30
2111       0         7000       9.91     B         20       MORTGAGE       130000   30
2112       0         7600       6.03     A         41       MORTGAGE        70920   28
2113       0        10000      11.71     B          5           RENT        48132   22
2114       0         8000       6.62     A         17            OWN        42000   24
2115       0         4475         NA     B         NA            OWN        15000   23
2116       0         5750       8.90     A          3           RENT        17000   21
2117       0         4900       6.03     A         12       MORTGAGE        77000   27
...      ...          ...        ...   ...        ...            ...          ...  ...

Missing inputs

summary(loan_data$emp_length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   2.000   4.000   6.145   8.000  62.000     809
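
summary() reports the NA count for one variable at a time. To scan every column at once, something like the following could be used. This is a sketch on a hypothetical toy data frame that only borrows the column names from the slides.

```r
# Toy data (hypothetical) with the same column names as the slides.
loan_toy <- data.frame(
  int_rate   = c(14.27, 7.51, 9.91, NA, NA),
  emp_length = c(14, NA, 2, 2, NA),
  annual_inc = c(94800, 12000, 36000, 45900, 15000)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(loan_toy))
na_counts

# Share of missing values per column.
na_share <- colMeans(is.na(loan_toy))
na_share
```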

Missing inputs: strategies

  • Delete row/column
  • Replace
  • Keep

Delete rows

index_NA <- which(is.na(loan_data$emp_length))
loan_data_no_NA <- loan_data[-c(index_NA), ]
loan_status   loan_amnt   int_rate grade emp_length  home_ownership  annual_inc  age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
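
Negative indexing with which() has an edge case worth knowing: if the column contains no NAs, the index is empty and the negative subset returns zero rows instead of all rows. The sketch below shows this on a hypothetical toy data frame and uses a logical mask as one possible alternative.

```r
# Toy data (hypothetical) without any missing emp_length values.
loan_toy <- data.frame(
  loan_amnt  = c(6000, 13500, 25000),
  emp_length = c(14, 2, 2)
)

# which() returns integer(0) because nothing is missing.
index_NA <- which(is.na(loan_toy$emp_length))

# Negative indexing with an empty index selects no rows at all.
dropped_all <- loan_toy[-c(index_NA), ]
nrow(dropped_all)

# A logical mask keeps every row whose emp_length is present.
kept <- loan_toy[!is.na(loan_toy$emp_length), ]
nrow(kept)
```

When the column is known to contain NAs, as emp_length does here, both forms give the same result.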

Delete column

loan_data_delete_employ <- loan_data
loan_data_delete_employ$emp_length <- NULL
loan_status   loan_amnt   int_rate grade   home_ownership  annual_inc  age
...     ...         ...        ...   ...              ...          ...  ...
125       0        6000      14.27     C         MORTGAGE        94800   23
126       1        2500       7.51     A              OWN        12000   21
127       0       13500       9.91     B         MORTGAGE        36000   30
128       0       25000      12.42     B             RENT        225000  30
129       0       10000         NA     C             RENT        45900   65
130       0        2500      13.49     C             RENT        27200   26  
...     ...         ...        ...   ...              ...          ...  ...
2112      0        7600       6.03     A         MORTGAGE        70920   28
2113      0       10000      11.71     B             RENT        48132   22
2114      0        8000       6.62     A              OWN        42000   24
2115      0        4475         NA     B              OWN        15000   23
2116      0        5750       8.90     A             RENT        17000   21
...     ...         ...        ...   ...              ...          ...  ...
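
The code above copies the data frame before dropping the column, so the original stays available for the other strategies. A small sketch on hypothetical toy data that checks this behavior:

```r
# Toy data (hypothetical).
loan_toy <- data.frame(
  loan_amnt  = c(6000, 2500),
  emp_length = c(14, NA),
  age        = c(23, 21)
)

# Work on a copy; assigning NULL removes the column from the copy only.
loan_toy_delete_employ <- loan_toy
loan_toy_delete_employ$emp_length <- NULL

names(loan_toy_delete_employ)  # emp_length is gone here
names(loan_toy)                # the original still has it
```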

Replace: median imputation

index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Replace: median imputation

index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A          4            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B          4            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
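
After imputation the replaced values can no longer be told apart from observed ones. One option is to record a flag before overwriting, as in this sketch. The toy data and the flag column emp_length_imputed are hypothetical and not part of the slides.

```r
# Toy data (hypothetical).
loan_toy <- data.frame(emp_length = c(14, NA, 2, 2, 4, NA))

index_NA   <- which(is.na(loan_toy$emp_length))
emp_median <- median(loan_toy$emp_length, na.rm = TRUE)

loan_toy_replace <- loan_toy
# Hypothetical flag column: TRUE where the value was missing originally.
loan_toy_replace$emp_length_imputed <- is.na(loan_toy$emp_length)
# Overwrite the missing entries with the median of the observed values.
loan_toy_replace$emp_length[index_NA] <- emp_median

anyNA(loan_toy_replace$emp_length)
```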

Keep

  • Keep the NA
  • Problem: many models will then drop the whole row
  • Solution: coarse classification, put the variable in “bins”
    • New variable emp_cat
    • Range: 0–62 years → make bins of roughly 15 years
    • Categories: "0-15", "15-30", "30-45", "45+", "Missing"
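
The binning described above could be sketched with cut(). The slides do not say how boundary values are assigned, so the left-closed intervals below (15 falls in "15-30") are an assumption, and the input vector is hypothetical toy data.

```r
# Toy employment lengths (hypothetical), including one missing value.
emp_length <- c(14, NA, 2, 41, 17, 62)

# Assumption: intervals are closed on the left, e.g. [0, 15), [15, 30), ...
emp_cat <- cut(
  emp_length,
  breaks = c(0, 15, 30, 45, Inf),
  labels = c("0-15", "15-30", "30-45", "45+"),
  right  = FALSE
)

# cut() leaves NA as NA; switch to character to add an explicit "Missing" bin.
emp_cat <- as.character(emp_cat)
emp_cat[is.na(emp_length)] <- "Missing"
emp_cat <- factor(emp_cat, levels = c("0-15", "15-30", "30-45", "45+", "Missing"))
emp_cat
```

Because "Missing" is an ordinary factor level, a model can use these rows instead of discarding them.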

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade  emp_length  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C         14       MORTGAGE        94800   23
126       1         2500       7.51     A         NA            OWN        12000   21
127       0        13500       9.91     B          2       MORTGAGE        36000   30
128       0        25000      12.42     B          2           RENT        225000  30
129       0        10000         NA     C          2           RENT        45900   65
130       0         2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A         41       MORTGAGE        70920   28
2113      0        10000      11.71     B          5           RENT        48132   22
2114      0         8000       6.62     A         17            OWN        42000   24
2115      0         4475         NA     B         NA            OWN        15000   23
2116      0         5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade     emp_cat  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C      0-15      MORTGAGE        94800   23
126       1         2500       7.51     A   Missing           OWN        12000   21
127       0        13500       9.91     B      0-15      MORTGAGE        36000   30
128       0        25000      12.42     B      0-15          RENT        225000  30
129       0        10000         NA     C      0-15          RENT        45900   65
130       0         2500      13.49     C      0-15          RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A     30-45      MORTGAGE        70920   28
2113      0        10000      11.71     B      0-15          RENT        48132   22
2114      0         8000       6.62     A     15-30           OWN        42000   24
2115      0         4475         NA     B   Missing           OWN        15000   23
2116      0         5750       8.90     A      0-15          RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Bin frequencies

plot(loan_data$emp_cat)

emp_cat
  ...
  0-15
  Missing
  0-15
  0-15
  0-15
  0-15
  ...
  30-45
  0-15
  15-30
  Missing
  0-15
  ...
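
Calling plot() on a factor draws a bar chart of the level counts. The same counts can be read as numbers with table(), as in this sketch on a hypothetical toy factor:

```r
# Toy factor (hypothetical) using the bin labels from the slides.
emp_cat <- factor(
  c("0-15", "Missing", "0-15", "0-15", "30-45", "15-30"),
  levels = c("0-15", "15-30", "30-45", "45+", "Missing")
)

# Absolute counts per bin; empty levels are reported with a count of 0.
bin_counts <- table(emp_cat)
bin_counts

# Relative frequencies per bin.
bin_share <- prop.table(bin_counts)
bin_share
```

A bin that holds almost all observations, or almost none, suggests the bin edges deserve another look.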

Bin frequencies

plot(loan_data$emp_cat)

emp_cat
  ...
  8+
  Missing
  0-2
  0-2
  0-2
  3-4
  ...
  8+
  5-8
  8+
  Missing
  3-4
  ...
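
The finer bins shown here appear to follow the quartiles reported by summary() earlier (2, 4 and 8), which spreads the observations more evenly. A possible sketch with cut() follows. The labels "5-8" and "8+" overlap at 8, and the slides do not say where 8 belongs, so placing it in "5-8" is an assumption. The input vector is hypothetical.

```r
# Toy employment lengths (hypothetical).
emp_length <- c(14, NA, 2, 4, 41, 5, 17, 3)

# Assumed edges: [0, 3), [3, 5), [5, 9), [9, Inf), so 8 lands in "5-8".
emp_cat <- cut(
  emp_length,
  breaks = c(0, 3, 5, 9, Inf),
  labels = c("0-2", "3-4", "5-8", "8+"),
  right  = FALSE
)

# Add the explicit "Missing" bin for NA inputs.
emp_cat <- as.character(emp_cat)
emp_cat[is.na(emp_length)] <- "Missing"
emp_cat <- factor(emp_cat, levels = c("0-2", "3-4", "5-8", "8+", "Missing"))
emp_cat
```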

Final remarks

  • Treat outliers as NAs
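
Treating an outlier as an NA means blanking the suspicious value rather than deleting the row, after which any of the missing-data strategies applies. A minimal sketch, assuming a hypothetical age vector and an illustrative cutoff of 100 years:

```r
# Toy ages (hypothetical); 144 mimics an implausible record.
age <- c(23, 21, 144, 30)

# Assumed cutoff: ages above 100 are treated as data errors.
age_cutoff <- 100

# Replace outlying values by NA; which() ignores entries that are already NA.
age_clean <- age
age_clean[which(age_clean > age_cutoff)] <- NA
age_clean
```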

          CONTINUOUS                              CATEGORICAL
DELETE    Delete rows (observations with NAs)     Delete rows (observations with NAs)
          Delete column (entire variable)         Delete column (entire variable)
REPLACE   Replace using median                    Replace using most frequent category
KEEP      Keep as NA (not always possible)        NA category
          Keep using coarse classification
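
For the categorical REPLACE case, the most frequent category can be found with table(). The sketch below uses a hypothetical home_ownership vector; on ties, which.max() picks the first level in table order.

```r
# Toy categorical variable (hypothetical) with two missing values.
home_ownership <- c("RENT", "MORTGAGE", NA, "RENT", "OWN", NA)

# table() ignores NA by default, so only observed categories are counted.
freq <- table(home_ownership)
most_frequent <- names(freq)[which.max(freq)]

# Fill the missing entries with the most frequent category.
home_replaced <- home_ownership
home_replaced[is.na(home_replaced)] <- most_frequent
home_replaced
```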

Let's practice!
