Missing data and coarse classification

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Outlier deleted

loan_status  loan_amnt  int_rate  grade   emp_length  home_ownership   annual_inc   age
     0         5000      12.73      C         12         MORTGAGE       6000000     144

Missing inputs

 loan_status    loan_amnt  int_rate  grade emp_length  home_ownership annual_inc   age
...      ...          ...        ...   ...        ...            ...          ...  ...
125        0         6000      14.27     C         14       MORTGAGE        94800   23
126        1         2500       7.51     A         NA            OWN        12000   21
127        0        13500       9.91     B          2       MORTGAGE        36000   30
128        0        25000      12.42     B          2           RENT        225000  30
129        0        10000         NA     C          2           RENT        45900   65
130        0         2500      13.49     C          4           RENT        27200   26  
...      ...          ...        ...   ...        ...            ...          ...  ...
2108       0         8000       7.90     A          8           RENT        64000   24
2109       0        12000       8.90     A          0           RENT        38400   26
2110       0         4000         NA     A          7           RENT        48000   30
2111       0         7000       9.91     B         20       MORTGAGE       130000   30
2112       0         7600       6.03     A         41       MORTGAGE        70920   28
2113       0        10000      11.71     B          5           RENT        48132   22
2114       0         8000       6.62     A         17            OWN        42000   24
2115       0         4475         NA     B         NA            OWN        15000   23
2116       0         5750       8.90     A          3           RENT        17000   21
2117       0         4900       6.03     A         12       MORTGAGE        77000   27
...      ...          ...        ...   ...        ...            ...          ...  ...

Missing inputs

summary(loan_data$emp_length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   2.000   4.000   6.145   8.000  62.000     809
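
The summary above covers a single column. To see how missing values are spread over every column at once, something like the following sketch may help. It uses a small toy data frame (`toy`, with made-up values) as a stand-in for `loan_data`.

```r
# Toy stand-in for loan_data; the values are made up for illustration.
toy <- data.frame(
  int_rate   = c(14.27, 7.51, NA, 12.42, NA),
  emp_length = c(14, NA, 2, 2, NA),
  annual_inc = c(94800, 12000, 36000, 225000, 45900)
)

# Number of NAs in each column.
na_count <- colSums(is.na(toy))

# Share of rows with an NA in each column.
na_share <- colMeans(is.na(toy))

print(na_count)
print(round(na_share, 2))
```

A column with a large share of NAs is a candidate for deletion or coarse classification; a column with only a few may be easier to handle row by row.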

Missing inputs: strategies

  • Delete row/column
  • Replace
  • Keep

Delete rows

index_NA <- which(is.na(loan_data$emp_length))
loan_data_no_NA <- loan_data[-c(index_NA), ]
loan_status   loan_amnt   int_rate grade emp_length  home_ownership  annual_inc  age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
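
One caveat with negative indexing through `which()`: when a column contains no NAs, the index vector is empty, and dropping rows with an empty negative index returns zero rows rather than all of them. A logical filter avoids that edge case. The sketch below illustrates both behaviors on hypothetical toy data frames (`toy` and `no_gaps` are not part of the course data).

```r
# Toy data with two missing emp_length values (made-up numbers).
toy <- data.frame(
  loan_status = c(0, 1, 0, 0),
  emp_length  = c(14, NA, 2, NA)
)

# Keep only rows where emp_length is observed.
toy_no_NA <- toy[!is.na(toy$emp_length), ]

# Edge case: a column without any NAs.
no_gaps <- data.frame(emp_length = c(1, 2, 3))
idx <- which(is.na(no_gaps$emp_length))   # integer(0)

# Negative indexing with an empty index drops every row.
dropped_all <- no_gaps[-idx, , drop = FALSE]

# The logical filter keeps all rows, as intended.
kept_all <- no_gaps[!is.na(no_gaps$emp_length), , drop = FALSE]

print(nrow(toy_no_NA))
print(nrow(dropped_all))
print(nrow(kept_all))
```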

Delete column

loan_data_delete_employ <- loan_data
loan_data_delete_employ$emp_length <- NULL
loan_status   loan_amnt   int_rate grade   home_ownership  annual_inc  age
...     ...         ...        ...   ...              ...          ...  ...
125       0        6000      14.27     C         MORTGAGE        94800   23
126       1        2500       7.51     A              OWN        12000   21
127       0       13500       9.91     B         MORTGAGE        36000   30
128       0       25000      12.42     B             RENT        225000  30
129       0       10000         NA     C             RENT        45900   65
130       0        2500      13.49     C             RENT        27200   26  
...     ...         ...        ...   ...              ...          ...  ...
2112      0        7600       6.03     A         MORTGAGE        70920   28
2113      0       10000      11.71     B             RENT        48132   22
2114      0        8000       6.62     A              OWN        42000   24
2115      0        4475         NA     B              OWN        15000   23
2116      0        5750       8.90     A             RENT        17000   21
...     ...         ...        ...   ...              ...          ...  ...

Replace: median imputation

index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A         NA            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B         NA            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Replace: median imputation

index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt  int_rate  grade  emp_length home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0        6000      14.27     C         14       MORTGAGE        94800   23
126       1        2500       7.51     A          4            OWN        12000   21
127       0       13500       9.91     B          2       MORTGAGE        36000   30
128       0       25000      12.42     B          2           RENT        225000  30
129       0       10000         NA     C          2           RENT        45900   65
130       0        2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0        7600       6.03     A         41       MORTGAGE        70920   28
2113      0       10000      11.71     B          5           RENT        48132   22
2114      0        8000       6.62     A         17            OWN        42000   24
2115      0        4475         NA     B          4            OWN        15000   23
2116      0        5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...
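
After median imputation the data no longer shows which values were originally missing. If that information might matter later, one option (an addition here, not something the slides do) is to record a flag before replacing. A minimal sketch on made-up values, where the column name `emp_length_was_NA` is hypothetical:

```r
# Toy column with two missing values (made-up numbers).
toy <- data.frame(emp_length = c(14, NA, 2, 2, 4, NA))

# Remember which entries were missing before imputing.
toy$emp_length_was_NA <- is.na(toy$emp_length)

# Median of the observed values 2, 2, 4, 14.
med <- median(toy$emp_length, na.rm = TRUE)

# Replace the missing entries with that median.
toy$emp_length[toy$emp_length_was_NA] <- med

print(toy)
```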

Keep

  • Keep NA
  • Problem: many models will delete the row
  • Solution: coarse classification, put the variable in "bins"
    • New variable emp_cat
    • Range: 0-62 years → make bins of roughly 15 years
    • Categories: "0-15", "15-30", "30-45", "45+", "Missing"
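
The binning described in the list above could be sketched with `cut()` as follows. This is one possible implementation, not necessarily the one used in the course: the toy values are made up, and the choice of which bin a boundary value such as 15 falls into (here the lower bin) is an assumption.

```r
# Toy emp_length values, including a missing one (made-up numbers).
toy <- data.frame(emp_length = c(14, NA, 2, 41, 17, 62, 0))

# Assumed bin edges: [0,15], (15,30], (30,45], (45,Inf).
emp_cat <- cut(
  toy$emp_length,
  breaks = c(0, 15, 30, 45, Inf),
  labels = c("0-15", "15-30", "30-45", "45+"),
  include.lowest = TRUE
)

# Turn NA into an explicit "Missing" category.
emp_cat <- as.character(emp_cat)
emp_cat[is.na(toy$emp_length)] <- "Missing"
toy$emp_cat <- factor(
  emp_cat,
  levels = c("0-15", "15-30", "30-45", "45+", "Missing")
)

print(toy)
```

Because "Missing" is an ordinary factor level, a model that would otherwise drop rows with NA can keep them.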

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade  emp_length  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C         14       MORTGAGE        94800   23
126       1         2500       7.51     A         NA            OWN        12000   21
127       0        13500       9.91     B          2       MORTGAGE        36000   30
128       0        25000      12.42     B          2           RENT        225000  30
129       0        10000         NA     C          2           RENT        45900   65
130       0         2500      13.49     C          4           RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A         41       MORTGAGE        70920   28
2113      0        10000      11.71     B          5           RENT        48132   22
2114      0         8000       6.62     A         17            OWN        42000   24
2115      0         4475         NA     B         NA            OWN        15000   23
2116      0         5750       8.90     A          3           RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Keep: coarse classification

loan_status  loan_amnt  int_rate  grade     emp_cat  home_ownership annual_inc age
...     ...         ...        ...   ...        ...            ...          ...  ...
125       0         6000      14.27     C      0-15      MORTGAGE        94800   23
126       1         2500       7.51     A   Missing           OWN        12000   21
127       0        13500       9.91     B      0-15      MORTGAGE        36000   30
128       0        25000      12.42     B      0-15          RENT        225000  30
129       0        10000         NA     C      0-15          RENT        45900   65
130       0         2500      13.49     C      0-15          RENT        27200   26  
...     ...         ...        ...   ...        ...            ...          ...  ...
2112      0         7600       6.03     A     30-45      MORTGAGE        70920   28
2113      0        10000      11.71     B      0-15          RENT        48132   22
2114      0         8000       6.62     A     15-30           OWN        42000   24
2115      0         4475         NA     B   Missing           OWN        15000   23
2116      0         5750       8.90     A      0-15          RENT        17000   21
...     ...         ...        ...   ...        ...            ...          ...  ...

Bin frequencies

plot(loan_data$emp_cat)

[Figure: bar chart of emp_cat bin frequencies]

emp_cat
  ...
  0-15
  Missing
  0-15
  0-15
  0-15
  0-15
  ...
  30-45
  0-15
  15-30
  Missing
  0-15
  ...

Bin frequencies

plot(loan_data$emp_cat)

[Figure: bar chart of emp_cat bin frequencies]

emp_cat
  ...
  8+
  Missing
  0-2
  0-2
  0-2
  3-4
  ...
  8+
  5-8
  8+
  Missing
  3-4
  ...
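
The two versions of emp_cat shown on these slides differ only in their bin edges. The sketch below compares the resulting bin frequencies for the eleven emp_length values visible in the earlier tables. The helper `bin_emp_length()` is hypothetical, and the bin a boundary value such as 8 falls into (here "5-8") is an assumption.

```r
# Hypothetical helper: bin a numeric vector and add a "Missing" level.
bin_emp_length <- function(x, breaks, labels) {
  out <- as.character(
    cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
  )
  out[is.na(x)] <- "Missing"
  factor(out, levels = c(labels, "Missing"))
}

# emp_length values of the rows shown in the tables above.
emp_length <- c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)

# Wide bins of about 15 years.
wide <- bin_emp_length(
  emp_length,
  breaks = c(0, 15, 30, 45, Inf),
  labels = c("0-15", "15-30", "30-45", "45+")
)

# Narrower bins that spread the observations more evenly.
fine <- bin_emp_length(
  emp_length,
  breaks = c(0, 2, 4, 8, Inf),
  labels = c("0-2", "3-4", "5-8", "8+")
)

print(table(wide))
print(table(fine))
```

With the wide bins most observations land in "0-15", while the narrower bins distribute them more evenly across categories.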

Final remarks

  • Treat outliers as NAs
          CONTINUOUS                            CATEGORICAL
DELETE    Delete row (observations with NAs)    Delete row (observations with NAs)
          Delete column (entire variable)       Delete column (entire variable)
REPLACE   Replace using median                  Replace using most frequent category
KEEP      Keep as NA (not always possible)      NA category
          Keep using coarse classification
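
Two of the points above, treating an outlier as NA and replacing a missing category with the most frequent one, could be sketched as follows. The toy data is made up, and the age cutoff of 100 is an assumption chosen only for illustration.

```r
# Toy data: one implausible age and one missing category (made-up values).
toy <- data.frame(
  age            = c(23, 21, 144, 30),
  home_ownership = factor(c("RENT", NA, "RENT", "OWN"))
)

# Treat an implausible age as missing (the cutoff of 100 is an assumption).
toy$age[which(toy$age > 100)] <- NA

# Most frequent observed category; table() ignores NA by default.
counts <- table(toy$home_ownership)
most_frequent <- names(counts)[which.max(counts)]

# Replace the missing category with the most frequent one.
toy$home_ownership[is.na(toy$home_ownership)] <- most_frequent

print(toy)
```

Once the outlier is an NA, any of the delete, replace, or keep strategies from the table can be applied to it.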

Let's practice!
