Credit Risk Modeling in R
Lore Dirick
Manager of Data Science Curriculum at Flatiron School
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
0 5000 12.73 C 12 MORTGAGE 6000000 144
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
... ... ... ... ... ... ... ... ...
125 0 6000 14.27 C 14 MORTGAGE 94800 23
126 1 2500 7.51 A NA OWN 12000 21
127 0 13500 9.91 B 2 MORTGAGE 36000 30
128 0 25000 12.42 B 2 RENT 225000 30
129 0 10000 NA C 2 RENT 45900 65
130 0 2500 13.49 C 4 RENT 27200 26
... ... ... ... ... ... ... ... ...
2108 0 8000 7.90 A 8 RENT 64000 24
2109 0 12000 8.90 A 0 RENT 38400 26
2110 0 4000 NA A 7 RENT 48000 30
2111 0 7000 9.91 B 20 MORTGAGE 130000 30
2112 0 7600 6.03 A 41 MORTGAGE 70920 28
2113 0 10000 11.71 B 5 RENT 48132 22
2114 0 8000 6.62 A 17 OWN 42000 24
2115 0 4475 NA B NA OWN 15000 23
2116 0 5750 8.90 A 3 RENT 17000 21
2117 0 4900 6.03 A 12 MORTGAGE 77000 27
… … … … … … … … …
summary(loan_data$emp_length)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 2.000 4.000 6.145 8.000 62.000 809
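Before choosing a strategy, it can help to see how many values are missing in each column. A minimal sketch, assuming a small made-up data frame `toy` that stands in for `loan_data` (the values are illustrative, not taken from the course data):

```r
# Toy data frame standing in for loan_data (values are made up)
toy <- data.frame(
  int_rate   = c(14.27, 7.51, NA, 12.42),
  emp_length = c(14, NA, 2, NA),
  annual_inc = c(94800, 12000, 36000, 225000)
)

# is.na() on a data frame returns a logical matrix;
# column sums give NA counts, column means give NA shares
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))

na_count
na_share
```

A column with only a handful of NAs is a candidate for deleting rows or replacing values; a column that is mostly NA may be a candidate for deletion.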
# Row indices of observations with a missing employment length
index_NA <- which(is.na(loan_data$emp_length))
# Drop those rows
loan_data_no_NA <- loan_data[-c(index_NA), ]
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
... ... ... ... ... ... ... ... ...
125 0 6000 14.27 C 14 MORTGAGE 94800 23
126 1 2500 7.51 A NA OWN 12000 21
127 0 13500 9.91 B 2 MORTGAGE 36000 30
128 0 25000 12.42 B 2 RENT 225000 30
129 0 10000 NA C 2 RENT 45900 65
130 0 2500 13.49 C 4 RENT 27200 26
... ... ... ... ... ... ... ... ...
2112 0 7600 6.03 A 41 MORTGAGE 70920 28
2113 0 10000 11.71 B 5 RENT 48132 22
2114 0 8000 6.62 A 17 OWN 42000 24
2115 0 4475 NA B NA OWN 15000 23
2116 0 5750 8.90 A 3 RENT 17000 21
... ... ... ... ... ... ... ... ...
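One caveat with negative indexing: if a column has no NAs, `which()` returns an empty vector and `df[-integer(0), ]` selects zero rows instead of all of them. A sketch of a logical-indexing alternative, using a made-up data frame `toy` (an assumption, not the course data):

```r
# Toy data frame standing in for loan_data (values are made up)
toy <- data.frame(
  loan_status = c(0, 1, 0, 0),
  emp_length  = c(14, NA, 2, NA)
)

# Keep every row whose emp_length is observed.
# This also behaves correctly when the column contains no NAs at all.
toy_no_NA <- toy[!is.na(toy$emp_length), ]

nrow(toy)        # rows before deletion
nrow(toy_no_NA)  # rows after deletion
```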
loan_data_delete_employ <- loan_data
loan_data_delete_employ$emp_length <- NULL
loan_status loan_amnt int_rate grade home_ownership annual_inc age
... ... ... ... ... ... ... ...
125 0 6000 14.27 C MORTGAGE 94800 23
126 1 2500 7.51 A OWN 12000 21
127 0 13500 9.91 B MORTGAGE 36000 30
128 0 25000 12.42 B RENT 225000 30
129 0 10000 NA C RENT 45900 65
130 0 2500 13.49 C RENT 27200 26
... ... ... ... ... ... ... ...
2112 0 7600 6.03 A MORTGAGE 70920 28
2113 0 10000 11.71 B RENT 48132 22
2114 0 8000 6.62 A OWN 42000 24
2115 0 4475 NA B OWN 15000 23
2116 0 5750 8.90 A RENT 17000 21
... ... ... ... ... ... ... ...
index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
... ... ... ... ... ... ... ... ...
125 0 6000 14.27 C 14 MORTGAGE 94800 23
126 1 2500 7.51 A NA OWN 12000 21
127 0 13500 9.91 B 2 MORTGAGE 36000 30
128 0 25000 12.42 B 2 RENT 225000 30
129 0 10000 NA C 2 RENT 45900 65
130 0 2500 13.49 C 4 RENT 27200 26
... ... ... ... ... ... ... ... ...
2112 0 7600 6.03 A 41 MORTGAGE 70920 28
2113 0 10000 11.71 B 5 RENT 48132 22
2114 0 8000 6.62 A 17 OWN 42000 24
2115 0 4475 NA B NA OWN 15000 23
2116 0 5750 8.90 A 3 RENT 17000 21
... ... ... ... ... ... ... ... ...
index_NA <- which(is.na(loan_data$emp_length))
loan_data_replace <- loan_data
loan_data_replace$emp_length[index_NA] <- median(loan_data$emp_length, na.rm = TRUE)
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
... ... ... ... ... ... ... ... ...
125 0 6000 14.27 C 14 MORTGAGE 94800 23
126 1 2500 7.51 A 4 OWN 12000 21
127 0 13500 9.91 B 2 MORTGAGE 36000 30
128 0 25000 12.42 B 2 RENT 225000 30
129 0 10000 NA C 2 RENT 45900 65
130 0 2500 13.49 C 4 RENT 27200 26
... ... ... ... ... ... ... ... ...
2112 0 7600 6.03 A 41 MORTGAGE 70920 28
2113 0 10000 11.71 B 5 RENT 48132 22
2114 0 8000 6.62 A 17 OWN 42000 24
2115 0 4475 NA B 4 OWN 15000 23
2116 0 5750 8.90 A 3 RENT 17000 21
... ... ... ... ... ... ... ... ...
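The same replacement can be applied to other continuous columns with NAs, such as `int_rate`. A minimal sketch, assuming a hypothetical helper `impute_median()` and a made-up data frame `toy` (neither is part of the course code):

```r
# Hypothetical helper: replace NAs in a numeric vector
# with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data frame standing in for loan_data (values are made up)
toy <- data.frame(
  emp_length = c(14, NA, 2, 4, 5, NA, 3),
  int_rate   = c(14.27, 7.51, 9.91, NA, 13.49, 12.42, NA)
)

# Work on a copy so the original stays available for comparison
toy_replace <- toy
toy_replace$emp_length <- impute_median(toy$emp_length)
toy_replace$int_rate   <- impute_median(toy$int_rate)

toy_replace
```

In this toy example the observed median of `emp_length` is 4, so both missing employment lengths become 4, as in the table above.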
Keep NAs: coarse classification of emp_length into a new variable emp_cat, with a separate category for NA
loan_status loan_amnt int_rate grade emp_length home_ownership annual_inc age
... ... ... ... ... ... ... ... ...
125 0 6000 14.27 C 14 MORTGAGE 94800 23
126 1 2500 7.51 A NA OWN 12000 21
127 0 13500 9.91 B 2 MORTGAGE 36000 30
128 0 25000 12.42 B 2 RENT 225000 30
129 0 10000 NA C 2 RENT 45900 65
130 0 2500 13.49 C 4 RENT 27200 26
... ... ... ... ... ... ... ... ...
2112 0 7600 6.03 A 41 MORTGAGE 70920 28
2113 0 10000 11.71 B 5 RENT 48132 22
2114 0 8000 6.62 A 17 OWN 42000 24
2115 0 4475 NA B NA OWN 15000 23
2116 0 5750 8.90 A 3 RENT 17000 21
... ... ... ... ... ... ... ... ...
loan_status loan_amnt int_rate grade emp_cat home_ownership annual_inc age
... ... ... ... ... ... ... ... ...
125 0 6000 14.27 C 0-15 MORTGAGE 94800 23
126 1 2500 7.51 A Missing OWN 12000 21
127 0 13500 9.91 B 0-15 MORTGAGE 36000 30
128 0 25000 12.42 B 0-15 RENT 225000 30
129 0 10000 NA C 0-15 RENT 45900 65
130 0 2500 13.49 C 0-15 RENT 27200 26
... ... ... ... ... ... ... ... ...
2112 0 7600 6.03 A 30-45 MORTGAGE 70920 28
2113 0 10000 11.71 B 0-15 RENT 48132 22
2114 0 8000 6.62 A 15-30 OWN 42000 24
2115 0 4475 NA B Missing OWN 15000 23
2116 0 5750 8.90 A 0-15 RENT 17000 21
... ... ... ... ... ... ... ... ...
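The slides show the result of the coarse classification but not the code that produces it. One way to build `emp_cat` is sketched below with `cut()`. The bin edges (left-closed intervals), the `45+` label for values beyond the bins shown, and the made-up data frame `toy` are all assumptions:

```r
# Toy data frame standing in for loan_data (values are made up)
toy <- data.frame(
  loan_status = c(0, 1, 0, 0, 0, 0),
  emp_length  = c(14, NA, 2, 41, 17, 62)
)

# Assumed bins: [0,15), [15,30), [30,45), [45,Inf)
emp_cat <- cut(
  toy$emp_length,
  breaks = c(0, 15, 30, 45, Inf),
  labels = c("0-15", "15-30", "30-45", "45+"),
  right  = FALSE
)

# cut() returns NA for missing inputs; turn those into an explicit category
emp_cat <- as.character(emp_cat)
emp_cat[is.na(emp_cat)] <- "Missing"
toy$emp_cat <- factor(
  emp_cat,
  levels = c("0-15", "15-30", "30-45", "45+", "Missing")
)

# The binned variable replaces the original column
toy$emp_length <- NULL
toy
```

Because "Missing" is an ordinary factor level, no rows are lost and no values are invented.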
plot(loan_data$emp_cat)
emp_cat
...
0-15
Missing
0-15
0-15
0-15
0-15
...
30-45
0-15
15-30
Missing
0-15
...
plot(loan_data$emp_cat)
emp_cat
...
8+
Missing
0-2
0-2
0-2
3-4
...
8+
5-8
8+
Missing
3-4
...
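The finer bins above could be produced along the following lines. This is a sketch under assumptions: `bin_with_missing()` is a hypothetical helper, employment lengths are taken to be whole years, and a value of exactly 8 is placed in `5-8`, since the slides do not say where that boundary falls.

```r
# Hypothetical helper: bin a numeric vector and give NAs their own "Missing" level.
# Intervals are left-closed: [breaks[i], breaks[i + 1])
bin_with_missing <- function(x, breaks, labels) {
  binned <- as.character(cut(x, breaks = breaks, labels = labels, right = FALSE))
  binned[is.na(binned)] <- "Missing"
  factor(binned, levels = c(labels, "Missing"))
}

# Made-up employment lengths in years, with two NAs
emp_length <- c(14, NA, 2, 2, 2, 4, 41, 5, 17, NA, 3)

# Assumed edges for the finer bins (whole years)
emp_cat <- bin_with_missing(
  emp_length,
  breaks = c(0, 3, 5, 9, Inf),
  labels = c("0-2", "3-4", "5-8", "8+")
)

# Counts per bin: the heights of the bars that plot(emp_cat) draws
table(emp_cat)
```

Comparing the counts for different sets of breaks is a quick way to check that no single bin holds most of the observations.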
| | CONTINUOUS | CATEGORICAL |
---|---|---|
DELETE | Delete rows (observations with NAs); delete column (entire variable) | Delete rows (observations with NAs); delete column (entire variable) |
REPLACE | Replace using median | Replace using most frequent category |
KEEP | Keep as NA (not always possible); keep using coarse classification | NA category |
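For a categorical variable, "replace using most frequent category" might look like the sketch below. The made-up data frame `toy` and its NAs in `home_ownership` are assumptions for illustration; the rows shown in the slides have no missing `home_ownership` values.

```r
# Toy data frame with a categorical column containing NAs (values are made up)
toy <- data.frame(
  home_ownership = c("RENT", "MORTGAGE", NA, "RENT", "OWN", NA),
  stringsAsFactors = TRUE
)

# table() ignores NAs by default, so this counts observed categories only
counts <- table(toy$home_ownership)
most_frequent <- names(counts)[which.max(counts)]

# Replace the NAs in a copy with the most frequent category
toy_replace <- toy
toy_replace$home_ownership[is.na(toy_replace$home_ownership)] <- most_frequent

table(toy_replace$home_ownership)
```

If two categories tie for the highest count, `which.max()` picks the first one in level order, which is worth checking on real data.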