Credit Risk Modeling in R
Lore Dirick
Manager of Data Science Curriculum at Flatiron School
hist(loan_data$int_rate)
hist(loan_data$int_rate, main = "Histogram of interest rate", xlab = "Interest rate")
hist(loan_data$annual_inc, xlab = "Annual Income", main = "Histogram of Annual Income")
hist_income <- hist(loan_data$annual_inc,
                    xlab = "Annual Income",
                    main = "Histogram of Annual Income")
hist_income$breaks
0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 ...
n_breaks <- sqrt(nrow(loan_data)) # n_breaks = 170.5638
hist_income_n <- hist(loan_data$annual_inc, breaks = n_breaks,
                      xlab = "Annual Income", main = "Histogram of Annual Income")
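A small sketch of how the `breaks` argument behaves, using simulated data rather than `loan_data` (the vector `income_sim` and its log-normal parameters are assumptions made purely for illustration). When `breaks` is a single number, `hist()` treats it as a suggestion and picks "pretty" break points, so the actual number of bins may differ from `sqrt(n)`.

```r
# Simulated, right-skewed incomes (illustrative only, not the course data)
set.seed(1)
income_sim <- rlnorm(1000, meanlog = 11, sdlog = 0.5)

# Square-root rule for the suggested number of breaks
n_breaks_sim <- sqrt(length(income_sim))

# plot = FALSE returns the histogram object without drawing it
h_sim <- hist(income_sim, breaks = n_breaks_sim, plot = FALSE)

# Number of bins actually used; may differ from n_breaks_sim
n_bins_used <- length(h_sim$breaks) - 1
n_bins_used
```

Comparing `n_bins_used` with `n_breaks_sim` shows how far the suggestion was adjusted.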
plot(loan_data$annual_inc, ylab = "Annual Income")
When is a value an outlier?
Rule of thumb, e.g., an observation larger than Q3 + 1.5 * IQR
# Find outlier using an expert-chosen cutoff (annual income above 3 million)
index_outlier_expert <- which(loan_data$annual_inc > 3000000)
# Remove outlier from dataset
loan_data_expert <- loan_data[-index_outlier_expert, ]
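One caveat with the negative-index pattern is worth a quick sketch: if `which()` finds no outliers it returns an empty integer vector, and indexing with `-integer(0)` drops every row instead of none. The toy data frame `df_toy` below is hypothetical, and the logical-index alternative is one possible workaround, not the course's method.

```r
# Toy data with no value above the cutoff (hypothetical example)
df_toy <- data.frame(id = 1:5,
                     annual_inc = c(40000, 52000, 61000, 75000, 98000))

# which() finds nothing, so idx is integer(0)
idx <- which(df_toy$annual_inc > 3000000)

# Negative indexing with an empty vector selects zero rows
n_neg <- nrow(df_toy[-idx, ])   # 0

# Logical indexing keeps all rows that are not outliers
keep <- df_toy$annual_inc <= 3000000
n_log <- nrow(df_toy[keep, ])   # 5
```

Note that the logical version assumes `annual_inc` has no missing values; with `NA`s present, an explicit `!is.na()` condition would be needed.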
# Calculate Q3 + 1.5 * IQR
outlier_cutoff <- quantile(loan_data$annual_inc, 0.75) + 1.5 * IQR(loan_data$annual_inc)
# Identify outliers
index_outlier_ROT <- which(loan_data$annual_inc > outlier_cutoff)
# Remove outliers
loan_data_ROT <- loan_data[-index_outlier_ROT, ]
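To make the rule of thumb concrete, here is a worked example on a tiny made-up vector (`income_toy` is an assumption for illustration, not taken from `loan_data`). It uses R's default quantile definition (type 7).

```r
# Ten made-up incomes (in thousands), one of them extreme
income_toy <- c(10, 12, 14, 15, 18, 20, 22, 25, 30, 200)

q3_toy  <- quantile(income_toy, 0.75)   # 24.25
iqr_toy <- IQR(income_toy)              # 24.25 - 14.25 = 10

# Upper cutoff: Q3 + 1.5 * IQR = 24.25 + 15 = 39.25
cutoff_toy <- q3_toy + 1.5 * iqr_toy

# Only the value 200 (position 10) exceeds the cutoff
idx_toy <- which(income_toy > cutoff_toy)

# Guard against the empty-index case before removing
income_clean <- if (length(idx_toy) > 0) income_toy[-idx_toy] else income_toy
```

The value 30 stays in the data because it lies below the cutoff of 39.25, even though it is the second-largest observation.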
hist(loan_data_expert$annual_inc,
     breaks = sqrt(nrow(loan_data_expert)),
     xlab = "Annual income")
hist(loan_data_ROT$annual_inc,
     breaks = sqrt(nrow(loan_data_ROT)),
     xlab = "Annual income")
plot(loan_data$emp_length, loan_data$annual_inc,
     xlab = "Employment length", ylab = "Annual income")