Histograms and outliers

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Using function hist()

hist(loan_data$int_rate)

Histogram of interest rate


Using function hist()

hist(loan_data$int_rate, main = "Histogram of interest rate", xlab = "Interest rate")

Histogram of interest rate


Using function hist() on annual_inc

hist(loan_data$annual_inc, xlab = "Annual Income", main = "Histogram of Annual Income")


Using function hist() on annual_inc

hist_income <- hist(loan_data$annual_inc,
                    xlab = "Annual Income",
                    main = "Histogram of Annual Income")
hist_income$breaks
0  500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 ...

The breaks-argument

n_breaks <- sqrt(nrow(loan_data)) # n_breaks = 170.5638
hist_income_n <- hist(loan_data$annual_inc, breaks = n_breaks, 
                      xlab = "Annual Income", main = "Histogram of Annual Income")
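A minimal sketch of what the breaks argument does, run on simulated incomes rather than loan_data (the vector toy_income and its values are hypothetical). It suggests that a single number passed to breaks is treated as a suggestion, so the number of bins actually used can differ from the square root.

```r
# Simulated incomes standing in for loan_data$annual_inc (hypothetical values).
set.seed(1)
toy_income <- c(rlnorm(999, meanlog = 11, sdlog = 0.5), 6000000)

# Square-root rule of thumb for the number of bins.
n_breaks <- sqrt(length(toy_income))

# plot = FALSE returns the histogram object without drawing it.
h <- hist(toy_income, breaks = n_breaks, plot = FALSE)

# R rounds the break points to "pretty" values, so the bin count
# is not guaranteed to equal n_breaks.
n_bins_used <- length(h$breaks) - 1

# Every observation falls in exactly one bin.
n_counted <- sum(h$counts)
```

Comparing n_bins_used with n_breaks shows how far the plotted histogram departs from the requested number of bins.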


annual_inc

plot(loan_data$annual_inc, ylab = "Annual Income")



Outliers

  • When is a value an outlier?

    • Expert judgment
    • Rule of thumb, e.g.,

      • Smaller than Q1 - 1.5 * IQR
      • Bigger than Q3 + 1.5 * IQR
    • Mostly: combination of both
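The rule of thumb above can be sketched on a small vector. The values in x are hypothetical, and quantile() is used with its default settings.

```r
# Hypothetical values; 95 is far from the rest.
x <- c(10, 12, 13, 15, 16, 18, 20, 95)

# First and third quartiles (default quantile type).
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)

# Interquartile range: Q3 - Q1.
iqr <- IQR(x)

# Lower and upper cutoffs from the rule of thumb.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the cutoffs are flagged as outliers.
outliers <- x[x < lower_fence | x > upper_fence]
outliers  # 95
```

Whether a flagged value is really an error is still a matter of expert judgment, which is why the two approaches are usually combined.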

Expert judgment

"Annual salaries > $3 million are outliers"


# Find outlier
index_outlier_expert <- which(loan_data$annual_inc > 3000000)

# Remove outlier from dataset
loan_data_expert <- loan_data[-index_outlier_expert, ]
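One caveat with negative indexing, sketched on a hypothetical data frame (toy_loans and its columns are made up for illustration): if which() finds no outliers, it returns an empty index, and subsetting with that empty index drops every row. A logical filter is one way to avoid this, assuming there are no missing incomes.

```r
# Hypothetical data with no income above the cutoff.
toy_loans <- data.frame(annual_inc = c(40000, 52000, 61000, 75000),
                        emp_length = c(2, 5, 10, 1))
cutoff <- 3000000

# which() returns integer(0) because nothing exceeds the cutoff.
index_outlier <- which(toy_loans$annual_inc > cutoff)

# Negating an empty index selects nothing, so all rows are lost.
dropped_all <- toy_loans[-index_outlier, ]
nrow(dropped_all)  # 0

# A logical filter keeps all rows when there are no outliers.
# Assumes annual_inc has no NA values.
toy_clean <- toy_loans[!(toy_loans$annual_inc > cutoff), ]
nrow(toy_clean)    # 4
```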

Rule of thumb

Outlier if bigger than Q3 + 1.5 * IQR


# Calculate Q3 + 1.5 * IQR
outlier_cutoff <- quantile(loan_data$annual_inc, 0.75) + 1.5 * IQR(loan_data$annual_inc)

# Identify outliers
index_outlier_ROT <- which(loan_data$annual_inc > outlier_cutoff)

# Remove outliers
loan_data_ROT <- loan_data[-index_outlier_ROT, ]
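# Histogram after removing the expert-judgment outlier
hist(loan_data_expert$annual_inc,
     breaks = sqrt(nrow(loan_data_expert)),
     xlab = "Annual income")

# Histogram after removing the rule-of-thumb outliers
hist(loan_data_ROT$annual_inc,
     breaks = sqrt(nrow(loan_data_ROT)),
     xlab = "Annual income")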
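To see why the two histograms can look so different, the sketch below counts how many observations each rule flags. It uses simulated incomes (toy_income is hypothetical), so the counts will not match loan_data.

```r
# Simulated right-skewed incomes plus one extreme value (hypothetical).
set.seed(42)
toy_income <- c(rlnorm(500, meanlog = 11, sdlog = 0.6), 6000000)

# Expert judgment: a fixed cutoff at $3 million.
expert_cutoff <- 3000000

# Rule of thumb: Q3 + 1.5 * IQR.
rot_cutoff <- quantile(toy_income, 0.75, names = FALSE) +
  1.5 * IQR(toy_income)

# Number of observations flagged by each rule.
n_expert <- sum(toy_income > expert_cutoff)
n_rot    <- sum(toy_income > rot_cutoff)

c(expert = n_expert, rule_of_thumb = n_rot)
```

On skewed data like this, the rule-of-thumb cutoff sits well below the expert cutoff, so it tends to remove more observations.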


Bivariate plot

plot(loan_data$emp_length, loan_data$annual_inc, 
     xlab= "Employment length", ylab= "Annual income")



Let's practice!
