Logistic regression: introduction

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Final data structure

str(training_set)
'data.frame':    19394 obs. of  8 variables:
 $ loan_status   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ loan_amnt     : int  25000 16000 8500 9800 3600 6600 3000 7500 6000 22750 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 4 1 2 1 1 1 2 1 1 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 1 1 3 4 3 4 1 ...
 $ annual_inc    : num  91000 45000 110000 102000 40000 ...
 $ age           : int  34 25 29 24 59 35 24 24 26 25 ...
 $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 2 3 1 4 1 1 1 4 1 1 ...

What is logistic regression?

  • A regression model with output between 0 and 1

$$P({\text{loan status}}=1|x_1,...,x_m) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_m x_m)}}$$

  • $x_1,...,x_m$: loan_amnt, grade, age, annual_inc, home_ownership, emp_cat, ir_cat
  • $\beta_0,...,\beta_m$: Parameters to be estimated

  • $\beta_0 + \beta_1 x_1 + ... + \beta_m x_m$: Linear predictor
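
A minimal sketch of the formula above in base R, where plogis() computes 1 / (1 + exp(-x)). The coefficient and input values are made up for illustration; they are not estimates from the course data.

```r
# Hypothetical parameters for a two-variable example (NOT fitted values)
beta <- c(intercept = -2, x1 = 0.5, x2 = -0.03)
x    <- c(x1 = 1.2, x2 = 40)

# Linear predictor: beta_0 + beta_1 * x_1 + beta_2 * x_2
linear_predictor <- beta[["intercept"]] + sum(beta[c("x1", "x2")] * x)

# Logistic transformation, written out and via the built-in plogis()
p_manual  <- 1 / (1 + exp(-linear_predictor))
p_builtin <- plogis(linear_predictor)

# Any real-valued linear predictor is mapped into the interval (0, 1)
p_range <- plogis(c(-10, 0, 10))
```

Both p_manual and p_builtin should agree, and p_range stays strictly between 0 and 1 however extreme the linear predictor is.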

Fitting a logistic model in R

log_model <- glm(loan_status ~ age,
                 family = "binomial", data = training_set)
log_model
Call:  glm(formula = loan_status ~ age, 
           family = "binomial", data = training_set)
Coefficients:
(Intercept)          age  
  -1.793566    -0.009726  
Degrees of Freedom: 19393 Total (i.e. Null);  19392 Residual
Null Deviance:     13680 
Residual Deviance: 13670     AIC: 13670

$$P({\text{loan status}}=1|\text{age}) = \frac{1}{1+e^{-(\hat{\beta}_0 + \hat{\beta}_1 \text{age})}}$$
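
A sketch of how fitted probabilities could be obtained from such a model. Since training_set is not available here, the code simulates a stand-in data set (toy, a hypothetical name) using the coefficients printed above as the data-generating values. The names toy, toy_model and new_applicants are assumptions for this example only.

```r
# Hypothetical stand-in for training_set: simulated data, NOT the course data
set.seed(1)
n   <- 2000
toy <- data.frame(age = sample(20:70, n, replace = TRUE))

# Simulate defaults, using the coefficients reported above as the "true" values
p_true <- plogis(-1.793566 - 0.009726 * toy$age)
toy$loan_status <- factor(rbinom(n, size = 1, prob = p_true),
                          levels = c("0", "1"))

# Same model specification as on the slide, fitted to the simulated data
toy_model <- glm(loan_status ~ age, family = "binomial", data = toy)

# Predicted probabilities of default for three illustrative applicants
new_applicants <- data.frame(age = c(25, 40, 60))
p_predict <- predict(toy_model, newdata = new_applicants, type = "response")

# The same probabilities from the formula, using the estimated coefficients
b        <- coef(toy_model)
p_manual <- plogis(b[["(Intercept)"]] + b[["age"]] * new_applicants$age)
```

With type = "response", predict() returns probabilities rather than values of the linear predictor, so p_predict and p_manual should coincide.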

Probabilities of default

$$P({\text{loan status}}=1|x_1,...,x_m) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_m x_m)}} = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}$$

$$P({\text{loan status}}=0|x_1,...,x_m) = 1- \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}} = \frac{1}{1+e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}$$

$$\frac{P({\text{loan status}}=1|x_1,...,x_m)}{P({\text{loan status}}=0|x_1,...,x_m)} = e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}$$

  • Odds in favor of loan_status = 1
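
The odds identity can be checked numerically. This sketch uses the coefficients from the age-only model output; the ages are arbitrary illustrative values.

```r
# Coefficients copied from the age-only model output
beta0 <- -1.793566
beta1 <- -0.009726
age   <- c(25, 40, 60)   # illustrative ages, chosen arbitrarily

linear_predictor <- beta0 + beta1 * age
p_default    <- plogis(linear_predictor)   # P(loan_status = 1 | age)
p_no_default <- 1 - p_default              # P(loan_status = 0 | age)

# Odds in favor of default
odds <- p_default / p_no_default

# The odds should match exp(linear predictor) up to floating-point error
odds_check <- all.equal(odds, exp(linear_predictor))
```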

Interpretation of coefficient

  • If variable $x_j$ goes up by 1
    • The odds are multiplied by $e^{\beta_j}$
  • $\beta_j < 0$
    • $e^{\beta_j} < 1$
    • The odds decrease as $x_j$ increases
  • $\beta_j > 0$
    • $e^{\beta_j} > 1$
    • The odds increase as $x_j$ increases

Applied to our model:

  • If variable age goes up by 1
    • The odds are multiplied by $e^{-0.009726}$
    • The odds are multiplied by approximately 0.990, a decrease of about 1% per additional year of age
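
A short sketch of this calculation in R, using the two coefficients from the model output. The helper odds_at() is a hypothetical name introduced for this example.

```r
beta0    <- -1.793566   # intercept from the model output
beta_age <- -0.009726   # age coefficient from the model output

# Hypothetical helper: odds of default at a given age under the fitted model
odds_at <- function(age) exp(beta0 + beta_age * age)

# Multiplicative effect on the odds of one additional year of age
odds_multiplier <- exp(beta_age)

# The same quantity computed as a ratio of odds one year apart
ratio_one_year <- odds_at(31) / odds_at(30)

# For a ten-year difference the multiplier is exp(10 * beta_age)
ratio_ten_years <- odds_at(40) / odds_at(30)

round(odds_multiplier, 4)
```

The multiplier does not depend on the starting age: the ratio between ages 31 and 30 is the same as between any two ages one year apart.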

Let's practice!
