Logistic regression: introduction

Credit Risk Modeling in R

Lore Dirick

Manager of Data Science Curriculum at Flatiron School

Final data structure

str(training_set)
'data.frame':	19394 obs. of  8 variables:
 $ loan_status   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ loan_amnt     : int  25000 16000 8500 9800 3600 6600 3000 7500 6000 22750 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 4 1 2 1 1 1 2 1 1 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 1 1 3 4 3 4 1 ...
 $ annual_inc    : num  91000 45000 110000 102000 40000 ...
 $ age           : int  34 25 29 24 59 35 24 24 26 25 ...
 $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 2 3 1 4 1 1 1 4 1 1 ...

What is logistic regression?

  • A regression model whose output lies between 0 and 1

$$P({\text{loan status}}=1|x_1,...,x_m) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_m x_m)}}$$

  • $x_1,...,x_m$:
loan_amnt  grade  age  annual_inc  home_ownership  emp_cat  ir_cat
  • $\beta_0,...,\beta_m$: Parameters to be estimated

  • $\beta_0 + \beta_1 x_1 + ... + \beta_m x_m$: Linear predictor
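As a minimal sketch of the formula above, the following R snippet maps a linear predictor to a probability. The parameter values and the helper name `logistic` are made up for illustration; `plogis()` is base R's built-in logistic function.

```r
# Hypothetical helper: the logistic function applied to a linear predictor
logistic <- function(eta) 1 / (1 + exp(-eta))

# Illustrative (made-up) parameters: intercept beta_0 and one slope beta_1
beta <- c(-2, 0.5)
x1 <- c(-4, 0, 4)

eta <- beta[1] + beta[2] * x1   # linear predictor
p <- logistic(eta)              # probabilities, strictly between 0 and 1

# Base R's plogis() computes the same quantity
all.equal(p, plogis(eta))
```

However large or small the linear predictor becomes, the result stays between 0 and 1, which is what makes it usable as a probability of default.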

Fitting a logistic model in R

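# Fit a logistic regression of loan_status on age
log_model <- glm(loan_status ~ age,
                 family = "binomial", data = training_set)
# Print the fitted model (call, coefficients, deviance)
log_model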
Call:  glm(formula = loan_status ~ age, 
           family = "binomial", data = training_set)
Coefficients:
(Intercept)          age  
  -1.793566    -0.009726  
Degrees of Freedom: 19393 Total (i.e. Null);  19392 Residual
Null Deviance:	    13680 
Residual Deviance: 13670 	AIC: 13670

$$P({\text{loan status}}=1|\text{age}) = \frac{1}{1+e^{-(\hat{\beta_0} + \hat{\beta_1} \text{age})}}$$
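To make the fitted formula concrete, here is a small sketch that plugs the two estimated coefficients from the printed output into the logistic function. The ages are hypothetical values chosen for illustration and are not taken from the data.

```r
# Coefficients copied from the printed model output above
b0 <- -1.793566   # (Intercept)
b1 <- -0.009726   # age

# Hypothetical applicant ages (illustrative values only)
ages <- c(25, 40, 60)

eta <- b0 + b1 * ages               # linear predictor for each age
p_default <- 1 / (1 + exp(-eta))    # estimated P(loan_status = 1 | age)
round(p_default, 3)
```

With the fitted object available, `predict(log_model, newdata = data.frame(age = ages), type = "response")` should give the same probabilities, up to the rounding of the printed coefficients.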

Probabilities of default

$$P({\text{loan status}}=1|x_1,...,x_m) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_m x_m)}} = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}$$

$$P({\text{loan status}}=0|x_1,...,x_m) = 1- \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}} = \frac{1}{1+e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}}$$

$$\frac{P({\text{loan status}}=1|x_1,...,x_m)}{P({\text{loan status}}=0|x_1,...,x_m)} = e^{\beta_0 + \beta_1 x_1 + ... + \beta_m x_m}$$

  • Odds in favor of loan_status = 1
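The odds identity can be checked numerically. The sketch below reuses the coefficients of the age-only model printed earlier and a hypothetical age of 30, and confirms that the ratio of the two probabilities equals the exponential of the linear predictor.

```r
# Coefficients from the age-only model output; the age is a hypothetical value
b0 <- -1.793566
b1 <- -0.009726
age <- 30

eta <- b0 + b1 * age                 # linear predictor
p1 <- exp(eta) / (1 + exp(eta))      # P(loan_status = 1 | age)
p0 <- 1 / (1 + exp(eta))             # P(loan_status = 0 | age)

odds <- p1 / p0                      # odds in favor of default

all.equal(p1 + p0, 1)                # the two probabilities sum to 1
all.equal(odds, exp(eta))            # odds equal exp(linear predictor)
```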

Interpreting a coefficient

  • If variable $x_j$ increases by 1
    • The odds are multiplied by $e^{\beta_j}$
  • $\beta_j < 0$
    • $e^{\beta_j} < 1$
    • The odds decrease as $x_j$ increases
  • $\beta_j > 0$
    • $e^{\beta_j} > 1$
    • The odds increase as $x_j$ increases

Applied to our model:

  • If age increases by 1
    • The odds are multiplied by $e^{-0.009726}$
    • The odds are multiplied by approximately 0.990, a decrease of about 1% per additional year of age
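The multiplier can be computed directly in R. This sketch uses the age coefficient from the printed model output. The ten-year comparison is an added illustration, relying on the fact that the odds multiplier compounds per unit increase.

```r
# Age coefficient copied from the printed model output
b_age <- -0.009726

# Factor by which the odds change when age increases by 1 year
odds_ratio_1yr <- exp(b_age)

# Illustration: a 10-year increase multiplies the odds by exp(10 * b_age),
# which is the one-year factor applied ten times
odds_ratio_10yr <- exp(10 * b_age)

round(c(one_year = odds_ratio_1yr, ten_years = odds_ratio_10yr), 4)
all.equal(odds_ratio_10yr, odds_ratio_1yr^10)
```

With the fitted object available, `exp(coef(log_model))` returns the same kind of multiplier for every coefficient in the model.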

Let's practice!
