Introduction to logistic regression

HR Analytics: Predicting Employee Churn in R

Anurag Gupta

People Analytics Practitioner

What is logistic regression?

  • Classification technique
  • Predicts the probability of occurrence of an event
  • Dependent variable is categorical
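To make "predicts the probability of occurrence of an event" concrete, here is a minimal sketch of the logistic (inverse-logit) function that turns any real-valued score into a probability. The helper name `logistic` and the input values are illustrative assumptions, not part of the course material.

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1).
# `logistic` is a hypothetical helper name used only for this illustration.
logistic <- function(z) 1 / (1 + exp(-z))

# A few arbitrary scores on the linear-predictor scale
z <- c(-4, -1, 0, 1, 4)
probs <- logistic(z)
print(round(probs, 3))

# Base R provides the same function as plogis()
print(all.equal(probs, plogis(z)))
```

A score of 0 corresponds to a probability of 0.5; large negative scores approach 0 and large positive scores approach 1.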

Understanding logistic regression

  • Independent variables

    • Continuous / Categorical
    • age, tenure, compensation, level, etc.
  • Dependent variable

    • Binary / Dichotomous variable
    • turnover (1, 0)

Building a simple logistic regression model

simple_log <- glm(turnover ~ emp_age, 
                  family = "binomial", data = train_set)
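Because `train_set` is not available outside the course, the following is a self-contained sketch of the same kind of model on simulated data. The data frame `toy`, the simulated age effect, and the ages used for prediction are all assumptions made for illustration; none of these numbers come from the course data.

```r
set.seed(42)

# Simulated stand-in for train_set (hypothetical data, not the course data)
n <- 500
toy <- data.frame(emp_age = round(runif(n, 22, 55)))

# Assumed relationship: turnover becomes less likely with age
toy$turnover <- rbinom(n, size = 1, prob = plogis(2 - 0.1 * toy$emp_age))

# Same model form as simple_log: one binary outcome, one predictor
toy_fit <- glm(turnover ~ emp_age, family = "binomial", data = toy)

# type = "response" returns probabilities rather than log-odds
new_emp <- data.frame(emp_age = c(25, 40, 55))
pred <- predict(toy_fit, newdata = new_emp, type = "response")
print(pred)
```

Without `type = "response"`, `predict()` on a `glm` object returns values on the log-odds scale.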
summary(simple_log)
Call:
glm(formula = turnover ~ emp_age, family = "binomial", data = train_set)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9431  -0.7406  -0.6107  -0.4006   2.4334  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.58131    0.58684   4.399 1.09e-05 ***
emp_age     -0.13864    0.02093  -6.623 3.52e-11 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1389.4  on 1367  degrees of freedom
Residual deviance: 1338.6  on 1366  degrees of freedom
AIC: 1342.6

Number of Fisher Scoring iterations: 4
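One way to read the coefficients in this output is to convert them to an odds ratio and to implied probabilities. The sketch below copies the two estimates from the summary above; the ages chosen are arbitrary illustrations.

```r
# Estimates copied from the summary(simple_log) output
b0 <- 2.58131    # (Intercept)
b1 <- -0.13864   # emp_age

# Odds ratio for one additional year of age
odds_ratio <- exp(b1)

# Implied turnover probability at a few illustrative ages (assumed values)
ages <- c(25, 35, 45)
p_hat <- plogis(b0 + b1 * ages)

print(round(odds_ratio, 3))
print(round(p_hat, 3))
```

An odds ratio of roughly 0.87 means that, under this model, each additional year of age is associated with about 13% lower odds of turnover.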

Removing variables

  • emp_id, mgr_id (ID columns)
  • date_of_joining, last_working_date, cutoff_date (tenure is a linear combination of these columns)
  • median_compensation (directly related to level)
  • mgr_age, emp_age (age_diff is a linear combination of these columns)
  • department (only one possible value)
  • status (same as turnover)
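Two of the reasons listed above, a column with only one possible value and a column that duplicates the outcome, can be checked programmatically. This is a rough sketch on a tiny made-up data frame; the column values (including the "Active"/"Inactive" labels) are hypothetical and only mimic the situation described.

```r
# Hypothetical miniature data set, not the course data
toy <- data.frame(
  emp_id     = 1:6,
  department = rep("Sales", 6),
  status     = c("Active", "Inactive", "Active", "Active", "Inactive", "Active"),
  turnover   = c(0, 1, 0, 0, 1, 0),
  tenure     = c(2.5, 1.0, 4.2, 3.3, 0.8, 5.1),
  stringsAsFactors = FALSE
)

# Columns with a single distinct value carry no information for the model
n_distinct <- sapply(toy, function(col) length(unique(col)))
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)

# A cross-tab with empty off-diagonal cells shows status mirrors turnover
status_table <- table(toy$status, toy$turnover)
print(status_table)
```

Checks like these only flag candidates; whether to drop a column remains a judgment call based on what the column means.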

Removing variables

# Load dplyr for the pipe (%>%) and select()
library(dplyr)

# Drop variables and save the resulting object as train_set_multi
train_set_multi <- train_set %>%
  select(-c(emp_id, mgr_id, 
            date_of_joining, last_working_date, cutoff_date, 
            mgr_age, emp_age, 
            median_compensation, 
            department, status))

Building a multiple logistic regression model

multi_log <- glm(turnover ~ ., family = "binomial", 
                 data = train_set_multi)
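In the formula `turnover ~ .`, the dot stands for every column in the data other than the response. The sketch below shows that expansion on a small made-up data frame; the values are hypothetical, and only the column names echo the course data.

```r
# Hypothetical miniature data frame
toy <- data.frame(
  turnover     = c(0, 1, 0, 1),
  tenure       = c(3.2, 0.9, 5.0, 1.4),
  percent_hike = c(10, 4, 12, 5)
)

# terms() expands the dot using the columns found in `data`
expanded <- terms(turnover ~ ., data = toy)
predictors <- attr(expanded, "term.labels")
print(predictors)
```

This is why the unwanted columns are dropped before fitting: anything left in the data frame becomes a predictor.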
summary(multi_log)
Call:
glm(formula = turnover ~ ., family = "binomial", data = train_set_multi)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4235  -0.1392  -0.0345  -0.0001   3.4580  
Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -1.348e+01  4.813e+00  -2.800 0.005104 ** 
locationNew York                1.264e+00  4.655e-01   2.715 0.006624 ** 
locationOrlando                -1.031e+00  4.200e-01  -2.455 0.014077 *  
levelSpecialist                 1.583e+01  9.695e+02   0.016 0.986971    
percent_hike                   -5.669e-01  8.102e-02  -6.997 2.61e-12 ***  
tenure                         -5.863e-01  1.192e-01  -4.920 8.65e-07 ***    
total_experience                8.598e-02  8.380e-02   1.026 0.304871    
.....
# We removed several variables for brevity
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
 Null deviance: 1389.37  on 1367  degrees of freedom
Residual deviance:  326.66  on 1326  degrees of freedom
AIC: 410.66

Number of Fisher Scoring iterations: 18
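The null and residual deviances printed above can be compared directly. As a hedged sketch, the code below computes a likelihood-ratio statistic from those printed numbers and checks that the reported AIC matches residual deviance plus twice the number of coefficients, which holds for a binomial model fitted to 0/1 outcomes. The variable names are illustrative.

```r
# Values copied from the summary(multi_log) output
null_dev  <- 1389.37
null_df   <- 1367
resid_dev <- 326.66
resid_df  <- 1326

# Drop in deviance relative to the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Number of estimated coefficients, including the intercept
n_coef <- null_df + 1 - resid_df
aic_check <- resid_dev + 2 * n_coef

print(c(lr_stat = lr_stat, lr_df = lr_df))
print(p_value)
print(aic_check)
```

The same arithmetic applied to the simple model (1338.6 + 2 × 2 = 1342.6) reproduces its AIC, so the much lower AIC here points to a substantially better fit on the training data.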

Let's practice!
