Detecting and dealing with multicollinearity

HR Analytics: Predicting Employee Churn in R

Anurag Gupta

People Analytics Practitioner

Understanding correlation

Correlation measures the strength and direction of the linear association between two numeric variables. The coefficient ranges from -1 to 1.

Calculating correlation in R

# Calculate the correlation coefficient
cor(train_set$emp_age, train_set$compensation)
[1] 0.6117855
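
To make the coefficient more concrete, here is a minimal sketch that reproduces what cor() computes by default (the Pearson coefficient) from its definition. The two vectors are made-up illustrative values, not the course's train_set, so the result will differ from the number above.

```r
# Hypothetical toy data (NOT taken from train_set)
emp_age      <- c(25, 30, 35, 40, 45, 50)
compensation <- c(40, 52, 49, 63, 70, 68)

# Pearson correlation: covariance scaled by both standard deviations
r_manual <- cov(emp_age, compensation) / (sd(emp_age) * sd(compensation))

# cor() uses method = "pearson" by default, so the two values should agree
r_builtin <- cor(emp_age, compensation)

all.equal(r_manual, r_builtin)
```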

What is multicollinearity?

Multicollinearity occurs when one independent variable is highly correlated with a combination of two or more other independent variables, so that it can be closely predicted from them.
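
The following sketch illustrates why pairwise correlation alone can miss multicollinearity. It uses simulated data (the names x1 to x4 and all parameters are assumptions for illustration): x4 is built as nearly the sum of three independent variables, so each pairwise correlation is only moderate, yet the three variables together predict x4 almost perfectly.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# x4 is (almost) a linear combination of x1, x2 and x3
x4 <- x1 + x2 + x3 + rnorm(n, sd = 0.1)

# Pairwise correlations with x4 are only moderate (roughly 0.6 in theory)
pairwise <- cor(cbind(x1, x2, x3), x4)
round(pairwise, 2)

# ...but jointly x1, x2 and x3 explain nearly all of the variance in x4
r_squared <- summary(lm(x4 ~ x1 + x2 + x3))$r.squared
r_squared
```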

How to detect multicollinearity?

VIF (Variance Inflation Factor)
# Load car package 
library(car)

# Logistic regression model
multi_log <- glm(turnover ~ ., family = "binomial",
                 data = train_set_multi)

# Calculate VIF  
vif(multi_log)

Variance inflation factor

                                     GVIF Df GVIF^(1/(2*Df))
location                     2.318640e+00  2        1.233981
level                        5.716850e+06  1     2390.993458
gender                       1.262625e+00  1        1.123666
rating                       4.381767e+00  4        1.202835
mgr_rating                   2.471489e+00  4        1.119747
mgr_reportees                1.314709e+00  1        1.146608
mgr_tenure                   1.278559e+00  1        1.130734
compensation                 3.998338e+01  1        6.323241
percent_hike                 3.167576e+00  1        1.779769
hiring_score                 1.143613e+00  1        1.069399
hiring_source                2.000099e+00  6        1.059467
no_previous_companies_worked 3.291703e+00  1        1.814305
distance_from_home           1.355795e+00  1        1.164386
total_dependents             1.930188e+00  1        1.389312
marital_status               2.320518e+00  1        1.523325
education                    1.460697e+00  1        1.208593
.....
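
As a quick check on how the three columns relate, the sketch below recomputes the last column from the first two for a few rows copied from the output above. The column header itself gives the formula, GVIF^(1/(2*Df)); for a variable with Df = 1 this is simply the square root of GVIF.

```r
# Values copied from the output above (three rows only)
gvif <- c(location = 2.318640, rating = 4.381767, compensation = 39.98338)
df   <- c(location = 2,        rating = 4,        compensation = 1)

# Third column: GVIF raised to the power 1 / (2 * Df)
adjusted <- gvif^(1 / (2 * df))

# Should match the printed third column up to rounding
round(adjusted, 4)
```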

Rule of thumb for interpreting VIF value

VIF                Interpretation
1                  Not correlated
Between 1 and 5    Moderately correlated
Greater than 5     Highly correlated
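
The rule of thumb can be written as a small helper. This is only a sketch: interpret_vif is a hypothetical function name, and treating a VIF of exactly 5 as "moderately correlated" is an assumption, since the table does not say which side the boundary belongs to.

```r
# Hypothetical helper applying the rule of thumb above
# Assumption: a VIF of exactly 5 counts as "Moderately correlated"
interpret_vif <- function(vif_value) {
  ifelse(vif_value > 5, "Highly correlated",
         ifelse(vif_value > 1, "Moderately correlated", "Not correlated"))
}

labels <- interpret_vif(c(1, 2.3, 5, 39.98))
labels
```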

How to deal with multicollinearity?

  • Step 1: Calculate VIF of the model
  • Step 2: Identify if any variable has VIF greater than 5
    • Step 2a: Remove the variable from the model if it has a VIF greater than 5
    • Step 2b: If there are multiple variables with VIF greater than 5, only remove the variable with the highest VIF
  • Step 3: Repeat steps 1 and 2 until no variable has a VIF greater than 5
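
The steps above can be sketched as a loop. To keep the example self-contained it uses simulated numeric predictors and a hand-written helper, vif_manual (a hypothetical name), that computes the classic linear-model VIF, 1 / (1 - R^2). This is an assumption made for illustration: the values that vif() from the car package reports for a fitted glm can differ, but the removal logic is the same.

```r
# Classic VIF for numeric predictors: 1 / (1 - R^2), where R^2 comes from
# regressing each predictor on all the other predictors
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors (assumed for illustration): x4 is nearly x1 + x2
set.seed(1)
n   <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$x4 <- toy$x1 + toy$x2 + rnorm(n, sd = 0.1)

threshold <- 5
removed   <- character(0)
repeat {
  vifs <- vif_manual(toy)               # Step 1: calculate VIF
  if (max(vifs) <= threshold) break     # Step 3: stop when none exceeds 5
  worst   <- names(which.max(vifs))     # Step 2b: only the highest VIF
  removed <- c(removed, worst)
  toy[[worst]] <- NULL                  # Step 2a: drop that variable
}

removed      # variables dropped, in order
names(toy)   # variables kept
```

Removing one variable at a time matters because dropping the worst offender usually lowers the VIF of the variables it was entangled with, so they may not need to be removed at all.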

Removing a variable from a model

new_model <- glm(dependent_variable ~ . - variable_to_remove, 
                 family = "binomial", data = dataset)
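
A concrete instance of the template above, as a minimal sketch on simulated data. The data frame and its values are made up for illustration, and compensation is removed here only as an example. The update() call is an alternative base R idiom that refits an existing model with one term dropped.

```r
# Simulated data (assumed for illustration, NOT the course dataset)
set.seed(123)
n <- 200
dataset <- data.frame(
  turnover     = rbinom(n, 1, 0.3),
  emp_age      = rnorm(n, mean = 35, sd = 8),
  compensation = rnorm(n, mean = 60, sd = 15),
  percent_hike = rnorm(n, mean = 10, sd = 3)
)

# Full model with every predictor
full_model <- glm(turnover ~ ., family = "binomial", data = dataset)

# Same model without compensation, following the template above
new_model <- glm(turnover ~ . - compensation,
                 family = "binomial", data = dataset)

# Equivalent: refit the existing model with the term removed
new_model_2 <- update(full_model, . ~ . - compensation)

names(coef(new_model))
```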

Let's practice!
