Model selection: regression models

Practicing Machine Learning Interview Questions in Python

Lisa Stuart

Data Scientist

Multicollinearity

  • High correlation among independent variables
  • Estimated regression coefficients
    • Change in the dependent variable (DV) explained by a one-unit change in an independent variable (IV)
    • While holding the other variables constant

Multicollinearity

Image source: https://eigenblogger.com/2010/03/26/post1426/

Effects of multicollinearity

  • Unreliable coefficient estimates (magnitude and sign can shift)
  • Inflated p-values
  • Unstable, inflated variance of the coefficient estimates
  • Overfitting
  • Decreased statistical significance due to increased standard error
  • True relationship with target variable unclear
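
The inflation of standard errors can be seen in a small simulation. This is a minimal sketch on synthetic data, not part of the original slides; the helper `ols_standard_errors` and all variable names are hypothetical. The same true relationship is fit twice, once with independent predictors and once with a second predictor that is nearly a copy of the first.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficients and their standard errors (intercept first)."""
    X1 = np.column_stack([np.ones(len(X)), X])    # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dof = X1.shape[0] - X1.shape[1]               # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # estimated noise variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)       # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                     # unrelated to x1
x2_collinear = x1 + 0.05 * rng.normal(size=n)     # nearly a copy of x1
noise = rng.normal(size=n)

# Same true relationship in both cases: y = 2*x1 + 1*x2 + noise
y_indep = 2 * x1 + x2_indep + noise
y_coll = 2 * x1 + x2_collinear + noise

_, se_indep = ols_standard_errors(np.column_stack([x1, x2_indep]), y_indep)
_, se_coll = ols_standard_errors(np.column_stack([x1, x2_collinear]), y_coll)

# Index 0 is the intercept, so slice it off before comparing
print("std. errors, independent predictors:", se_indep[1:])
print("std. errors, collinear predictors:  ", se_coll[1:])
```

Larger standard errors mean wider confidence intervals and larger p-values for the same underlying effect, which is why the true relationship with the target becomes unclear.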

Techniques to address multicollinearity

  • Detect
    • Correlation matrix
    • Heatmap of correlations
    • Calculate the variance inflation factor (VIF)
  • Remedy
    • Introduce penalization (Ridge, Lasso)
    • PCA
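
The penalization and PCA options can be sketched with scikit-learn as follows. The synthetic data and the `alpha` values are assumptions chosen only for illustration; in practice the penalty strength would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)    # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + x2 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)    # L2 penalty shrinks the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)     # L1 penalty can set some to exactly zero

# PCA replaces the correlated columns with uncorrelated components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
component_corr = np.corrcoef(X_pca, rowvar=False)[0, 1]

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
print("correlation between principal components:", component_corr)
```

Ridge and Lasso trade a little bias for more stable coefficients, while PCA removes the correlation at the cost of components that are harder to interpret than the original variables.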

Correlation matrix vs heatmap

Heatmap
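
A correlation matrix and its heatmap can be produced along these lines. This is a sketch on made-up data: the column names `sqft`, `rooms`, and `age` are hypothetical, and the `Agg` backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 300
sqft = rng.normal(1500, 300, size=n)
data = pd.DataFrame({
    "sqft": sqft,
    "rooms": sqft / 300 + rng.normal(0, 0.3, size=n),  # driven by sqft
    "age": rng.normal(30, 10, size=n),                 # unrelated to the others
})

# Correlation matrix: pairwise Pearson correlations as a table of numbers
corr = data.corr()
print(corr.round(2))

# Heatmap: the same numbers, color-coded so strong pairs stand out
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
fig = ax.figure  # call fig.savefig(...) or plt.show() to view the plot
```

The matrix is better for reading exact values, while the heatmap makes highly correlated pairs easier to spot when there are many variables.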


Variance inflation factor

VIF value    Multicollinearity
= 1          no
> 1 to 5     yes, but can ignore
> 5          yes, need to address
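
For a predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the other predictors, so the smallest possible value is 1. A minimal sketch of that calculation with scikit-learn follows; the helper `compute_vif` and the synthetic columns are hypothetical, not from the course.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_vif(features: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing it on the other columns."""
    vifs = {}
    for col in features.columns:
        others = features.drop(columns=col)
        target = features[col]
        r2 = LinearRegression().fit(others, target).score(others, target)
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

rng = np.random.default_rng(7)
n = 500
a = rng.normal(size=n)
features = pd.DataFrame({
    "a": a,
    "b": a + 0.2 * rng.normal(size=n),  # strongly related to "a"
    "c": rng.normal(size=n),            # unrelated to the others
})

vif = compute_vif(features)
print(vif.round(2))
```

Columns whose VIF lands above 5 are the ones the table above flags as needing attention; here that is expected for "a" and "b" but not for "c".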

Functions

Function/method                          Returns
sklearn.linear_model.LinearRegression    linear regression model
data.corr()                              correlation matrix
sns.heatmap(corr)                        heatmap of correlations
mod.coef_                                estimated model coefficients
mean_squared_error(y_test, y_pred)       MSE
r2_score(y_test, y_pred)                 R-squared score
df.columns                               column names
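
The functions above fit together roughly as follows. This is a sketch on synthetic data; the DataFrame `df`, its columns, and the train/test split settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
# Known relationship, so the fitted coefficients can be sanity-checked
df["y"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.5, size=n)

X = df.drop(columns="y")
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

mod = LinearRegression().fit(X_train, y_train)
y_pred = mod.predict(X_test)

# Pair each estimated coefficient with its column name
coefs = pd.Series(mod.coef_, index=X.columns)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(coefs.round(2))
print("MSE:", round(mse, 3), "R-squared:", round(r2, 3))
```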

Let's practice!
