Assessing model performance

Intermediate Regression with statsmodels in Python

Maarten Van den Broeck

Content Developer at DataCamp

Model performance metrics

  • Coefficient of determination (R-squared): how well the linear regression line fits the observed values.

    • Larger is better.
  • Residual standard error (RSE): the typical size of the residuals.

    • Smaller is better.
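
As a minimal sketch of what the coefficient of determination measures, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the toy values are assumptions made for illustration; they are not taken from the course's fish dataset.

```python
def r_squared(observed, fitted):
    """Return 1 - SS_residual / SS_total for paired observed and fitted values."""
    mean_observed = sum(observed) / len(observed)
    # Variation left unexplained by the model
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    # Total variation of the observed values around their mean
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    return 1 - ss_residual / ss_total

# Made-up values for illustration only (hypothetical, not the fish data)
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]

toy_rsq = r_squared(observed, fitted)
print("toy_rsq: ", toy_rsq)
```

A perfect fit leaves no residual variation, so the function returns 1; fitted values that explain less of the variation give smaller results.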

Getting the coefficient of determination

print(mdl_mass_vs_length.rsquared)
0.8225689502644215
print(mdl_mass_vs_species.rsquared)
0.25814887709499157
print(mdl_mass_vs_both.rsquared)
0.9200433561156649

Adjusted coefficient of determination

  • Adding explanatory variables never decreases $R^2$, and usually increases it.
  • Using too many explanatory variables causes overfitting.
  • The adjusted coefficient of determination penalizes additional explanatory variables.
  • $\bar{R}^2 = 1 - (1 - R^2) \frac{n_{obs} - 1}{n_{obs} - n_{var} - 1}$
  • The penalty is noticeable when $R^2$ is small, or when $n_{var}$ is a large fraction of $n_{obs}$.
  • In statsmodels, it's contained in the rsquared_adj attribute.
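
The formula above can be sketched as a small function. The name `adjusted_r_squared` is hypothetical. The observation count of 128 and the variable counts of 1, 3, and 4 are assumptions: the slides do not state them, but they are consistent with the values the three models print for `rsquared` and `rsquared_adj`.

```python
def adjusted_r_squared(rsq, n_obs, n_var):
    """Apply the adjusted R-squared formula to a plain R-squared value."""
    return 1 - (1 - rsq) * (n_obs - 1) / (n_obs - n_var - 1)

# Assumed, not stated in the slides: 128 observations in the dataset
n_obs = 128

# R-squared values as printed for the three models; n_var values are assumptions
rsq_adj_length = adjusted_r_squared(0.8225689502644215, n_obs, n_var=1)
rsq_adj_species = adjusted_r_squared(0.25814887709499157, n_obs, n_var=3)
rsq_adj_both = adjusted_r_squared(0.9200433561156649, n_obs, n_var=4)

print("rsq_adj_length: ", rsq_adj_length)
print("rsq_adj_species: ", rsq_adj_species)
print("rsq_adj_both: ", rsq_adj_both)
```

Under these assumptions the penalty is largest for the species model, which combines a small $R^2$ with several explanatory variables.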

Getting the adjusted coefficient of determination

print("rsq_length: ", mdl_mass_vs_length.rsquared)
print("rsq_adj_length: ", mdl_mass_vs_length.rsquared_adj)
rsq_length:  0.8225689502644215
rsq_adj_length:  0.8211607673300121
print("rsq_species: ", mdl_mass_vs_species.rsquared)
print("rsq_adj_species: ", mdl_mass_vs_species.rsquared_adj)
rsq_species:  0.25814887709499157
rsq_adj_species:  0.24020086605696722
print("rsq_both: ", mdl_mass_vs_both.rsquared)
print("rsq_adj_both: ", mdl_mass_vs_both.rsquared_adj)
rsq_both:  0.9200433561156649
rsq_adj_both:  0.9174431400543857

Getting the residual standard error

import numpy as np

rse_length = np.sqrt(mdl_mass_vs_length.mse_resid)
print("rse_length: ", rse_length)
rse_length:  152.12092835414788
rse_species = np.sqrt(mdl_mass_vs_species.mse_resid)
print("rse_species: ", rse_species)
rse_species:  313.5501156682592
rse_both = np.sqrt(mdl_mass_vs_both.mse_resid)
print("rse_both: ", rse_both)
rse_both:  103.35563303966488
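
To show where these numbers come from, here is a minimal sketch of the same calculation done by hand. In statsmodels, `mse_resid` is the sum of squared residuals divided by the residual degrees of freedom, so the RSE is the square root of that ratio. The function name `residual_standard_error` and the toy values are assumptions for illustration, not the fish data.

```python
import math

def residual_standard_error(observed, fitted, n_coeffs):
    """Square root of the residual sum of squares over the residual degrees of freedom."""
    # Degrees of freedom: observations minus estimated coefficients (intercept included)
    residual_df = len(observed) - n_coeffs
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return math.sqrt(ss_residual / residual_df)

# Made-up values for illustration only (hypothetical, not the fish data)
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]

# Assume a model with an intercept and one slope, i.e. two coefficients
toy_rse = residual_standard_error(observed, fitted, n_coeffs=2)
print("toy_rse: ", toy_rse)
```

Because the result is in the same units as the response variable, the RSE values above can be read directly as typical prediction errors in mass.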

Let's practice!
