R-squared ($R^2$)

Supervised Learning in R: Regression

Nina Zumel and John Mount

Win-Vector LLC

What is $R^2$?

A measure of how well the model fits or explains the data

  • A value between 0 and 1
    • near 1: model fits well
    • near 0: no better than guessing the average value
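The two ends of the scale can be checked on a tiny example. This is a minimal sketch: the vector `y` is made-up toy data (not the house prices), and the helper name `r_squared` is an assumption introduced for illustration.

```r
# Hypothetical toy outcome values (not from the house price data)
y <- c(10, 12, 15, 19, 24)

# Illustrative helper: R-squared as 1 - RSS / SSTot
r_squared <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# "Guessing the average": predict mean(y) for every row
r2_mean <- r_squared(y, rep(mean(y), length(y)))

# Perfect predictions
r2_perfect <- r_squared(y, y)

r2_mean     # 0
r2_perfect  # 1
```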

Calculating $R^2$

$R^2$ is the fraction of the variance in the data that is explained by the model.

$$ R^2 = 1 - \frac{RSS}{SS_{Tot}} $$

where

  • $RSS = \sum{(y - prediction)^2}$
    • Residual sum of squares (variation left unexplained by the model)
  • $SS_{Tot} = \sum{(y - \overline{y})^2}$
    • Total sum of squares (total variation of the data around its mean)
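The formula can be traced by hand on a small case. The sketch below uses hypothetical toy values for `y` and `prediction`, chosen so the arithmetic is easy to follow; they are not from the course data.

```r
# Hypothetical toy data, chosen so the sums are easy to verify by hand
y          <- c(1, 2, 3, 4, 5)
prediction <- c(1.5, 2, 2.5, 4.5, 4.5)

# Residual sum of squares: 0.25 + 0 + 0.25 + 0.25 + 0.25
rss <- sum((y - prediction)^2)

# Total sum of squares around mean(y) = 3: 4 + 1 + 0 + 1 + 4
sstot <- sum((y - mean(y))^2)

r2 <- 1 - rss / sstot

rss    # 1
sstot  # 10
r2     # 0.9
```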

Calculate $R^2$ of the House Price Model: RSS

  • Calculate error
err <- houseprices$prediction - houseprices$price
  • Square it and take the sum
rss <- sum(err^2)
  • price: column of actual sale prices (in thousands)
  • prediction: column of predicted sale prices (in thousands)
  • $RSS \approx$ 136138

Calculate $R^2$ of the House Price Model: $SS_{Tot}$

  • Take the difference of prices from the mean price
toterr <- houseprices$price - mean(houseprices$price)
  • Square it and take the sum
sstot <- sum(toterr^2)
  • $RSS \approx$ 136138
  • $SS_{Tot} \approx$ 713615

Calculate $R^2$ of the House Price Model

(r_squared <- 1 - (rss/sstot) )
0.8092278
  • $RSS \approx$ 136138
  • $SS_{Tot} \approx$ 713615
  • $R^2 \approx$ 0.809
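As a quick arithmetic check, the rounded values of $RSS$ and $SS_{Tot}$ reported above are enough to reproduce $R^2$ to three decimals. The sketch hard-codes those rounded numbers, so the result matches the exact value only approximately.

```r
# Rounded values as reported on the slides
rss   <- 136138
sstot <- 713615

r_squared <- 1 - rss / sstot
round(r_squared, 3)  # 0.809
```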

Reading $R^2$ from the lm() model

# From summary()
summary(hmodel)
...
Residual standard error: 60.66 on 37 degrees of freedom
Multiple R-squared:  0.8092, Adjusted R-squared:  0.7989 
F-statistic: 78.47 on 2 and 37 DF,  p-value: 4.893e-14
summary(hmodel)$r.squared
0.8092278
# From glance() in the broom package
library(broom)
glance(hmodel)$r.squared
0.8092278
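The same comparison can be run end to end on data that ships with R. This is a sketch under stated assumptions: the built-in `cars` data set and the formula `dist ~ speed` stand in for `houseprices` and `hmodel`, which are not available here. The `broom` step runs only if that package is installed.

```r
# Built-in `cars` data stands in for houseprices (an assumption for illustration)
model <- lm(dist ~ speed, data = cars)

# R-squared as stored in the model summary
r2_summary <- summary(model)$r.squared

# The same quantity from the definition 1 - RSS / SSTot
pred  <- predict(model)                        # predictions on the training data
rss   <- sum((cars$dist - pred)^2)
sstot <- sum((cars$dist - mean(cars$dist))^2)
r2_manual <- 1 - rss / sstot

all.equal(r2_summary, r2_manual)  # TRUE

# broom::glance() reports the same value, if broom is installed
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::glance(model)$r.squared)
}
```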

Correlation and $R^2$

(rho <- cor(houseprices$prediction, houseprices$price))
0.8995709
rho^2
0.8092278
  • $\rho$ = cor(prediction, price) = 0.8995709
  • $\rho^2$ = 0.8092278 = $R^2$

Correlation and $R^2$

  • $\rho^2 = R^2$ holds for models that minimize squared error:
    • Linear regression
    • GAM regression
    • Tree-based algorithms that minimize squared error
  • Holds on the training data; NOT guaranteed on future application data
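The training-versus-new-data distinction can be seen with a simple holdout split. This is a minimal sketch, assuming the built-in `cars` data as a stand-in; the seed, the 35/15 split, and the helper name `r_squared` are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Built-in `cars` data as a stand-in; the 35/15 split is an arbitrary choice
train_idx <- sample(nrow(cars), 35)
train <- cars[train_idx, ]
test  <- cars[-train_idx, ]

model <- lm(dist ~ speed, data = train)

# Illustrative helper: R-squared as 1 - RSS / SSTot
r_squared <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Training data: squared correlation and R-squared coincide
pred_train <- predict(model, newdata = train)
rho2_train <- cor(pred_train, train$dist)^2
r2_train   <- r_squared(train$dist, pred_train)
all.equal(rho2_train, r2_train)  # TRUE

# Held-out data: computed separately, the two need not agree
pred_test <- predict(model, newdata = test)
rho2_test <- cor(pred_test, test$dist)^2
r2_test   <- r_squared(test$dist, pred_test)
c(rho2 = rho2_test, r2 = r2_test)
```

On held-out data, $R^2$ computed from the definition is the safer number to report, since the squared correlation ignores any systematic offset or scaling error in the predictions.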

Let's practice!
