R-squared ($R^2$)

Supervised Learning in R: Regression

Nina Zumel and John Mount

Win-Vector LLC

What is $R^2$?

A measure of how well the model fits or explains the data

A value between 0 and 1
- near 1: model fits well
- near 0: no better than guessing the average value

Calculating $R^2$

$R^2$ is the fraction of the data's variance that is explained by the model.

$$ R^2 = 1 - \frac{RSS}{SS_{Tot}} $$

where

$RSS = \sum{(y - prediction)^2}$
- Residual sum of squares (variation the model leaves unexplained)
$SS_{Tot} = \sum{(y - \overline{y})^2}$
- Total sum of squares (variation of the data around its mean)
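
The two sums can be wrapped in a small helper. This is a minimal sketch: the function name `calc_r_squared` and the toy vector are made up for illustration and are not part of the course materials.

```r
# Hypothetical helper: R-squared from actual values and predictions
calc_r_squared <- function(y, prediction) {
  rss   <- sum((y - prediction)^2)  # residual sum of squares
  sstot <- sum((y - mean(y))^2)     # total sum of squares
  1 - rss / sstot
}

# Toy data, invented for illustration
y <- c(1, 2, 3, 4, 5)

calc_r_squared(y, y)                        # perfect predictions give 1
calc_r_squared(y, rep(mean(y), length(y)))  # always guessing the mean gives 0
```

The second call shows why a value near 0 means "no better than guessing the average": when every prediction is the mean, RSS equals $SS_{Tot}$.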

Calculate $R^2$ of the House Price Model: RSS

Calculate error

err <- houseprices$prediction - houseprices$price

Square it and take the sum

rss <- sum(err^2)

price: column of actual sale prices (in thousands)
prediction: column of predicted sale prices (in thousands)
$RSS \approx$ 136138

Calculate $R^2$ of the House Price Model: $SS_{Tot}$

Take the difference of prices from the mean price

toterr <- houseprices$price - mean(houseprices$price)

Square it and take the sum

sstot <- sum(toterr^2)

$RSS \approx$ 136138
$SS_{Tot} \approx$ 713615

Calculate $R^2$ of the House Price Model

(r_squared <- 1 - (rss/sstot) )

0.8092278

$RSS \approx$ 136138
$SS_{Tot} \approx$ 713615
$R^2 \approx$ 0.809
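
As a check, plugging the rounded sums into the formula reproduces the same value:

$$ R^2 \approx 1 - \frac{136138}{713615} \approx 1 - 0.1908 = 0.8092 $$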

Reading $R^2$ from the lm() model

# From summary()
summary(hmodel)

...
Residual standard error: 60.66 on 37 degrees of freedom
Multiple R-squared:  0.8092, Adjusted R-squared:  0.7989 
F-statistic: 78.47 on 2 and 37 DF,  p-value: 4.893e-14

summary(hmodel)$r.squared

0.8092278

# From glance() (in the broom package)
library(broom)
glance(hmodel)$r.squared

0.8092278
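
To see that the reported value is the same quantity as the hand calculation, here is a sketch on R's built-in `mtcars` data. It is a stand-in, since the `houseprices` data is not reproduced here, and the formula `mpg ~ wt + hp` is an arbitrary choice for illustration.

```r
# Stand-in model on built-in data; not the course's house price model
model <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared by hand, following the RSS / SS_Tot definition
prediction <- predict(model, newdata = mtcars)
rss   <- sum((mtcars$mpg - prediction)^2)
sstot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / sstot

# R-squared as reported by summary()
r2_summary <- summary(model)$r.squared

all.equal(r2_manual, r2_summary)  # TRUE
```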

Correlation and $R^2$

(rho <- cor(houseprices$prediction, houseprices$price))

0.8995709

rho^2

0.8092278

$\rho$ = cor(prediction, price) = 0.8995709
$\rho^2$ = 0.8092278 = $R^2$

Correlation and $R^2$

$\rho^2 = R^2$ holds for models that minimize squared error:
- Linear regression
- GAM regression
- Tree-based algorithms that minimize squared error

It holds on the training data, but NOT on future application data.
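
The difference between training data and new data can be seen in a small simulation. This is a sketch under stated assumptions: the data is simulated, and the seed, sample size, split, and the helper name `compare` are all invented for illustration.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only

# Simulated data, invented for illustration
n <- 200
d <- data.frame(x = runif(n))
d$y <- 3 * d$x + rnorm(n)

# Random split into training and holdout rows
is_train <- runif(n) < 0.5
train <- d[is_train, ]
test  <- d[!is_train, ]

model <- lm(y ~ x, data = train)

# Hypothetical helper: both quantities for a given data set
compare <- function(data) {
  prediction <- predict(model, newdata = data)
  rss   <- sum((data$y - prediction)^2)
  sstot <- sum((data$y - mean(data$y))^2)
  c(r_squared   = 1 - rss / sstot,
    rho_squared = cor(prediction, data$y)^2)
}

compare(train)  # the two values agree
compare(test)   # the two values generally differ
```

On data the model was not fit to, it is therefore safer to compute $R^2$ from RSS and $SS_{Tot}$ directly than to square the correlation.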

Let's practice!
