How linear regression works

Intermediate Regression in R

Richie Cotton

Data Evangelist at DataCamp

The standard simple linear regression plot

A scatter plot with a linear regression trend line.

Visualizing residuals

The scatter plot with the linear regression trend line, plus line segments from the points to the trend line, representing residuals.
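Each line segment in that plot has a length equal to a residual: the observed response minus the value the trend line predicts. A minimal sketch of that calculation, assuming R's built-in cars dataset as a stand-in (it is not the data shown on the slide):

```r
# Assumption: the built-in `cars` dataset stands in for the slide's data.
model <- lm(dist ~ speed, data = cars)

# Predicted response for each observation, i.e. the point on the trend line
fitted_values <- fitted(model)

# Residual = observed value minus predicted value
manual_residuals <- cars$dist - fitted_values

head(manual_residuals)
```

The manual calculation should agree with what residuals(model) returns.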

A metric for the best fit

The simplest idea (which doesn't work)

  • Take the sum of all the residuals.
  • Some residuals are negative, so they cancel out the positive ones and the sum can be close to zero even for a poor fit.

The next simplest idea (which does work)

  • Take the square of each residual, and add up those squares.
  • This is called the sum of squares.
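The two ideas above can be compared on a small example. This is a sketch using hypothetical toy data (the x and y values are made up for illustration): a flat line at the mean of y fits badly, yet its residuals sum to roughly zero, while the sum of squares correctly favors the better line.

```r
# Hypothetical toy data, invented for this illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Candidate 1: a line that follows the trend (intercept 0, slope 2)
resid_good <- y - (0 + 2 * x)

# Candidate 2: a flat line at the mean of y, which ignores x entirely
resid_flat <- y - mean(y)

# Plain sums: the flat line looks "better" because its residuals cancel out
sum_resid_good <- sum(resid_good)
sum_resid_flat <- sum(resid_flat)

# Sums of squares: the line that follows the trend has the smaller value
ss_good <- sum(resid_good ^ 2)
ss_flat <- sum(resid_flat ^ 2)
```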

A detour into numerical optimization

A line plot of a quadratic equation

xy_data <- tibble(
  x = seq(-4, 5, 0.1),
  y = x ^ 2 - x + 10
)

ggplot(xy_data, aes(x, y)) + 
  geom_line()

Using calculus to solve the equation

$y = x ^ 2 - x + 10$

$\frac{dy}{dx} = 2 x - 1$

$0 = 2 x - 1$

$x = 0.5$

$y = 0.5 ^ 2 - 0.5 + 10 = 9.75$

  • Not all equations can be solved like this.
  • You can let R figure it out.

optim()

calc_quadratic <- function(x) {
  x ^ 2 - x + 10
}
optim(par = 3, fn = calc_quadratic)
$par
[1] 0.4998047

$value
[1] 9.75

$counts
function gradient 
      30       NA 

$convergence
[1] 0

$message
NULL
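When run with a single parameter, optim() with its default Nelder-Mead method warns that one-dimensional optimization is unreliable. As a cross-check, here is a sketch using optimize(), base R's dedicated one-dimensional minimizer. The search interval c(-4, 5) is an assumption, chosen to match the range plotted earlier.

```r
calc_quadratic <- function(x) {
  x ^ 2 - x + 10
}

# optimize() searches a bounded interval rather than starting from a guess.
# Assumption: the interval matches the x range used in the line plot.
result <- optimize(calc_quadratic, interval = c(-4, 5))

result$minimum    # the x value at the minimum
result$objective  # the function value at that x
```

Both functions should land close to the calculus answer of x = 0.5 and y = 9.75.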

Slight refinements

calc_quadratic <- function(coeffs) {
  x <- coeffs[1]
  x ^ 2 - x + 10
}
optim(par = c(x = 3), fn = calc_quadratic)
$par
        x 
0.4998047 

$value
[1] 9.75

$counts
function gradient 
      30       NA 

$convergence
[1] 0

$message
NULL
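Passing all the inputs as one vector matters once there is more than one value to optimize, because optim() only varies its par argument. A sketch with a hypothetical two-variable function (calc_bowl is invented for illustration and has its minimum at x = 1, y = -2):

```r
# Hypothetical function of two variables, minimized at x = 1, y = -2
calc_bowl <- function(coeffs) {
  x <- coeffs[1]
  y <- coeffs[2]
  (x - 1) ^ 2 + (y + 2) ^ 2
}

# The named vector gives starting guesses for both variables at once
result <- optim(par = c(x = 0, y = 0), fn = calc_bowl)

result$par
```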

A linear regression algorithm

  1. Define a function to calculate the sum of squares metric.
  2. Call optim() to find coefficients that minimize this function.

calc_sum_of_squares <- function(coeffs) {
  intercept <- coeffs[1]
  slope <- coeffs[2]
  # More calculation!
}
optim(
  par = ???,
  fn = ???
)
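One possible way to complete the two steps is sketched here, not necessarily the solution the course intends. It assumes the built-in cars dataset (dist as the response, speed as the explanatory variable) and starting guesses of zero for both coefficients; neither choice comes from the slides.

```r
# Assumption: the built-in `cars` dataset, with dist modeled against speed
calc_sum_of_squares <- function(coeffs) {
  intercept <- coeffs[1]
  slope <- coeffs[2]
  # Predicted response on the candidate line
  predicted <- intercept + slope * cars$speed
  # Sum of squared residuals for this candidate line
  sum((cars$dist - predicted) ^ 2)
}

# Assumption: starting guesses of zero for both coefficients
fit <- optim(
  par = c(intercept = 0, slope = 0),
  fn = calc_sum_of_squares
)

# Reference coefficients from lm() for comparison
lm_coeffs <- coef(lm(dist ~ speed, data = cars))

fit$par
lm_coeffs
```

Because optim() is a general-purpose numerical search, its coefficients should be close to those from lm() but not identical.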

Let's practice!
