A tale of two variables

Introduction to Regression in R

Richie Cotton

Data Evangelist at DataCamp

Swedish motor insurance data

  • Each row represents one geographic region in Sweden.
  • There are 63 rows.
n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     ...                ...

Descriptive statistics

library(dplyr)
swedish_motor_insurance %>%
  summarize(across(everything(), mean))
# A tibble: 1 x 2
  n_claims total_payment_sek
     <dbl>             <dbl>
1     22.9              98.2
swedish_motor_insurance %>% 
  summarize(
    correlation = cor(n_claims, total_payment_sek)
  )
# A tibble: 1 x 1
  correlation
        <dbl>
1       0.881

What is regression?

  • Statistical models to explore the relationship between a response variable and some explanatory variables.
  • Given values of explanatory variables, you can predict the values of the response variable.
n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     200                ???
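
As a rough sketch of how the missing payment for 200 claims might be predicted, the snippet below fits a simple linear regression with base R's lm() and then calls predict(). It uses only the five rows shown above rather than the full 63-row dataset, so the data frame name five_regions is hypothetical and the resulting number is illustrative only. Note that 200 claims is beyond the largest value shown, so the prediction is an extrapolation.

```r
# Only the five rows shown on the slide (an assumption for illustration;
# the full dataset has 63 rows).
five_regions <- data.frame(
  n_claims = c(108, 19, 13, 124, 40),
  total_payment_sek = c(392.5, 46.2, 15.7, 422.2, 119.4)
)

# Response variable on the left of the formula, explanatory variable on the right.
model <- lm(total_payment_sek ~ n_claims, data = five_regions)

# Predict the total payment for a region with 200 claims.
new_region <- data.frame(n_claims = 200)
predicted_payment <- predict(model, newdata = new_region)
print(predicted_payment)
```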

Jargon

Response variable (a.k.a. dependent variable)

The variable that you want to predict.

Explanatory variables (a.k.a. independent variables)

The variables that explain how the response variable will change.


Linear regression and logistic regression

Linear regression

  • The response variable is numeric.

Logistic regression

  • The response variable is logical.

Simple linear/logistic regression

  • There is only one explanatory variable.
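
The distinction above can be sketched in code: the type of the response variable decides which model is fitted. The example below is a minimal sketch using R's built-in mtcars dataset as a stand-in (it is not the course's data), and the is_manual column is a hypothetical logical variable derived for illustration. Both models have a single explanatory variable, wt, so both are "simple" regressions.

```r
# mtcars ships with R; it is a stand-in here, not the course's data.
car_data <- mtcars

# Hypothetical logical response: TRUE when the car has a manual gearbox.
car_data$is_manual <- car_data$am == 1

# Numeric response (mpg): simple linear regression.
linear_model <- lm(mpg ~ wt, data = car_data)

# Logical response (is_manual): simple logistic regression.
logistic_model <- glm(is_manual ~ wt, data = car_data, family = binomial)

print(coef(linear_model))
print(coef(logistic_model))
```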

Visualizing pairs of variables

library(ggplot2)

ggplot(
  swedish_motor_insurance, 
  aes(n_claims, total_payment_sek)
) +
  geom_point()

A scatter plot of the total payment versus the number of claims. The payment increases as the number of claims increases.


Adding a linear trend line

library(ggplot2)

ggplot(
  swedish_motor_insurance, 
  aes(n_claims, total_payment_sek)
) +
  geom_point() +
  geom_smooth(
    method = "lm", 
    se = FALSE
  )

The same scatter plot seen previously, now with an additional trend line calculated via linear regression. It provides a reasonable fit to the data.
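
With method = "lm", geom_smooth() fits a linear model of y on x, so the trend line is described by an intercept and a slope. The sketch below recovers those two numbers with base R's lm() and coef(). It uses only the five rows shown earlier, so the data frame name five_regions is hypothetical and the coefficients will differ from those of the line drawn for the full 63-row dataset.

```r
# Only the five rows shown on the earlier slide (an assumption for
# illustration; the plotted line uses all 63 rows).
five_regions <- data.frame(
  n_claims = c(108, 19, 13, 124, 40),
  total_payment_sek = c(392.5, 46.2, 15.7, 422.2, 119.4)
)

# Fit the straight line: total_payment_sek = intercept + slope * n_claims.
trend <- lm(total_payment_sek ~ n_claims, data = five_regions)

# Extract the two coefficients that define the line.
trend_coefs <- coef(trend)
intercept <- trend_coefs[["(Intercept)"]]
slope <- trend_coefs[["n_claims"]]

print(c(intercept = intercept, slope = slope))
```

A positive slope matches what the scatter plot shows: the payment increases as the number of claims increases.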


Course flow

Chapter 1

Visualizing and fitting linear regression models.

Chapter 2

Making predictions from linear regression models and understanding model coefficients.

Chapter 3

Assessing the quality of the linear regression model.

Chapter 4

The same topics again, but with logistic regression models.


Let's practice!
