A tale of two variables

Introduction to Regression in R

Richie Cotton

Data Evangelist at DataCamp

Swedish motor insurance data

  • Each row represents one geographic region in Sweden.
  • There are 63 rows.
n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     ...                ...

Descriptive statistics

library(dplyr)
swedish_motor_insurance %>% 
  summarize_all(mean)
# A tibble: 1 x 2
  n_claims total_payment_sek
     <dbl>             <dbl>
1     22.9              98.2
swedish_motor_insurance %>% 
  summarize(
    correlation = cor(n_claims, total_payment_sek)
  )
# A tibble: 1 x 1
  correlation
        <dbl>
1       0.881

What is regression?

  • Statistical models to explore the relationship between a response variable and some explanatory variables.
  • Given values of explanatory variables, you can predict the values of the response variable.
n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     200                ???
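
As a rough sketch of how the ??? could be filled in, the code below fits a simple linear regression and predicts the total payment for a region with 200 claims. It assumes a data frame built from only the five rows shown on this slide (the name insurance_sample is made up for illustration), so the result will differ from a model fitted to all 63 rows.

```r
# Assumption: only the five rows displayed on the slide, not the full dataset.
insurance_sample <- data.frame(
  n_claims = c(108, 19, 13, 124, 40),
  total_payment_sek = c(392.5, 46.2, 15.7, 422.2, 119.4)
)

# Response on the left of the formula, explanatory variable on the right.
mdl <- lm(total_payment_sek ~ n_claims, data = insurance_sample)

# Predict the response for a new value of the explanatory variable.
new_region <- data.frame(n_claims = 200)
predicted_payment <- predict(mdl, newdata = new_region)
print(predicted_payment)
```

Because 200 claims lies outside the range of the observed data, this is an extrapolation and should be treated with caution.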

Jargon

Response variable (a.k.a. dependent variable)

The variable that you want to predict.

Explanatory variables (a.k.a. independent variables)

The variables that explain how the response variable will change.


Linear regression and logistic regression

Linear regression

  • The response variable is numeric.

Logistic regression

  • The response variable is logical.

Simple linear/logistic regression

  • There is only one explanatory variable.
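
The distinction above might look like this in code. This is a minimal sketch that uses R's built-in mtcars dataset as a stand-in (it is not the course's data): mpg is a numeric response, and am is a 0/1 indicator that plays the role of a logical response. Each model has a single explanatory variable, wt, so both are "simple" regressions.

```r
# Assumption: mtcars is used purely for illustration, not the course dataset.

# Simple linear regression: numeric response (mpg), one explanatory variable.
linear_mdl <- lm(mpg ~ wt, data = mtcars)

# Simple logistic regression: 0/1 response (am), one explanatory variable.
# family = binomial is what makes glm() fit a logistic regression.
logistic_mdl <- glm(am ~ wt, data = mtcars, family = binomial)

print(coef(linear_mdl))
print(coef(logistic_mdl))
```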

Visualizing pairs of variables

library(ggplot2)

ggplot(
  swedish_motor_insurance, 
  aes(n_claims, total_payment_sek)
) +
  geom_point()

A scatter plot of the total payment versus the number of claims. The payment increases as the number of claims increases.


Adding a linear trend line

library(ggplot2)

ggplot(
  swedish_motor_insurance, 
  aes(n_claims, total_payment_sek)
) +
  geom_point() +
  geom_smooth(
    method = "lm", 
    se = FALSE
  )

The same scatter plot seen previously, now with an additional trend line calculated via linear regression. It provides a reasonable fit to the data.
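
With method = "lm", geom_smooth() fits a linear model of y on x, so the trend line is described by an intercept and a slope. The sketch below recovers those two numbers with lm(), assuming a data frame made from only the five rows shown on the earlier slide (the name insurance_sample is made up for illustration); the values for the full 63-row dataset will differ.

```r
# Assumption: only the five rows displayed on the slide, not the full dataset.
insurance_sample <- data.frame(
  n_claims = c(108, 19, 13, 124, 40),
  total_payment_sek = c(392.5, 46.2, 15.7, 422.2, 119.4)
)

# Fit the same kind of model that a linear trend line represents.
mdl <- lm(total_payment_sek ~ n_claims, data = insurance_sample)

# The first coefficient is the intercept; the second is the slope,
# i.e. the change in total payment per additional claim.
trend_coefs <- coef(mdl)
print(trend_coefs)
```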


Course flow

Chapter 1

Visualizing and fitting linear regression models.

Chapter 2

Making predictions from linear regression models and understanding model coefficients.

Chapter 3

Assessing the quality of the linear regression model.

Chapter 4

The same workflow again, but with logistic regression models.


Let's practice!
