A tale of two variables

Introduction to Regression with statsmodels in Python

Maarten Van den Broeck

Content Developer at DataCamp

Swedish motor insurance data

  • Each row represents one geographic region in Sweden.
  • There are 63 rows.
n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     ...                ...


Descriptive statistics

import pandas as pd

# swedish_motor_insurance is assumed to be a pandas DataFrame that is already loaded.
print(swedish_motor_insurance.mean())
n_claims             22.904762
total_payment_sek    98.187302
dtype: float64
print(swedish_motor_insurance['n_claims'].corr(swedish_motor_insurance['total_payment_sek']))
0.9128782350234068
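
To make the correlation figure less of a black box, here is a minimal sketch that computes the Pearson correlation by hand and compares it with `Series.corr()`. It assumes only the five rows displayed on the insurance data slide, not the full 63-row dataset, so its result will differ from the value printed above. The names `sample`, `manual_corr`, and `pandas_corr` are illustrative.

```python
import numpy as np
import pandas as pd

# Only the five rows shown on the slide, not the full 63-row dataset.
sample = pd.DataFrame({
    "n_claims": [108, 19, 13, 124, 40],
    "total_payment_sek": [392.5, 46.2, 15.7, 422.2, 119.4],
})

x = sample["n_claims"]
y = sample["total_payment_sek"]

# Deviations of each variable from its own mean.
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson correlation: shared variation divided by the product of the spreads.
manual_corr = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas uses the Pearson method by default.
pandas_corr = x.corr(y)

print(manual_corr, pandas_corr)
```

Both numbers should agree up to floating-point rounding. A value close to 1 indicates a strong positive linear relationship.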

What is regression?

  • Statistical models to explore the relationship between a response variable and some explanatory variables.
  • Given values of explanatory variables, you can predict the values of the response variable.
n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     200                ???
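
As a rough sketch of how the `???` could be filled in, the example below fits a simple linear regression with the statsmodels formula API and predicts the payment for 200 claims. It assumes only the five rows shown on the insurance data slide, so the prediction is illustrative and will not match a model fitted on all 63 regions. The names `sample`, `model`, and `new_data` are made up for this example.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Only the five rows shown on the slide, not the full 63-row dataset.
sample = pd.DataFrame({
    "n_claims": [108, 19, 13, 124, 40],
    "total_payment_sek": [392.5, 46.2, 15.7, 422.2, 119.4],
})

# Formula syntax: response on the left of "~", explanatory variable on the right.
model = ols("total_payment_sek ~ n_claims", data=sample).fit()

# New explanatory values must use the same column name as in the formula.
new_data = pd.DataFrame({"n_claims": [200]})
prediction = model.predict(new_data)

print(prediction)
```

Note that 200 claims lies outside the range of the five observed values, so this is an extrapolation and should be read with extra caution.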

Jargon

Response variable (a.k.a. dependent variable)

The variable that you want to predict.

Explanatory variables (a.k.a. independent variables)

The variables that explain how the response variable will change.

Linear regression and logistic regression

Linear regression

  • The response variable is numeric.

Logistic regression

  • The response variable is logical (True or False).

Simple linear/logistic regression

  • There is only one explanatory variable.
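
The rule of thumb above can be sketched as a check on the response column's dtype. `suggest_regression_type` is a hypothetical helper written for this illustration, not part of pandas or statsmodels, and the `made_claim` column is invented.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def suggest_regression_type(response: pd.Series) -> str:
    """Hypothetical helper: suggest a model family from the response's dtype."""
    # Check for booleans first, because pandas also treats bool as numeric.
    if is_bool_dtype(response):
        return "logistic"
    if is_numeric_dtype(response):
        return "linear"
    return "unsupported"


# Numeric response taken from the slide's example rows.
payments = pd.Series([392.5, 46.2, 15.7], name="total_payment_sek")
# Invented logical response, purely for illustration.
made_claim = pd.Series([True, False, True], name="made_claim")

print(suggest_regression_type(payments))
print(suggest_regression_type(made_claim))
```

A dtype check is only a starting point: a column of 0s and 1s stored as integers is numeric to pandas but may still represent a logical response.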

Visualizing pairs of variables

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="n_claims",
                y="total_payment_sek",    
                data=swedish_motor_insurance)

plt.show()

A scatter plot of the total payment versus the number of claims. The payment increases as the number of claims increases.

Adding a linear trend line

sns.regplot(x="n_claims",
            y="total_payment_sek",
            data=swedish_motor_insurance,
            ci=None)

plt.show()

The same scatter plot seen previously, now with an additional trend line calculated via linear regression. It provides a reasonable fit to the data.
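
For a sense of where such a trend line comes from, the sketch below computes a least-squares slope and intercept by hand and checks them against `np.polyfit`. It assumes that the trend line is an ordinary least-squares fit (the default behavior of `regplot`, as far as this sketch is concerned) and uses only the five rows shown on the insurance data slide, so the numbers will differ from the line drawn for the full dataset.

```python
import numpy as np

# Only the five rows shown on the slide, not the full 63-row dataset.
n_claims = np.array([108, 19, 13, 124, 40], dtype=float)
total_payment_sek = np.array([392.5, 46.2, 15.7, 422.2, 119.4])

# Deviations from the means.
x_dev = n_claims - n_claims.mean()
y_dev = total_payment_sek - total_payment_sek.mean()

# Least-squares slope: co-variation of x and y divided by the variation of x.
slope = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
# The fitted line passes through the point of means.
intercept = total_payment_sek.mean() - slope * n_claims.mean()

# A degree-1 polynomial fit gives the same line (coefficients, highest power first).
polyfit_slope, polyfit_intercept = np.polyfit(n_claims, total_payment_sek, deg=1)

print(slope, intercept)
```

The slope is the expected change in total payment for each additional claim, and the intercept is the line's value at zero claims.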

Course flow

Chapter 1

Visualizing and fitting linear regression models.

Chapter 2

Making predictions from linear regression models and understanding model coefficients.

Chapter 3

Assessing the quality of the linear regression model.

Chapter 4

The same workflow again, but with logistic regression models.

Python packages for regression

statsmodels

  • Optimized for insight (focus in this course)

scikit-learn

  • Optimized for prediction (focus in other DataCamp courses)

Let's practice!
