Regression to the mean

Introduction to Regression with statsmodels in Python

Maarten Van den Broeck

Content Developer at DataCamp

The concept

  • Response value = fitted value + residual
  • "The stuff you explained" + "the stuff you couldn't explain"
  • Residuals exist due to problems in the model and fundamental randomness
  • Extreme cases are often due to randomness
  • Regression to the mean means extreme cases don't persist over time
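
The last two bullets can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: it assumes each observed score is a stable underlying ability plus independent random noise. The names `ability`, `first_score`, and `second_score`, and the choice of standard normal distributions, are illustrative assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

n = 10_000
# Assumption: every individual has a stable ability, and each measurement
# adds fresh, independent noise of the same size.
ability = [random.gauss(0, 1) for _ in range(n)]
first_score = [a + random.gauss(0, 1) for a in ability]
second_score = [a + random.gauss(0, 1) for a in ability]

# Select the most extreme 10% on the first measurement.
cutoff = sorted(first_score)[int(0.9 * n)]
top = [i for i in range(n) if first_score[i] >= cutoff]

top_first_mean = statistics.mean(first_score[i] for i in top)
top_second_mean = statistics.mean(second_score[i] for i in top)

print(f"Top 10% on first measurement:   {top_first_mean:.2f}")
print(f"Same group, second measurement: {top_second_mean:.2f}")
```

Under these assumptions the selected group still scores above average the second time, but less extremely than the first, because the noise that helped put them in the top group is not repeated.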

Pearson's father-son dataset

  • 1078 father/son pairs
  • Do tall fathers have tall sons?

father_height_cm  son_height_cm
           165.2          151.8
           160.7          160.6
           165.0          160.9
           167.0          159.5
           155.3          163.3
             ...            ...

1 Adapted from https://www.rdocumentation.org/packages/UsingR/topics/father.son

Scatter plot

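import matplotlib.pyplot as plt
import seaborn as sns

# father_son: DataFrame with columns father_height_cm and son_height_cm
fig = plt.figure()
sns.scatterplot(x="father_height_cm",
                y="son_height_cm",
                data=father_son)
# Reference line where son and father are the same height
plt.axline(xy1=(150, 150),
           slope=1,
           linewidth=2,
           color="green")
# Equal scaling on both axes so the slope-1 line sits at 45 degrees
plt.axis("equal")
plt.show()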

A scatter plot of sons' heights versus fathers' heights, with a line where the father and son would be the same height. As fathers get taller, so do the sons.

Adding a regression line

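fig = plt.figure()

# Scatter plot plus fitted linear trend line (no confidence band)
sns.regplot(x="father_height_cm",
            y="son_height_cm",
            data=father_son,
            ci=None,
            line_kws={"color": "black"})

# Reference line where son and father are the same height
plt.axline(xy1=(150, 150),
           slope=1,
           linewidth=2,
           color="green")

plt.axis("equal")
plt.show()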

The scatter plot of sons' heights versus fathers' heights, annotated with a linear trend line. The trend line is less steep than the line where fathers and sons would be the same height.

Running a regression

from statsmodels.formula.api import ols

# Fit son's height as a linear function of father's height
mdl_son_vs_father = ols("son_height_cm ~ father_height_cm",
                        data=father_son).fit()
print(mdl_son_vs_father.params)

Intercept           86.071975
father_height_cm     0.514093
dtype: float64
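
To connect these coefficients back to "response value = fitted value + residual", the sketch below applies them by hand to the five sample rows shown on the dataset slide. It is an illustration only: the coefficients are copied from the printed output rather than refitted, and the variable names (`sample_rows`, `decomposition`) are hypothetical.

```python
# Coefficients as printed by mdl_son_vs_father.params (copied, not refitted).
intercept = 86.071975
slope = 0.514093

# The five sample rows from the dataset slide: (father_cm, son_cm).
sample_rows = [
    (165.2, 151.8),
    (160.7, 160.6),
    (165.0, 160.9),
    (167.0, 159.5),
    (155.3, 163.3),
]

decomposition = []
for father_cm, son_cm in sample_rows:
    fitted = intercept + slope * father_cm  # "the stuff you explained"
    residual = son_cm - fitted              # "the stuff you couldn't explain"
    decomposition.append((son_cm, fitted, residual))
    print(f"son={son_cm:6.1f}  fitted={fitted:6.1f}  residual={residual:+6.1f}")
```

By construction, each fitted value and its residual add back up to the observed son's height.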

Making predictions

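import pandas as pd

# Explanatory data for a very tall father (190 cm)
really_tall_father = pd.DataFrame(
  {"father_height_cm": [190]})

mdl_son_vs_father.predict(
  really_tall_father)

183.7

# Explanatory data for a very short father (150 cm)
really_short_father = pd.DataFrame(
  {"father_height_cm": [150]})

mdl_son_vs_father.predict(
  really_short_father)

163.2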
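
Both predictions can be reproduced by hand from the coefficients printed earlier, which makes the pull toward the middle explicit. This is a minimal sketch: `predict_son_height` is a hypothetical helper, not a statsmodels function, and the coefficients are copied from the printed output.

```python
# Coefficients copied from the printed model parameters.
intercept = 86.071975
slope = 0.514093


def predict_son_height(father_cm):
    """Hypothetical helper: intercept + slope * father's height."""
    return intercept + slope * father_cm


tall = predict_son_height(190)
short = predict_son_height(150)
print(round(tall, 1))   # 183.7, shorter than the 190 cm father
print(round(short, 1))  # 163.2, taller than the 150 cm father

# Height at which the fitted line crosses the equal-height line (son == father).
crossover_cm = intercept / (1 - slope)
print(round(crossover_cm, 1))
```

Because the slope is below 1, a father taller than the crossover height is predicted to have a son shorter than himself, and a father below it is predicted to have a son taller than himself.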

Let's practice!
