Outliers, leverage, and influence

Introduction to Regression with statsmodels in Python

Maarten Van den Broeck

Content Developer at DataCamp

Roach dataset

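# .copy() gives an independent DataFrame, so the columns added on later
# slides do not trigger pandas' SettingWithCopyWarning
roach = fish[fish['species'] == "Roach"].copy()
print(roach.head())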
   species  mass_g  length_cm
35   Roach    40.0       12.9
36   Roach    69.0       16.5
37   Roach    78.0       17.5
38   Roach    87.0       18.2
39   Roach   120.0       18.6

A common roach

Which points are outliers?

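import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot with a fitted trend line; ci=None hides the confidence band
sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None)
plt.show()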

A scatter plot of roach masses versus their lengths with a trend line. Most of the points closely follow the trend line.

Extreme explanatory values

roach["extreme_l"] = ((roach["length_cm"] < 15) |
                    (roach["length_cm"] > 26))

fig = plt.figure()
sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None)

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="extreme_l",
                data=roach)
plt.show()

A scatter plot of roach masses versus their lengths, with a trend line. Most points are colored blue, but one very short roach and one very long roach have points colored orange.

Response values away from the regression line

roach["extreme_m"] = roach["mass_g"] < 1

fig = plt.figure()
sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None)

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="extreme_l",
                style="extreme_m",
                data=roach)
plt.show()

A scatter plot of roach masses versus their lengths, with a trend line. Most points are colored blue, but one very short roach and one very long roach have points colored orange. Most points are circles, but one point representing a fish with an apparent mass of zero is a cross.

Leverage and influence

Leverage is a measure of how extreme the explanatory variable values are.

Influence measures how much the model would change if you left the observation out of the dataset when modeling.
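
To make the leverage definition concrete, here is a minimal sketch of how it can be computed by hand for a model with one explanatory variable. The lengths are made-up illustration values, not the roach data, and `simple_leverage` is a hypothetical helper name. It uses the standard one-predictor formula, 1/n plus the squared distance from the mean divided by the total squared distance, and cross-checks it against the diagonal of the hat matrix.

```python
import numpy as np

# Made-up lengths (cm) for illustration only, not the roach data.
# The first and last values sit far from the rest.
length_cm = np.array([12.9, 16.5, 17.5, 18.2, 18.6, 19.0, 20.5, 29.5])


def simple_leverage(x):
    """Leverage for a one-predictor linear model with an intercept."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return 1.0 / x.size + centered**2 / np.sum(centered**2)


leverage = simple_leverage(length_cm)

# Cross-check with the general definition: the diagonal of X (X'X)^-1 X'.
X = np.column_stack([np.ones_like(length_cm), length_cm])
hat_diag = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

for x_i, h_i in zip(length_cm, leverage):
    print(f"{x_i:5.1f} cm  leverage = {h_i:.3f}")
```

Leverage depends only on the explanatory values, so the response never appears in the calculation. The values always sum to the number of model coefficients (two here: intercept and slope).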

A person turning a wrench

.get_influence() and .summary_frame()

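from statsmodels.formula.api import ols

mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()

# One row of influence statistics per observation
summary_roach = mdl_roach.get_influence().summary_frame()
# hat_diag holds the leverage values
roach["leverage"] = summary_roach["hat_diag"]
print(roach.head())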
   species  mass_g  length_cm  leverage
35   Roach    40.0       12.9  0.313729
36   Roach    69.0       16.5  0.125538
37   Roach    78.0       17.5  0.093487
38   Roach    87.0       18.2  0.076283
39   Roach   120.0       18.6  0.068387

Cook's distance

Cook's distance is the most common measure of influence.

roach["cooks_dist"] = summary_roach["cooks_d"]
print(roach.head())
   species  mass_g  length_cm  leverage  cooks_dist
35   Roach    40.0       12.9  0.313729    1.074015
36   Roach    69.0       16.5  0.125538    0.010429
37   Roach    78.0       17.5  0.093487    0.000020
38   Roach    87.0       18.2  0.076283    0.001980
39   Roach   120.0       18.6  0.068387    0.006610
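
As a rough illustration of what Cook's distance measures, the sketch below computes it two ways on made-up data (not the roach data): with the usual closed form built from residuals and leverage, and directly from the "leave the observation out and refit" idea. All variable names and the helper `cooks_by_refit` are assumptions for this example.

```python
import numpy as np

# Made-up data for illustration only: the last point is long and off-trend.
x = np.array([13.0, 16.0, 17.5, 18.0, 19.0, 20.5, 21.0, 24.0, 29.5])
y = np.array([45.0, 70.0, 80.0, 90.0, 110.0, 145.0, 160.0, 280.0, 300.0])

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares coefficients
resid = y - X @ beta
s2 = resid @ resid / (n - p)                     # residual variance estimate
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverage of each point

# Closed form: a large residual and high leverage both raise the distance.
cooks_d = resid**2 / (p * s2) * h / (1.0 - h) ** 2


def cooks_by_refit(i):
    """Cook's distance for row i, computed by refitting without row i."""
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    shift = X @ (beta - beta_i)                  # change in every fitted value
    return shift @ shift / (p * s2)


cooks_refit = np.array([cooks_by_refit(i) for i in range(n)])
print(np.round(cooks_d, 4))
print(np.round(cooks_refit, 4))
```

The two arrays agree, which shows why a single fitted model is enough to compute influence: no refitting is needed in practice.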

Most influential roaches

print(roach.sort_values("cooks_dist", ascending=False))
   species  mass_g  length_cm  leverage  cooks_dist
35   Roach    40.0       12.9  0.313729    1.074015 # really short roach
54   Roach   390.0       29.5  0.394740    0.365782 # really long roach
40   Roach     0.0       19.0  0.061897    0.311852 # roach with zero mass
52   Roach   290.0       24.0  0.099488    0.150064
51   Roach   180.0       23.6  0.088391    0.061209
..     ...     ...        ...       ...         ...
43   Roach   150.0       20.4  0.050264    0.000257
44   Roach   145.0       20.5  0.050092    0.000256
42   Roach   120.0       19.4  0.056815    0.000199
47   Roach   160.0       21.1  0.050910    0.000137
37   Roach    78.0       17.5  0.093487    0.000020

Removing the most influential roach

roach_not_short = roach[roach["length_cm"] != 12.9]

sns.regplot(x="length_cm",
            y="mass_g",
            data=roach,
            ci=None,
            line_kws={"color": "green"})

sns.regplot(x="length_cm",
            y="mass_g",
            data=roach_not_short,
            ci=None,
            line_kws={"color": "red"})
plt.show()

A scatter plot of roach masses versus their lengths, with two trend lines. One trend line uses all the data, and another excludes the shortest roach. That second trend line has a noticeably steeper gradient.
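
The change in gradient can also be checked numerically rather than by eye. The following is a minimal sketch on made-up data (not the roach data), where the shortest fish is heavier than the trend of the others would suggest. The helper `fit_line` is a hypothetical name, and `np.polyfit` stands in for the `ols()` fit used in the slides.

```python
import numpy as np

# Made-up data for illustration only: the first fish is very short
# and heavier than the trend of the remaining fish suggests.
x = np.array([12.9, 16.5, 17.5, 18.2, 18.6, 19.4, 20.4, 21.1, 23.6, 24.0])
y = np.array([60.0, 69.0, 78.0, 87.0, 95.0, 110.0, 125.0, 135.0, 175.0, 182.0])


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares straight line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


# Fit once with every point, once without the shortest fish.
slope_all, intercept_all = fit_line(x, y)

keep = x != x.min()
slope_trim, intercept_trim = fit_line(x[keep], y[keep])

print(f"slope with all points:      {slope_all:.2f} g/cm")
print(f"slope without shortest one: {slope_trim:.2f} g/cm")
```

In this made-up example, dropping the short fish makes the line steeper, mirroring the behaviour described for the roach plot.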

Let's practice!
