Transforming variables

Introduction to Regression with statsmodels in Python

Maarten Van den Broeck

Content Developer at DataCamp

Perch dataset

# .copy() avoids a SettingWithCopyWarning when columns are added to perch later
perch = fish[fish["species"] == "Perch"].copy()
print(perch.head())
   species  mass_g  length_cm
55   Perch     5.9        7.5
56   Perch    32.0       12.5
57   Perch    40.0       13.8
58   Perch    51.5       15.0
59   Perch    70.0       15.7

European perch, _Perca fluviatilis_

It's not a linear relationship

import matplotlib.pyplot as plt
import seaborn as sns

sns.regplot(x="length_cm",
            y="mass_g",
            data=perch,
            ci=None)

plt.show()

A scatter plot of perch masses versus their lengths, with a trend line. The perches get heavier faster than linearly as they get longer, resulting in an upward curve.

Bream vs. perch

A common bream. Bream are quite flat.

A European perch. Perch are quite round.

Plotting mass vs. length cubed

perch["length_cm_cubed"] = perch["length_cm"] ** 3
sns.regplot(x="length_cm_cubed",
            y="mass_g",
            data=perch,
            ci=None)
plt.show()

A scatter plot of perch masses versus their lengths cubed, with a trend line. After this transformation, the points are mostly close to the trend line.
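
To see why cubing the length helps, here is a minimal sketch on made-up numbers (not the perch data). It assumes mass is exactly proportional to length cubed; the constant 0.0168 and the helper `pearson` are illustrative assumptions. Under that assumption, the correlation between mass and length falls short of 1, while the correlation between mass and length cubed is 1 up to floating-point rounding.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical fish: mass is exactly proportional to length cubed
lengths = [10, 15, 20, 25, 30, 35, 40]
masses = [0.0168 * length ** 3 for length in lengths]
lengths_cubed = [length ** 3 for length in lengths]

r_raw = pearson(lengths, masses)          # curved relationship, r below 1
r_cubed = pearson(lengths_cubed, masses)  # straight line, r is 1 up to rounding

print(f"mass vs. length:       r = {r_raw:.4f}")
print(f"mass vs. length cubed: r = {r_cubed:.4f}")
```

Real data is noisy, so the cubed correlation will not reach 1, but the same comparison shows whether a transformation brings the points closer to a straight line.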

Modeling mass vs. length cubed

perch["length_cm_cubed"] = perch["length_cm"] ** 3

from statsmodels.formula.api import ols

mdl_perch = ols("mass_g ~ length_cm_cubed", data=perch).fit()
print(mdl_perch.params)
Intercept         -0.117478
length_cm_cubed    0.016796
dtype: float64

Predicting mass vs. length cubed

import numpy as np
import pandas as pd

explanatory_data = pd.DataFrame({"length_cm_cubed": np.arange(10, 41, 5) ** 3,
                                 "length_cm": np.arange(10, 41, 5)})
prediction_data = explanatory_data.assign(
  mass_g=mdl_perch.predict(explanatory_data))
print(prediction_data)
   length_cm_cubed  length_cm       mass_g
0             1000         10    16.678135
1             3375         15    56.567717
2             8000         20   134.247429
3            15625         25   262.313982
4            27000         30   453.364084
5            42875         35   719.994447
6            64000         40  1074.801781
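
The predictions in this table are simply the intercept plus the slope times the cubed length. As a rough check, the sketch below recomputes them by hand from the coefficients printed on the previous slide. Because those coefficients are rounded to six decimals, the results agree with the table only to about two decimal places; the function name `predict_mass_g` is an illustrative assumption.

```python
# Coefficients as printed by mdl_perch.params (rounded to 6 decimals)
INTERCEPT = -0.117478
SLOPE = 0.016796

def predict_mass_g(length_cm):
    """Predicted perch mass in grams for a given length in centimeters."""
    return INTERCEPT + SLOPE * length_cm ** 3

# Same grid of lengths as np.arange(10, 41, 5)
for length_cm in range(10, 41, 5):
    print(length_cm, round(predict_mass_g(length_cm), 2))
```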

Plotting mass vs. length cubed

fig = plt.figure()
sns.regplot(x="length_cm_cubed", y="mass_g",
            data=perch, ci=None)
sns.scatterplot(data=prediction_data,
                x="length_cm_cubed", y="mass_g",
                color="red", marker="s")
plt.show()

The scatter plot of perch masses versus their lengths cubed, with a trend line, annotated with points calculated from the predict() function. The points follow the trend line exactly.

fig = plt.figure()
sns.regplot(x="length_cm", y="mass_g",
            data=perch, ci=None)
sns.scatterplot(data=prediction_data,
                x="length_cm", y="mass_g",
                color="red", marker="s")
plt.show()

The scatter plot of perch masses versus their lengths, with a trend line, annotated with points calculated from the predict() function. The points don't follow the trend line but they do follow the curve of the data points.
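
The slides rely on a `fish` dataset that is not included here. The sketch below runs the same transform, fit, and predict steps end to end on synthetic data, so it can be executed on its own. The generated values, the noise level, and the names `fake_perch` and `mdl_fake` are assumptions for illustration, not the course data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the perch data (assumed values, not real measurements)
rng = np.random.default_rng(42)
length_cm = np.linspace(8, 44, 40)
mass_g = 0.0168 * length_cm ** 3 + rng.normal(0, 20, size=length_cm.size)
fake_perch = pd.DataFrame({"length_cm": length_cm, "mass_g": mass_g})

# Transform the explanatory variable, then fit a linear model on the cubed length
fake_perch["length_cm_cubed"] = fake_perch["length_cm"] ** 3
mdl_fake = ols("mass_g ~ length_cm_cubed", data=fake_perch).fit()

# Predict on a grid; keep the untransformed length for plotting on the original scale
grid = np.arange(10, 41, 5)
explanatory_data = pd.DataFrame({"length_cm_cubed": grid ** 3,
                                 "length_cm": grid})
prediction_data = explanatory_data.assign(
  mass_g=mdl_fake.predict(explanatory_data))

print(mdl_fake.params)
print(prediction_data)
```

Keeping `length_cm` alongside `length_cm_cubed` in `explanatory_data` is what makes it possible to plot the predictions against either axis, as in the two plots above.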

Facebook advertising dataset

How advertising works

  1. Pay Facebook to show ads.
  2. People see the ads ("impressions").
  3. Some people who see an ad click it.

 

  • 936 rows
  • Each row represents 1 advert
spent_usd  n_impressions  n_clicks
     1.43           7350         1
     1.82          17861         2
     1.25           4259         1
     1.29           4133         1
     4.77          15615         3
      ...            ...       ...

Plot is cramped

sns.regplot(x="spent_usd",
            y="n_impressions",
            data=ad_conversion,
            ci=None)
plt.show()

A scatter plot of number of impressions versus advertising spend, with a trend line. Most data points are crammed into the bottom left of the plot.

Square root vs. square root

ad_conversion["sqrt_spent_usd"] = np.sqrt(
  ad_conversion["spent_usd"])

ad_conversion["sqrt_n_impressions"] = np.sqrt(
  ad_conversion["n_impressions"])

sns.regplot(x="sqrt_spent_usd",
            y="sqrt_n_impressions",
            data=ad_conversion,
            ci=None)
plt.show()

A scatter plot of the square root of the number of impressions versus the square root of the advertising spend, with a trend line. Now the points are more evenly spread throughout the plot.
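
The square root shrinks large values more than small ones, which is why it spreads out data that is bunched near zero with a long tail. A minimal sketch using only the five `n_impressions` values from the sample rows shown earlier (far too few to represent the 936-row dataset, so treat it as an illustration):

```python
import math

# n_impressions values from the five sample rows of the advertising dataset
n_impressions = [7350, 17861, 4259, 4133, 15615]
sqrt_n_impressions = [math.sqrt(n) for n in n_impressions]

# Ratio of the largest to the smallest value, before and after the transformation
ratio_raw = max(n_impressions) / min(n_impressions)
ratio_sqrt = max(sqrt_n_impressions) / min(sqrt_n_impressions)

print(f"raw ratio:  {ratio_raw:.2f}")
print(f"sqrt ratio: {ratio_sqrt:.2f}")
```

The ratio after the transformation is the square root of the ratio before it, so the gap between the extremes narrows.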

Modeling and predicting

mdl_ad = ols("sqrt_n_impressions ~ sqrt_spent_usd", data=ad_conversion).fit()
explanatory_data = pd.DataFrame({"sqrt_spent_usd": np.sqrt(np.arange(0, 601, 100)),
                                 "spent_usd": np.arange(0, 601, 100)})
# Square the predictions to back-transform them to the original scale
prediction_data = explanatory_data.assign(
  sqrt_n_impressions=mdl_ad.predict(explanatory_data),
  n_impressions=mdl_ad.predict(explanatory_data) ** 2)
print(prediction_data)
   sqrt_spent_usd  spent_usd  sqrt_n_impressions  n_impressions
0        0.000000          0           15.319713   2.346936e+02
1       10.000000        100          597.736582   3.572890e+05
2       14.142136        200          838.981547   7.038900e+05
3       17.320508        300         1024.095320   1.048771e+06
4       20.000000        400         1180.153450   1.392762e+06
5       22.360680        500         1317.643422   1.736184e+06
6       24.494897        600         1441.943858   2.079202e+06
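
Because the response was square-rooted before fitting, the model predicts on the square-root scale, and squaring undoes the transformation. The sketch below reproduces two rows of the table by hand. The coefficients are not taken from `mdl_ad.params` (which the slides do not print); they are inferred from the printed predictions, so treat them as an assumption. The function name `predict_n_impressions` is also illustrative.

```python
import math

# Coefficients inferred from the printed predictions (an assumption):
# the row with spent_usd = 0 gives the intercept,
# and the row with sqrt_spent_usd = 10 then gives the slope
INTERCEPT = 15.319713
SLOPE = (597.736582 - 15.319713) / 10

def predict_n_impressions(spent_usd):
    """Transform the input, predict on the sqrt scale, then square to back-transform."""
    sqrt_prediction = INTERCEPT + SLOPE * math.sqrt(spent_usd)
    return sqrt_prediction ** 2

for spent_usd in (100, 400):
    print(spent_usd, round(predict_n_impressions(spent_usd)))
```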

Let's practice!
