Feature Engineering in R
Jorge Zazueta
Research Professor and Head of the Modeling Group at the School of Economics, UASLP
Feature engineering is the art and science of creating and transforming
variables to improve model performance and interpretability.
Height of an object as a function of time
# A tibble: 100 × 2
    time height
   <dbl>  <dbl>
 1 0       0
 2 0.101   3.85
 3 0.202  17.7
 4 0.303  15.1
 5 0.404  20.0
 6 0.505  32.6
 7 0.606  30.8
 8 0.707  26.6
 9 0.808  33.8
10 0.909  39.2
# ... with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
We fit a simple linear regression of height on time
# `height` names both the data frame and its response column
lr_height <- lm(height ~ time,
                data = height)
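and plot the fitted line over the data to assess the fit visually.
library(dplyr)    # %>%, bind_cols()
library(ggplot2)  # ggplot(), geom_point(), geom_line()

# Attach the fitted values as a new column
df <- height %>%
  bind_cols(lr_pred = predict(lr_height))

# Observed points with the regression line on top
df %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lr_pred),
            color = "blue", lwd = .75) +
  theme_classic()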
Our model fails miserably to represent the data: a straight line cannot follow its curvature.
Linear regression of height vs. time.
The height of an object moving under gravity follows a parabola in time, given by the following formula:
$y(t) = y_0 + v_0 t - \frac{g}{2} t^2$,
where $y(t)$ is the height of the object at time $t$, and $y_0$, $v_0$, and $g$ are, respectively, the initial height, the initial velocity, and the acceleration due to gravity.
We can improve our model by recognizing that height depends on both time and the square of time.
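Written as a regression, the formula above stays linear in its coefficients, which is why ordinary linear regression can still be used. The symbols $\beta_0$, $\beta_1$, $\beta_2$, and $\varepsilon_i$ are notation introduced here for illustration and do not appear in the original slides:

$$\text{height}_i = \beta_0 + \beta_1 t_i + \beta_2 t_i^2 + \varepsilon_i, \qquad \beta_0 = y_0, \quad \beta_1 = v_0, \quad \beta_2 = -\frac{g}{2}.$$

Under this reading, the squared term is simply one more column in the data, and its coefficient carries the effect of gravity.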
mutate() takes a data frame as its first argument, followed by the definition of each new variable to add to that data frame.
df_2 <- df %>% mutate(time_2 = time^2)
# A tibble: 100 × 4
   time height lr_pred time_2
  <dbl>  <dbl>   <dbl>  <dbl>
1 0       0       80.8 0
2 0.101   3.85    80.9 0.0102
3 0.202  17.7     81.0 0.0408
4 0.303  15.1     81.1 0.0918
We create another regression model, using our new feature along with the original one.
lr_height_2 <-
  lm(height ~ time + time_2, data = df_2)
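As a side note, the squared term can also be built inside the model formula rather than stored as a column. The sketch below is only an illustration under stated assumptions: the data are simulated (the initial velocity of 49, g = 9.81, the noise level, and the seed are hypothetical choices, not the course data), and the names sim, sim_2, fit_mutate, and fit_inline are made up for this example. I() is base R's way of protecting arithmetic inside a formula.

```r
library(dplyr)

# Simulated stand-in for the course data (assumed parameters, not the original values)
set.seed(1)
sim <- tibble(time = seq(0, 10, length.out = 100)) %>%
  mutate(height = 49 * time - 9.81 / 2 * time^2 + rnorm(n(), sd = 5))

# Option 1: engineer the feature explicitly with mutate(), as in the text
sim_2 <- sim %>% mutate(time_2 = time^2)
fit_mutate <- lm(height ~ time + time_2, data = sim_2)

# Option 2: build the squared term inside the formula with I()
fit_inline <- lm(height ~ time + I(time^2), data = sim)

# Compare the fitted values of the two approaches
same_fit <- all.equal(unname(fitted(fit_mutate)), unname(fitted(fit_inline)))
print(same_fit)
```

Both routes describe the same model. Keeping the feature as an explicit column, as the text does, makes it visible in the data frame and reusable in later steps.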
And graph our new prediction.
df_2 <- df_2 %>%
  bind_cols(lr2_pred = predict(lr_height_2))
df_2 %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lr2_pred),
            color = "blue", lwd = .75) +
  theme_classic()
That is an impressive improvement, achieved without resorting to a different model: it is still a linear regression, now with one extra feature.
Height vs. time and time_2
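The visual comparison can be backed by numbers. The following is a minimal sketch, assuming simulated data in place of the course data set: the initial velocity of 49, g = 9.81, the noise level, and the seed are hypothetical, and sim, rmse(), fit_linear, fit_quadratic, and g_hat are names invented for this example. It compares in-sample error for the two models and reads an estimate of g off the time_2 coefficient, which equals $-g/2$ in the height formula.

```r
library(dplyr)

# Simulated stand-in for the course data (assumed parameters, not the original values)
set.seed(1)
sim <- tibble(time = seq(0, 10, length.out = 100)) %>%
  mutate(
    height = 49 * time - 9.81 / 2 * time^2 + rnorm(n(), sd = 5),
    time_2 = time^2
  )

fit_linear    <- lm(height ~ time, data = sim)
fit_quadratic <- lm(height ~ time + time_2, data = sim)

# Root mean squared error of the in-sample residuals
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

rmse_values <- c(linear = rmse(fit_linear), quadratic = rmse(fit_quadratic))
r2_values <- c(
  linear    = summary(fit_linear)$r.squared,
  quadratic = summary(fit_quadratic)$r.squared
)
print(rmse_values)
print(r2_values)

# The time_2 coefficient estimates -g / 2, so doubling and negating it estimates g
g_hat <- -2 * coef(fit_quadratic)[["time_2"]]
print(g_hat)
```

These are in-sample measures, so they show how much better the quadratic feature describes the data at hand rather than how the model would perform on new observations.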