What is feature engineering?

Feature Engineering in R

Jorge Zazueta

Research Professor and Head of the Modeling Group at the School of Economics, UASLP

What is feature engineering?

Feature engineering is the art and science of

  • creating,
  • transforming,
  • extracting, and
  • selecting

variables to improve model performance and interpretability.

Height of an object as a function of time

# A tibble: 100 × 2
    time height
   <dbl>  <dbl>
 1 0       0   
 2 0.101   3.85
 3 0.202  17.7 
 4 0.303  15.1 
 5 0.404  20.0 
 6 0.505  32.6 
 7 0.606  30.8 
 8 0.707  26.6 
 9 0.808  33.8 
10 0.909  39.2 
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
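
The full data set is not reproduced here. To follow along, a stand-in with the same shape can be simulated. This is only a sketch: the values of y0, v0, and g, the noise level, and the use of a base data.frame instead of a tibble are assumptions for illustration, not the course data.

```r
# Hypothetical stand-in for the `height` data: 100 time points with noisy heights.
set.seed(42)                         # reproducible noise
y0 <- 0; v0 <- 50; g <- 9.81         # assumed initial height, initial velocity, gravity
time <- seq(0, 10, length.out = 100) # evenly spaced times, step of about 0.101

height <- data.frame(
  time   = time,
  height = y0 + v0 * time - g / 2 * time^2 + rnorm(100, sd = 4)  # assumed noise level
)

head(height)
```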

Why engineer features?

We fit a simple linear regression of height on time

lr_height <- lm(height ~ time,
                data = height)

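and plot it to assess the fit by eye.

library(dplyr)    # %>% and bind_cols()
library(ggplot2)  # plotting

# Attach the fitted values to the data
df <- height %>% 
  bind_cols(lr_pred = predict(lr_height))

# Observed points with the fitted line on top
df %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lr_pred), 
            color = "blue", lwd = .75) +
  theme_classic()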

Our model fails to represent the data: a straight line cannot follow the curved trajectory!

Linear regression of height vs. time, illustrating lack of fit.
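
The visual impression can be backed by a quick numerical check. The sketch below is not part of the original slides: it uses simulated stand-in data (the parameters 50, 9.81, and the noise level are assumptions) and base R only. On data like these, the R-squared of the straight line should be close to zero, and the residuals should be negative at both ends and positive in the middle, which is the signature of a missing curved term.

```r
# Stand-in data with assumed parameters (not the course data)
set.seed(42)
time <- seq(0, 10, length.out = 100)
height <- data.frame(
  time   = time,
  height = 50 * time - 9.81 / 2 * time^2 + rnorm(100, sd = 4)
)

# Straight-line model, as in the slides
lr_height <- lm(height ~ time, data = height)

# Share of the variance in height explained by the straight line
r2_linear <- summary(lr_height)$r.squared

# Average residual at the start, middle, and end of the time range
res <- residuals(lr_height)
mean_res_start  <- mean(res[1:10])
mean_res_middle <- mean(res[46:55])
mean_res_end    <- mean(res[91:100])

r2_linear
c(start = mean_res_start, middle = mean_res_middle, end = mean_res_end)
```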

Using mutate()

The height of an object moving under gravity alone is a parabolic function of time:

$y(t) = y_0 + v_0t - \frac{g}{2}t^2$,

where $y(t)$ is the height of the object at time $t$, and $y_0$, $v_0$, and $g$ are, respectively, the initial height, the initial velocity, and the acceleration due to gravity.

We can therefore improve our model by letting height depend on both time and the square of time.
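
Matching the formula term by term shows why a linear model with two features is enough. The coefficient names $\beta_0$, $\beta_1$, $\beta_2$ and the error term $\varepsilon$ are notation introduced here for illustration:

$\text{height} = \beta_0 + \beta_1\,\text{time} + \beta_2\,\text{time}^2 + \varepsilon$, with $\beta_0 = y_0$, $\beta_1 = v_0$, and $\beta_2 = -\frac{g}{2}$.

The model is nonlinear in time but still linear in its coefficients, so ordinary linear regression applies once the squared term is available as a feature.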

mutate() takes a data frame as its first argument, followed by the definitions of the new variables to add to it.

df_2 <- df %>% mutate(time_2 = time^2)
df_2

# A tibble: 100 × 4
    time height lr_pred time_2
   <dbl>  <dbl>   <dbl>  <dbl>
 1 0       0       80.8 0     
 2 0.101   3.85    80.9 0.0102
 3 0.202  17.7     81.0 0.0408
 4 0.303  15.1     81.1 0.0918

Predict using the engineered feature

We create another regression model, using our new feature along with the original one.

lr_height_2 <- 
  lm(height ~ time + time_2, data = df_2)

And plot the new predictions.

df_2 <- df_2 %>%
  bind_cols(lr2_pred = predict(lr_height_2))

df_2 %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lr2_pred), 
            color = "blue", lwd = .75) +
  theme_classic()

That is an impressive improvement without resorting to a different kind of model: it is still a linear regression, only with a better feature.
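
The improvement can also be checked numerically. The following is a sketch on simulated stand-in data (the parameters 50 and 9.81 and the noise level are assumptions, and base R replaces mutate() to keep the block dependency-free). It compares the R-squared of both models and, assuming the physical formula holds, reads an estimate of $g$ off the coefficient of time_2, which should equal $-\frac{g}{2}$.

```r
# Stand-in data with assumed parameters (not the course data)
set.seed(42)
time <- seq(0, 10, length.out = 100)
df_2 <- data.frame(
  time   = time,
  height = 50 * time - 9.81 / 2 * time^2 + rnorm(100, sd = 4)
)
df_2$time_2 <- df_2$time^2   # engineered feature (base-R equivalent of mutate())

# Model without and with the engineered feature
lr_height   <- lm(height ~ time, data = df_2)
lr_height_2 <- lm(height ~ time + time_2, data = df_2)

# Share of variance explained by each model
r2 <- c(linear    = summary(lr_height)$r.squared,
        quadratic = summary(lr_height_2)$r.squared)

# Under the physical formula, the time_2 coefficient estimates -g/2
g_hat <- unname(-2 * coef(lr_height_2)["time_2"])

r2
g_hat
```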

Height vs. time and time_2
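
As a side note, the squared term does not have to be stored as a column. R's formula interface accepts `I(time^2)`, where `I()` keeps `^` from being read as a formula operator. The sketch below, on simulated stand-in data with assumed parameters, checks that both approaches give the same fitted values. An explicit column, as in the slides, remains convenient when the feature is reused elsewhere.

```r
# Stand-in data with assumed parameters (not the course data)
set.seed(42)
time <- seq(0, 10, length.out = 100)
df <- data.frame(
  time   = time,
  height = 50 * time - 9.81 / 2 * time^2 + rnorm(100, sd = 4)
)

# Explicit engineered column, as in the slides
df_2 <- transform(df, time_2 = time^2)
fit_column <- lm(height ~ time + time_2, data = df_2)

# Same model with the square written inside the formula
fit_inline <- lm(height ~ time + I(time^2), data = df)

# Both fits should produce the same predictions
same_fit <- isTRUE(all.equal(unname(fitted(fit_column)),
                             unname(fitted(fit_inline))))
same_fit
```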

Let's practice!
