One-Hot-Encoding Categorical Variables

Supervised Learning in R: Regression

Nina Zumel and John Mount

Win-Vector, LLC

Why Convert Categoricals Manually?

  • Most R functions manage the conversion for you
    • model.matrix()
  • xgboost() does not
    • Must convert categorical variables to numeric representation
  • Conversion to indicators: one-hot encoding
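To make "conversion to indicators" concrete, here is a minimal base-R sketch using model.matrix() on a small illustrative data frame. The column names and the choice of the formula `~ 0 + x + u` (which drops the intercept so that every level of x gets its own indicator column) are assumptions made for illustration, not something taken from the slides.

```r
# Small illustrative data frame: one categorical column, one numeric column
df <- data.frame(
  x = factor(c("one", "two", "three", "two")),
  u = c(44, 24, 66, 22)
)

# Dropping the intercept (the leading 0) yields one 0/1 indicator per level
# of x; the numeric column u is passed through unchanged.
mm <- model.matrix(~ 0 + x + u, data = df)
print(mm)
```

Each row has exactly one indicator set to 1, which is the one-hot representation of that row's level.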

One-hot-encoding and data cleaning with `vtreat`

Basic idea:

  • designTreatmentsZ() to design a treatment plan from the training data, then
  • prepare() to create "clean" data
    • all numerical
    • no missing values
  • use prepare() with the treatment plan for all future data

A Small vtreat Example

Training Data

x u y
one 44 0.4855671
two 24 1.3683726
three 66 2.0352837
two 22 1.6396267

Test Data

x u y
one 5 2.6488148
three 12 1.5012938
one 56 0.1993731
two 28 1.2778516
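The later slides refer to data frames named dframe and test without showing how they are built. One possible construction from the two tables above is sketched below; storing x as a character column (rather than a factor) is an assumption.

```r
# Training data, as shown in the table above
dframe <- data.frame(
  x = c("one", "two", "three", "two"),
  u = c(44, 24, 66, 22),
  y = c(0.4855671, 1.3683726, 2.0352837, 1.6396267),
  stringsAsFactors = FALSE
)

# Test data, as shown in the table above
test <- data.frame(
  x = c("one", "three", "one", "two"),
  u = c(5, 12, 56, 28),
  y = c(2.6488148, 1.5012938, 0.1993731, 1.2778516),
  stringsAsFactors = FALSE
)

str(dframe)
```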

Create the Treatment Plan

vars <- c("x", "u")
treatplan <- designTreatmentsZ(dframe, vars, verbose = FALSE)

Inputs to designTreatmentsZ()

  • dframe: training data
  • varlist: list of input variable names
  • set verbose = FALSE to suppress progress messages

Get the New Variables

The scoreFrame describes the variable mapping and types

library(dplyr)  # provides %>% and select()
(scoreFrame <- treatplan$scoreFrame %>%
     select(varName, origName, code))
        varName origName  code
1   x_lev_x.one        x   lev
2 x_lev_x.three        x   lev
3   x_lev_x.two        x   lev
4        x_catP        x  catP
5       u_clean        u clean

Get the names of the new lev and clean variables

library(magrittr)  # provides use_series(); filter() and %>% come from dplyr
(newvars <- scoreFrame %>%
     filter(code %in% c("clean", "lev")) %>%
     use_series(varName))
"x_lev_x.one"   "x_lev_x.three" "x_lev_x.two"   "u_clean"

Prepare the Training Data for Modeling

training.treat <- prepare(treatplan, dframe, varRestriction = newvars)

Inputs to prepare():

  • treatmentplan: treatment plan
  • dframe: data frame
  • varRestriction: list of variables to prepare (optional)
    • default: prepare all variables

Before and After Data Treatment

Training Data

x u y
one 44 0.4855671
two 24 1.3683726
three 66 2.0352837
two 22 1.6396267

Treated Training Data

x_lev_x.one x_lev_x.three x_lev_x.two u_clean
          1             0           0      44
          0             0           1      24
          0             1           0      66
          0             0           1      22

Prepare the Test Data Before Model Application

(test.treat <- prepare(treatplan, test, varRestriction = newvars))
  x_lev_x.one x_lev_x.three x_lev_x.two u_clean
1           1             0           0       5
2           0             1           0      12
3           1             0           0      56
4           0             0           1      28

vtreat Treatment is Robust

Previously unseen x level: four

x u y
one 4 0.2331301
two 14 1.9331760
three 66 3.1251029
four 25 4.0332491

four encodes to (0, 0, 0)

prepare(treatplan, toomany, ...)
x_lev_x.one x_lev_x.three x_lev_x.two u_clean
          1             0           0       4
          0             0           1      14
          0             1           0      66
          0             0           0      25
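The behavior on the unseen level can be checked end to end with a short sketch. It assumes the vtreat package is installed and rebuilds the training data and the toomany frame from the tables in these slides. The variable names are read from the scoreFrame rather than hard-coded, since the exact naming may differ between vtreat versions. Passing varRestriction here is a choice made for this sketch and is not necessarily what the elided arguments on the slide stand for.

```r
library(vtreat)  # assumed to be installed

# Training data from the earlier slide
dframe <- data.frame(
  x = c("one", "two", "three", "two"),
  u = c(44, 24, 66, 22),
  y = c(0.4855671, 1.3683726, 2.0352837, 1.6396267),
  stringsAsFactors = FALSE
)

# New data containing the level "four", which the treatment plan never saw
toomany <- data.frame(
  x = c("one", "two", "three", "four"),
  u = c(4, 14, 66, 25),
  y = c(0.2331301, 1.9331760, 3.1251029, 4.0332491),
  stringsAsFactors = FALSE
)

# Design the plan on the training data only
treatplan <- designTreatmentsZ(dframe, c("x", "u"), verbose = FALSE)

# Pick out the indicator ("lev") and cleaned numeric ("clean") variables
sf <- treatplan$scoreFrame
newvars <- sf$varName[sf$code %in% c("clean", "lev")]
levvars <- sf$varName[sf$code == "lev"]

# Apply the same plan to the new data
toomany.treat <- prepare(treatplan, toomany, varRestriction = newvars)

# Indicator columns for the row whose x is "four"
novel_row <- toomany.treat[toomany$x == "four", levvars]
print(novel_row)
```

Because the plan only knows the levels seen in training, no indicator fires for "four", and the row is still fully numeric with no missing values.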

Let's practice!
