Supervised Learning in R: Regression
Nina Zumel and John Mount
Win-Vector, LLC
Most R modeling functions convert categorical variables to indicator variables for you via model.matrix(); xgboost() does not.
Basic idea:
- designTreatmentsZ() to design a treatment plan from the training data, then
- prepare() to create "clean" data
- prepare() with the treatment plan for all future data
Training Data
| x | u | y | 
|---|---|---|
| one | 44 | 0.4855671 | 
| two | 24 | 1.3683726 | 
| three | 66 | 2.0352837 | 
| two | 22 | 1.6396267 | 
Test Data
| x | u | y | 
|---|---|---|
| one | 5 | 2.6488148 | 
| three | 12 | 1.5012938 | 
| one | 56 | 0.1993731 | 
| two | 28 | 1.2778516 | 
library(vtreat)
vars <- c("x", "u")
treatplan <- designTreatmentsZ(dframe, vars, verbose = FALSE)
Inputs to designTreatmentsZ():
- dframe: training data
- varlist: list of input variable names
The scoreFrame describes the variable mapping and types
(scoreFrame <- treatplan$scoreFrame %>% 
     select(varName, origName, code))
        varName origName  code
1   x_lev_x.one        x   lev
2 x_lev_x.three        x   lev
3   x_lev_x.two        x   lev
4        x_catP        x  catP
5       u_clean        u clean
Get the names of the new lev and clean variables
(newvars <- scoreFrame %>% 
     filter(code %in% c("clean", "lev")) %>%
     use_series(varName))
"x_lev_x.one"   "x_lev_x.three" "x_lev_x.two"   "u_clean"
training.treat <- prepare(treatplan, dframe, varRestriction = newvars)
Inputs to prepare():
- treatmentplan: treatment plan
- dframe: data frame
- varRestriction: list of variables to prepare (optional)
Training Data
| x | u | y | 
|---|---|---|
| one | 44 | 0.4855671 | 
| two | 24 | 1.3683726 | 
| three | 66 | 2.0352837 | 
| two | 22 | 1.6396267 | 
Treated Training Data
| x_lev_x.one | x_lev_x.three | x_lev_x.two | u_clean | 
|---|---|---|---|
| 1 | 0 | 0 | 44 | 
| 0 | 0 | 1 | 24 | 
| 0 | 1 | 0 | 66 | 
| 0 | 0 | 1 | 22 | 
(test.treat <- prepare(treatplan, test, varRestriction = newvars))
  x_lev_x.one x_lev_x.three x_lev_x.two u_clean
1           1             0           0       5
2           0             1           0      12
3           1             0           0      56
4           0             0           1      28
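For comparison with the treated frame above, base R's model.matrix() can build similar indicator columns directly from a formula. The following is a small sketch using the test data from the tables; it is not part of the vtreat workflow, and the column names it produces (xone, xthree, xtwo) differ from vtreat's.

```r
# Test data copied from the "Test Data" table (y omitted; only inputs are encoded).
test <- data.frame(
  x = c("one", "three", "one", "two"),
  u = c(5, 12, 56, 28),
  stringsAsFactors = FALSE
)

# "- 1" drops the intercept so that every level of x gets its own
# indicator column instead of one level being absorbed as the baseline.
mm <- model.matrix(~ x + u - 1, data = test)
print(mm)
```

Unlike a vtreat treatment plan, this encoding is derived from whatever data frame is passed in, so the set of columns can change from one data set to the next.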
Previously unseen x level: four
| x | u | y | 
|---|---|---|
| one | 4 | 0.2331301 | 
| two | 14 | 1.9331760 | 
| three | 66 | 3.1251029 | 
| four | 25 | 4.0332491 | 
four encodes to (0, 0, 0)
prepare(treatplan, toomany, ...)
| x_lev_x.one | x_lev_x.three | x_lev_x.two | u_clean | 
|---|---|---|---|
| 1 | 0 | 0 | 4 | 
| 0 | 0 | 1 | 14 | 
| 0 | 1 | 0 | 66 | 
| 0 | 0 | 0 | 25 | 
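The all-zero row for four follows from fixing the indicator columns at training time. Below is a minimal base R sketch of that idea. The helpers learn_levels() and encode_levels() are hypothetical names introduced here for illustration; this shows the behavior only and is not how vtreat is implemented.

```r
# x columns copied from the training table and the "previously unseen level" table.
train_x <- c("one", "two", "three", "two")
new_x   <- c("one", "two", "three", "four")

# Hypothetical helper: remember the levels seen in the training data.
learn_levels <- function(x) sort(unique(x))

# Hypothetical helper: build one indicator column per training level.
# A value that was not seen in training matches no level, so its row is all zeros.
encode_levels <- function(x, levels) {
  m <- sapply(levels, function(lev) as.integer(x == lev))
  m <- matrix(m, nrow = length(x), ncol = length(levels))
  colnames(m) <- paste0("x_lev_x.", levels)
  m
}

levels_seen <- learn_levels(train_x)
encoded <- encode_levels(new_x, levels_seen)
print(encoded)
```

Because the columns are fixed by the training data, new data always maps to the same set of variables, which is what a downstream model expects.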