Supervised Learning in R: Regression
Nina Zumel and John Mount
Win-Vector, LLC
Most R modeling functions handle categorical variables for you: model.matrix()
xgboost() does not
Basic idea:
designTreatmentsZ() to design a treatment plan from the training data, then
prepare() to create "clean" data
prepare() with the treatment plan for all future data
Training Data
x | u | y |
---|---|---|
one | 44 | 0.4855671 |
two | 24 | 1.3683726 |
three | 66 | 2.0352837 |
two | 22 | 1.6396267 |
Test Data
x | u | y |
---|---|---|
one | 5 | 2.6488148 |
three | 12 | 1.5012938 |
one | 56 | 0.1993731 |
two | 28 | 1.2778516 |
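For comparison, here is a minimal base-R sketch of the kind of conversion model.matrix() performs, using the training data above. The formula and the name mm are illustrative assumptions and do not come from the slides.

```r
# Training data from the table above
dframe <- data.frame(
  x = factor(c("one", "two", "three", "two")),
  u = c(44, 24, 66, 22),
  y = c(0.4855671, 1.3683726, 2.0352837, 1.6396267)
)

# model.matrix() expands the factor x into 0/1 indicator columns.
# The "- 1" drops the intercept so that every level of x gets its own column.
mm <- model.matrix(~ x + u - 1, data = dframe)
print(mm)
```

The indicator columns are built only from the levels present in the data passed in, which is one reason to design the encoding once on the training data and reuse it afterwards.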
vars <- c("x", "u")
treatplan <- designTreatmentsZ(dframe, vars, verbose = FALSE)
Inputs to designTreatmentsZ():
dframe: training data
varlist: list of input variable names
The scoreFrame describes the variable mapping and types
(scoreFrame <- treatplan$scoreFrame %>%
select(varName, origName, code))
varName origName code
1 x_lev_x.one x lev
2 x_lev_x.three x lev
3 x_lev_x.two x lev
4 x_catP x catP
5 u_clean u clean
Get the names of the new lev and clean variables
(newvars <- scoreFrame %>%
filter(code %in% c("clean", "lev")) %>%
use_series(varName))
"x_lev_x.one" "x_lev_x.three" "x_lev_x.two" "u_clean"
training.treat <- prepare(treatplan, dframe, varRestriction = newvars)
Inputs to prepare():
treatmentplan: treatment plan
dframe: data frame
varRestriction: list of variables to prepare (optional)
Training Data
x | u | y |
---|---|---|
one | 44 | 0.4855671 |
two | 24 | 1.3683726 |
three | 66 | 2.0352837 |
two | 22 | 1.6396267 |
Treated Training Data
x_lev_x.one | x_lev_x.three | x_lev_x.two | u_clean |
---|---|---|---|
1 | 0 | 0 | 44 |
0 | 0 | 1 | 24 |
0 | 1 | 0 | 66 |
0 | 0 | 1 | 22 |
(test.treat <- prepare(treatplan, test, varRestriction = newvars))
x_lev_x.one x_lev_x.three x_lev_x.two u_clean
1 1 0 0 5
2 0 1 0 12
3 1 0 0 56
4 0 0 1 28
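The steps so far can be collected into one self-contained sketch. It assumes the vtreat package is installed and selects the variable names with base-R indexing rather than the pipe used on the slides. The generated column names may differ between vtreat versions, so the sketch reads them from the scoreFrame instead of hard-coding them.

```r
library(vtreat)

# Training and test data from the tables above
dframe <- data.frame(
  x = c("one", "two", "three", "two"),
  u = c(44, 24, 66, 22),
  y = c(0.4855671, 1.3683726, 2.0352837, 1.6396267),
  stringsAsFactors = FALSE
)
test <- data.frame(
  x = c("one", "three", "one", "two"),
  u = c(5, 12, 56, 28),
  y = c(2.6488148, 1.5012938, 0.1993731, 1.2778516),
  stringsAsFactors = FALSE
)

# 1. Design the treatment plan from the training data only
vars <- c("x", "u")
treatplan <- designTreatmentsZ(dframe, vars, verbose = FALSE)

# 2. Keep the indicator ("lev") and cleaned numeric ("clean") variables
scoreFrame <- treatplan$scoreFrame
newvars <- scoreFrame$varName[scoreFrame$code %in% c("clean", "lev")]

# 3. Apply the same plan to the training data and to any later data
training.treat <- prepare(treatplan, dframe, varRestriction = newvars)
test.treat <- prepare(treatplan, test, varRestriction = newvars)
print(test.treat)
```

Because the plan is built once and reused, the treated training and test frames share the same columns, which is what a model such as xgboost() needs.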
Previously unseen x level: four
x | u | y |
---|---|---|
one | 4 | 0.2331301 |
two | 14 | 1.9331760 |
three | 66 | 3.1251029 |
four | 25 | 4.0332491 |
four encodes to (0, 0, 0)
prepare(treatplan, toomany, ...)
x_lev_x.one | x_lev_x.three | x_lev_x.two | u_clean |
---|---|---|---|
1 | 0 | 0 | 4 |
0 | 0 | 1 | 14 |
0 | 1 | 0 | 66 |
0 | 0 | 0 | 25 |
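The all-zeros encoding of the unseen level can be checked with a short sketch, again assuming vtreat is installed. The names levvars and toomany.treat are illustrative, and only the indicator columns are prepared here so that the row sums count indicators alone.

```r
library(vtreat)

# Training data from the earlier slides
dframe <- data.frame(
  x = c("one", "two", "three", "two"),
  u = c(44, 24, 66, 22),
  y = c(0.4855671, 1.3683726, 2.0352837, 1.6396267),
  stringsAsFactors = FALSE
)

# New data containing the level "four", which the plan never saw
toomany <- data.frame(
  x = c("one", "two", "three", "four"),
  u = c(4, 14, 66, 25),
  y = c(0.2331301, 1.9331760, 3.1251029, 4.0332491),
  stringsAsFactors = FALSE
)

treatplan <- designTreatmentsZ(dframe, c("x", "u"), verbose = FALSE)

# Select only the indicator ("lev") columns from the plan
sf <- treatplan$scoreFrame
levvars <- sf$varName[sf$code == "lev"]

toomany.treat <- prepare(treatplan, toomany, varRestriction = levvars)

# Number of indicators set in each row; the "four" row has none set
lev_counts <- rowSums(toomany.treat[, levvars, drop = FALSE])
print(lev_counts)
```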