Supervised Learning in R: Regression
Nina Zumel and John Mount
Win-Vector, LLC
Regularization: learning rate $\eta \in (0, 1)$

$$ M_2 = M_1 + \eta \gamma_2 T_2 $$

Final Model:

$$ M = M_1 + \eta \sum_i \gamma_i T_i $$
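As a concrete illustration (a toy sketch, not xgboost's actual implementation), the shrunken update can be written as a loop that fits each new tree to the current residuals; for squared error, fitting directly to residuals absorbs the step size $\gamma_i$ into the tree:

library(rpart)  # small trees for the sketch; the toy data below is invented

set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(4 * d$x) + rnorm(200, sd = 0.1)

eta <- 0.3                          # learning rate in (0, 1)
M <- rep(mean(d$y), nrow(d))        # M_1: constant initial model
for (i in 1:50) {
  d$resid <- d$y - M                # residuals of the current ensemble
  T_i <- rpart(resid ~ x, data = d, control = rpart.control(maxdepth = 2))
  M <- M + eta * predict(T_i, d)    # shrunken additive update: M + eta * T_i
}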
As rounds (trees) are added, training error keeps decreasing, but test error doesn't
Best practice:

1. Run xgb.cv() with a large number of rounds (trees).
2. xgb.cv()$evaluation_log records the estimated RMSE for each round; find the number of trees that minimizes the estimated RMSE: $n_{best}$.
3. Run xgboost(), setting nrounds = $n_{best}$.

First, prepare the data
# Requires vtreat (designTreatmentsZ, prepare), dplyr (filter), magrittr (use_series)
library(vtreat)
library(dplyr)
library(magrittr)
# vars: character vector of input variable names (defined earlier)
treatplan <- designTreatmentsZ(bikesJan, vars)
newvars <- treatplan$scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)
bikesJan.treat <- prepare(treatplan, bikesJan, varRestriction = newvars)
For xgboost():

- Input data must be a numeric matrix: as.matrix(bikesJan.treat)
- The outcome (label) must be a numeric vector: bikesJan$cnt
library(xgboost)

cv <- xgb.cv(data = as.matrix(bikesJan.treat), label = bikesJan$cnt,
             objective = "reg:squarederror",
             nrounds = 100, nfold = 5, eta = 0.3, max_depth = 6)
Key inputs to xgb.cv() and xgboost():

- data: input data as a numeric matrix
- label: outcome as a numeric vector
- objective: for regression, "reg:squarederror"
- nrounds: maximum number of trees to fit
- eta: learning rate
- max_depth: maximum depth of individual trees
- nfold (xgb.cv() only): number of folds for cross validation

Find the number of trees that minimizes estimated test RMSE:

elog <- as.data.frame(cv$evaluation_log)
(nrounds <- which.min(elog$test_rmse_mean))
[1] 78
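Plotting both RMSE curves from the evaluation log shows the pattern noted earlier: training RMSE keeps decreasing while estimated test RMSE bottoms out near this round. A sketch, using the elog data frame built above:

library(ggplot2)

elog_long <- data.frame(
  iter = rep(elog$iter, 2),
  rmse = c(elog$train_rmse_mean, elog$test_rmse_mean),
  set  = rep(c("train", "test"), each = nrow(elog))
)
ggplot(elog_long, aes(x = iter, y = rmse, color = set)) +
  geom_line()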
nrounds <- 78  # best number of rounds found by cross validation above
model <- xgboost(data = as.matrix(bikesJan.treat),
label = bikesJan$cnt,
nrounds = nrounds,
objective = "reg:squarederror",
eta = 0.3,
max_depth = 6)
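As an optional diagnostic (not part of the original workflow), xgb.importance() from the xgboost package reports how much each input variable contributes to the fitted model:

importance <- xgb.importance(model = model)  # Gain, Cover, Frequency per feature
head(importance)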
Prepare the February data and predict
bikesFeb.treat <- prepare(treatplan, bikesFeb, varRestriction = newvars)
bikesFeb$pred <- predict(model, as.matrix(bikesFeb.treat))
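The RMSE values in the table below can be computed from these predictions; a sketch using dplyr (loaded earlier), assuming bikesFeb$cnt holds the actual hourly counts:

bikesFeb %>%
  mutate(residual = cnt - pred) %>%          # prediction residuals
  summarize(rmse = sqrt(mean(residual^2)))   # root mean squared error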
Model performance on February data
| Model | RMSE |
|---|---|
| Quasipoisson | 69.3 |
| Random forests | 67.15 |
| Gradient boosting | 54.0 |
Predictions vs. Actual Bike Rentals, February
Predictions and Hourly Bike Rentals, February
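Sketches of the code behind the two plots above: the scatterplot uses only pred and cnt; the hourly plot assumes an hour-index column, here called instant (a hypothetical name):

library(ggplot2)
library(tidyr)   # for gather()

# Predictions vs. actual bike rentals
ggplot(bikesFeb, aes(x = pred, y = cnt)) +
  geom_point() +
  geom_abline()

# Predictions and actual hourly rentals over time (assumes an `instant` column)
bikesFeb %>%
  gather(key = valuetype, value = value, cnt, pred) %>%
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
  geom_line()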