Supervised Learning in R: Regression
Nina Zumel and John Mount
Win-Vector, LLC



Regularization: learning rate $\eta \in(0,1)$
$$ M_2 = M_1 + \eta \gamma_2 T_2 $$

Final Model:
$$ M = M_1 + \eta \sum \gamma_i T_i $$
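The update above can be sketched in base R with regression stumps standing in for the weak learners $T_i$. Everything here is an illustrative assumption, not course code: `fit_stump()` is a hypothetical one-split tree, the data are synthetic, and for squared loss the step size $\gamma_i$ is folded into the stump's leaf values (the mean residual is the optimal leaf prediction).

```r
# Hypothetical helper: a one-split regression stump fit to residuals r.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (s in quantile(x, probs = seq(0.1, 0.9, 0.1))) {
    left <- x <= s
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - pred)^2)
    if (sse < best$sse)
      best <- list(sse = sse, s = s, lv = mean(r[left]), rv = mean(r[!left]))
  }
  best
}
predict_stump <- function(st, x) ifelse(x <= st$s, st$lv, st$rv)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)   # toy data, not the bike data

eta <- 0.3                      # learning rate in (0, 1)
M   <- rep(mean(y), length(y))  # M_1: the initial constant model
for (i in 2:50) {
  r  <- y - M                   # residuals (gradient of squared loss)
  Ti <- fit_stump(x, r)         # weak learner fit to the residuals
  M  <- M + eta * predict_stump(Ti, x)   # regularized update: M + eta * gamma_i * T_i
}
sqrt(mean((y - M)^2))           # training RMSE shrinks as rounds accumulate
```

Shrinking each tree's contribution by `eta` slows learning, so more rounds are needed, but it reduces overfitting to any single tree's mistakes.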

Training error keeps decreasing as trees are added, but test error doesn't: past some point, additional trees overfit.
To find the right number of trees, cross-validate:
- Run xgb.cv() with a large number of rounds (trees).
- xgb.cv()$evaluation_log records the estimated RMSE for each round.
- Find the round $n_{best}$ with the lowest estimated RMSE.
- Run xgboost(), setting nrounds = $n_{best}$.

First, prepare the data
treatplan <- designTreatmentsZ(bikesJan, vars)
newvars <- treatplan$scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    use_series(varName)
bikesJan.treat <- prepare(treatplan, bikesJan, varRestriction = newvars)
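Conceptually, designTreatmentsZ()/prepare() turn the input into an all-numeric frame: cleaned numeric variables plus 0/1 level-indicator columns, ready for as.matrix(). A base-R analogue of that indicator encoding, using a made-up toy frame rather than the bike data:

```r
# Toy frame standing in for bikesJan; hr is categorical, temp is numeric.
df <- data.frame(hr   = factor(c("0", "1", "1", "2")),
                 temp = c(0.24, 0.22, 0.22, 0.24))

# model.matrix() expands the factor into 0/1 indicator columns,
# roughly what prepare() produces for the "lev" variables.
m <- model.matrix(~ . - 1, data = df)
m
```

Each row gets exactly one 1 among the hr indicator columns, so no categorical information is lost in the numeric matrix.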
For xgboost():
- data: as.matrix(bikesJan.treat)
- label: bikesJan$cnt

cv <- xgb.cv(data = as.matrix(bikesJan.treat), label = bikesJan$cnt,
objective = "reg:squarederror",
nrounds = 100, nfold = 5, eta = 0.3, max_depth = 6)
Key inputs to xgb.cv() and xgboost()
- data: input data as a matrix
- label: the outcome
- objective: for regression, "reg:squarederror"
- nrounds: maximum number of trees to fit
- eta: learning rate
- max_depth: maximum depth of individual trees
- nfold (xgb.cv() only): number of folds for cross-validation
elog <- as.data.frame(cv$evaluation_log)
(nrounds <- which.min(elog$test_rmse_mean))
[1] 78
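which.min() simply returns the index of the smallest value. With a mock evaluation log (the numbers below are made up, not the course's output), the pattern looks like this:

```r
# Mock evaluation log: test RMSE falls, then rises as extra trees overfit.
elog <- data.frame(iter = 1:5,
                   test_rmse_mean = c(90, 70, 60, 62, 65))
(nbest <- which.min(elog$test_rmse_mean))  # index of lowest test RMSE
# → 3
```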
nrounds <- 78
model <- xgboost(data = as.matrix(bikesJan.treat),
label = bikesJan$cnt,
nrounds = nrounds,
objective = "reg:squarederror",
eta = 0.3,
max_depth = 6)
Prepare February data, and predict
bikesFeb.treat <- prepare(treatplan, bikesFeb, varRestriction = newvars)
bikesFeb$pred <- predict(model, as.matrix(bikesFeb.treat))
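The RMSE values reported below come from comparing predictions with the observed counts. The general calculation, shown on toy vectors rather than the February data:

```r
# Hypothetical predictions and actuals, standing in for bikesFeb.
pred   <- c(100, 150, 210)
actual <- c(110, 140, 200)
(rmse  <- sqrt(mean((pred - actual)^2)))
# → 10
```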
Model performance on February data
| Model | RMSE |
|---|---|
| Quasipoisson | 69.3 |
| Random forests | 67.15 |
| Gradient boosting | 54.0 |
Predictions vs. Actual Bike Rentals, February

Predictions and Hourly Bike Rentals, February
