Supervised Learning in R: Regression
Nina Zumel and John Mount
Win-Vector, LLC


Train/test split: recommended method when data is plentiful
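When data is plentiful, some rows can be held out of training and used only for evaluation. The following is a minimal sketch in base R, not the authors' code: the synthetic data frame `df`, the 75/25 ratio, and the seed are all illustrative assumptions.

```r
# Minimal sketch of a random train/test split in base R.
# The data, the 75/25 ratio, and the seed are illustrative assumptions.
set.seed(42)
df <- data.frame(x = runif(100))
df$y <- 2 * df$x + rnorm(100, sd = 0.1)

# Draw one uniform random number per row; rows below the cutoff go to training
gp <- runif(nrow(df))
dfTrain <- df[gp < 0.75, ]
dfTest  <- df[gp >= 0.75, ]

# Fit on the training rows only, then predict on the held-out rows
model <- lm(y ~ x, data = dfTrain)
dfTest$pred <- predict(model, newdata = dfTest)
```

Because the test rows play no part in fitting, error measured on `dfTest` estimates how the model behaves on new data.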



Cross-validation: preferred when data is not large enough to split off a test set



library(vtreat)
splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)
nRows: number of rows in the training data
nSplits: number of folds (partitions) in the cross-validation
library(vtreat)
splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)
First fold (A and B to train, C to test)
splitPlan[[1]]
$train
1  2  4  5  7  9 10
$app
3 6 8
Train on A and B, test on C, etc...
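# Use the first fold of the split plan
split <- splitPlan[[1]]
# Fit the model on this fold's training rows only
model <- lm(fmla, data = df[split$train, ])
# Predict on this fold's held-out rows
df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])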
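The single-fold step above extends to every fold with a loop, so that each row receives a prediction from a model that never saw it. This is a sketch under stated assumptions: the synthetic `df`, the formula `fmla`, and the choice of 3 folds are made up for illustration.

```r
library(vtreat)

# Synthetic data and formula are illustrative assumptions
set.seed(2024)
df <- data.frame(x = runif(30))
df$y <- 3 * df$x + rnorm(30, sd = 0.2)
fmla <- y ~ x

nFolds <- 3
splitPlan <- kWayCrossValidation(nrow(df), nFolds, NULL, NULL)

# Start with missing predictions; each fold fills in its own held-out rows
df$pred.cv <- NA_real_
for (i in seq_len(nFolds)) {
  split <- splitPlan[[i]]
  model <- lm(fmla, data = df[split$train, ])
  df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
}
```

After the loop, `df$pred.cv` holds one out-of-sample prediction per row, which can be compared with `df$y` to estimate out-of-sample error.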

| Measure type | RMSE | $R^2$ | 
|---|---|---|
| train | 0.7082675 | 0.8029275 | 
| test | 0.9349416 | 0.7451896 | 
| cross-validation | 0.8175714 | 0.7635331 | 
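The two measures in the table can be computed from a vector of predictions and a vector of actual outcomes. The sketch below uses the standard definitions; the helper names `rmse()` and `r_squared()` and the toy vectors are hypothetical and not taken from the source.

```r
# Hypothetical helpers using the standard definitions of RMSE and R-squared
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

r_squared <- function(pred, actual) {
  rss <- sum((actual - pred)^2)          # residual sum of squares
  tss <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - rss / tss
}

# Toy vectors, made up for illustration
actual <- c(1, 2, 3, 4)
pred   <- c(1.1, 1.9, 3.2, 3.8)

rmse_value <- rmse(pred, actual)
r2_value   <- r_squared(pred, actual)
```

A cross-validation row like the one in the table would come from applying such helpers to the out-of-sample predictions (for example `df$pred.cv`) and the actual outcome column.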