Supervised Learning in R: Regression
Nina Zumel and John Mount
Win-Vector, LLC
Test/train split: the recommended method when data is plentiful
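When data is plentiful, a held-out test set can be split off at random. The following is a minimal sketch in base R, not the authors' code. The data frame `df`, the columns `x` and `y`, and the 75/25 proportion are all hypothetical choices for illustration.

```r
# Hypothetical data: 1000 rows, one input x and one outcome y
set.seed(42)
df <- data.frame(x = rnorm(1000))
df$y <- 2 * df$x + rnorm(1000)

# Assign roughly 75% of rows to training and the rest to test
# (the 75/25 proportion is an assumption, not a rule)
gp <- runif(nrow(df))
train <- df[gp < 0.75, ]
test  <- df[gp >= 0.75, ]

# Fit on the training set only, then evaluate on the held-out test set
model <- lm(y ~ x, data = train)
test$pred <- predict(model, newdata = test)
rmse_test <- sqrt(mean((test$pred - test$y)^2))
```

Because the test rows play no part in fitting, `rmse_test` estimates how the model might perform on new data.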
Cross-validation: preferred when the data is not large enough to split off a test set
library(vtreat)
splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)
nRows: number of rows in the training data
nSplits: number of folds (partitions) in the cross-validation
library(vtreat)
splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)
First fold (A and B to train, C to test)
splitPlan[[1]]
$train
1 2 4 5 7 9 10
$app
3 6 8
Train on A and B, test on C, etc...
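# Fit on the first fold's training rows, then predict its held-out rows.
# df$pred.cv should already exist (for example, initialized with df$pred.cv <- 0).
split <- splitPlan[[1]]
model <- lm(fmla, data = df[split$train, ])
df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])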
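Repeating the single-fold step for every fold gives each row a prediction from a model that never saw it. A minimal sketch of the full loop follows, assuming the `vtreat` package is installed. The data frame `df`, the formula `fmla`, and the choice of 3 folds are hypothetical and stand in for whatever the real analysis uses.

```r
library(vtreat)

# Hypothetical data and formula, for illustration only
set.seed(42)
df <- data.frame(x = rnorm(30))
df$y <- 2 * df$x + rnorm(30)
fmla <- y ~ x

nSplits <- 3
splitPlan <- kWayCrossValidation(nrow(df), nSplits, NULL, NULL)

# Initialize the prediction column so it can be filled in fold by fold
df$pred.cv <- 0
for (i in seq_len(nSplits)) {
  split <- splitPlan[[i]]
  # Train on the rows outside fold i
  model <- lm(fmla, data = df[split$train, ])
  # Predict only the rows held out in fold i
  df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
}

# Cross-validation estimate of RMSE
rmse_cv <- sqrt(mean((df$pred.cv - df$y)^2))
```

The resulting `rmse_cv` describes how well the modeling procedure is expected to generalize; it does not belong to any one of the fold models.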
Measure type | RMSE | $R^2$ |
---|---|---|
train | 0.7082675 | 0.8029275 |
test | 0.9349416 | 0.7451896 |
cross-validation | 0.8175714 | 0.7635331 |
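The two measures in the table can be computed from predictions and actual outcomes. The sketch below uses the standard definitions, $\mathrm{RMSE} = \sqrt{\operatorname{mean}((\hat{y} - y)^2)}$ and $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$. The helper names `rmse` and `r_squared` and the small example vectors are hypothetical; they are not the data behind the table.

```r
# Root mean squared error of predictions against actual outcomes
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

# R-squared: 1 minus residual sum of squares over total sum of squares
r_squared <- function(pred, actual) {
  rss <- sum((actual - pred)^2)
  tss <- sum((actual - mean(actual))^2)
  1 - rss / tss
}

# Hypothetical example values
actual <- c(1, 2, 3, 4)
pred   <- c(1.1, 1.9, 3.2, 3.8)
rmse(pred, actual)
r_squared(pred, actual)
```

Applying the same two functions to training predictions, test predictions, and cross-validation predictions would give the three rows of a comparison like the one above.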