Properly Training a Model

Supervised Learning in R: Regression

Nina Zumel and John Mount

Win-Vector, LLC

Models can perform much better on training data than they do on future data.

  • Training $R^2$ of 0.9 but test $R^2$ of 0.15: the model is overfit

Test/Train Split

Recommended method when data is plentiful
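
For example, a random split can be drawn with runif() (a minimal sketch; dframe is a hypothetical data frame standing in for your data):

# assign each row a uniform random number, then split roughly 75/25
set.seed(1234)                        # for reproducibility (seed is arbitrary)
gp <- runif(nrow(dframe))             # one draw per row
dframe_train <- dframe[gp < 0.75, ]   # ~75% of rows for training
dframe_test  <- dframe[gp >= 0.75, ]  # ~25% of rows for testing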


Example: Model Female Unemployment

  • Train on 66 rows, test on 30 rows
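
A minimal sketch of this step, assuming a train/test split as above; the object names (train, test) and column names (female_unemployment, male_unemployment) are assumptions here:

# fit the model on the training rows only
fmla <- female_unemployment ~ male_unemployment
unemployment_model <- lm(fmla, data = train)

# predict on both the training and test sets for comparison
train$pred <- predict(unemployment_model, newdata = train)
test$pred  <- predict(unemployment_model, newdata = test)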

Model Performance: Train vs. Test

  • Training: RMSE 0.71, $R^2$ 0.8
  • Test: RMSE 0.93, $R^2$ 0.75
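
These figures can be computed from the predictions with small helper functions (a sketch; the rmse() and r_squared() helpers and the column names are assumptions):

# root mean squared error and R-squared from outcome y and prediction pred
rmse      <- function(y, pred) sqrt(mean((y - pred)^2))
r_squared <- function(y, pred) 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

rmse(train$female_unemployment, train$pred)        # training RMSE (~0.71)
r_squared(train$female_unemployment, train$pred)   # training R^2  (~0.8)
rmse(test$female_unemployment, test$pred)          # test RMSE     (~0.93)
r_squared(test$female_unemployment, test$pred)     # test R^2      (~0.75)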

Cross-Validation

Preferred when data is not large enough to split off a test set


Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)
  • nRows: number of rows in the training data
  • nSplits: number of folds (partitions) in the cross-validation
    • e.g., nSplits = 3 for 3-way cross-validation
  • the remaining two arguments are not needed here

Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)

First fold (A and B to train, C to test)

splitPlan[[1]]
$train
[1]  1  2  4  5  7  9 10

$app
[1] 3 6 8

Train on A and B, test on C, and so on for each fold (the full loop is sketched below)

# first fold: fit on the training rows, predict on the held-out ("app") rows
split <- splitPlan[[1]]
model <- lm(fmla, data = df[split$train, ])
df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
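
A minimal sketch of the full loop over every fold in the plan (assuming df, fmla, and splitPlan as above; pred.cv collects each row's out-of-sample prediction):

df$pred.cv <- NA_real_                # column to hold cross-validation predictions
for (i in seq_along(splitPlan)) {
  split <- splitPlan[[i]]
  model_i <- lm(fmla, data = df[split$train, ])
  df$pred.cv[split$app] <- predict(model_i, newdata = df[split$app, ])
}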

Final Model

Once cross-validation has estimated how the modeling procedure performs on data it was not trained on, the final model is fit on all of the available data.
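As a sketch (object names as above):

# fit the final model on all of the data;
# the cross-validation predictions in pred.cv estimate how well
# a model fit this way will do on new data
final_model <- lm(fmla, data = df)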

Example: Unemployment Model

Measure type        RMSE        $R^2$
train               0.7082675   0.8029275
test                0.9349416   0.7451896
cross-validation    0.8175714   0.7635331
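
A sketch of how each row could be computed, reusing the hypothetical rmse() and r_squared() helpers from earlier and the pred / pred.cv prediction columns (the outcome column name is an assumption):

rmse(train$female_unemployment, train$pred)        # train row
rmse(test$female_unemployment,  test$pred)         # test row
rmse(df$female_unemployment,    df$pred.cv)        # cross-validation row

r_squared(train$female_unemployment, train$pred)   # train row
r_squared(test$female_unemployment,  test$pred)    # test row
r_squared(df$female_unemployment,    df$pred.cv)   # cross-validation row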

Let's practice!

