Out-of-sample error measures

Machine Learning with caret in R

Zach Mayer

Data Scientist at DataRobot and co-author of caret

Out-of-sample error

Want models that don't overfit and generalize well
Do the models perform well on new data?
Test models on new data, or a held-out test set
- Key insight of machine learning: judge a model by its error on data it was not fit to
- Validating on the training data (in-sample) almost guarantees overfitting
Primary goal of caret and this course: don’t overfit
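
One way to get "new data" is to hold out a random subset of rows as a test set. The sketch below is illustrative only and is not taken from the course: the seed, the 20/12 split size, and the names train_rows, model_random, and rmse_random are assumptions chosen for this example.

```r
# Assumption: a random 20/12 split; seed and split size are illustrative
data(mtcars)
set.seed(42)

# Randomly pick 20 row indices for the training set
train_rows <- sample(nrow(mtcars), 20)
train <- mtcars[train_rows, ]
test <- mtcars[-train_rows, ]  # the remaining 12 rows

# Fit on the training rows only
model_random <- lm(mpg ~ hp, data = train)

# Predict on rows the model has never seen and compute RMSE
predicted_random <- predict(model_random, test)
rmse_random <- sqrt(mean((predicted_random - test$mpg)^2))
rmse_random
```

A random split avoids depending on the row order of the dataset, at the cost of results that change with the seed.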

# Fit a model to the mtcars data
data(mtcars)
model <- lm(mpg ~ hp, mtcars[1:20, ])

# Predict out-of-sample
predicted <- predict(
  model, mtcars[21:32, ], type = "response"
)

# Evaluate error: RMSE on the held-out rows
actual <- mtcars[21:32, "mpg"]
sqrt(mean((predicted - actual) ^ 2))

5.507236

# Fit a model to the full dataset
model2 <- lm(mpg ~ hp, mtcars)

# Predict in-sample
predicted2 <- predict(
  model2, mtcars, type = "response"  # use model2, the full-data fit
)

# Evaluate error: RMSE on the same rows used for fitting
actual2 <- mtcars[, "mpg"]
sqrt(mean((predicted2 - actual2) ^ 2))

3.74

Compare to the out-of-sample RMSE of 5.5: the in-sample RMSE of 3.74 is lower and understates the error on new data.
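
The two numbers above come from different fits (one on 20 rows, one on all 32). As a rough sketch, the same comparison can be made for a single model by scoring it on both its training rows and the held-out rows. The rmse() helper and the names rmse_in and rmse_out are hypothetical, introduced only for this example; they are not caret functions.

```r
# Hypothetical helper (not part of caret): root mean squared error
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

data(mtcars)
train <- mtcars[1:20, ]   # same split as in the slides
test <- mtcars[21:32, ]

# Fit once, on the training rows only
model <- lm(mpg ~ hp, data = train)

# In-sample: score on the rows used for fitting
rmse_in <- rmse(predict(model, train), train$mpg)

# Out-of-sample: score on rows the model has not seen
rmse_out <- rmse(predict(model, test), test$mpg)

c(in_sample = rmse_in, out_of_sample = rmse_out)
```

Here rmse_out reproduces the out-of-sample RMSE computed earlier, and the in-sample value is typically the smaller of the two.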
