Multiple preprocessing methods

Machine Learning with caret in R

Zach Mayer

Data Scientist at DataRobot and co-author of caret

The wide world of preProcess

  • You can do a lot more than median or knn imputation!
  • Can chain together multiple preprocessing steps
  • Common "recipe" for linear models (order matters!)
    • Median imputation ⇒ center ⇒ scale ⇒ fit glm
  • See ?preProcess for more detail
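
To make the recipe concrete, here is a minimal base-R sketch of what the three steps do to a predictor table. It is a hand-rolled illustration, not caret's internal code; the toy data and the names `X_toy` and `impute_median` are made up for this example.

```r
# Toy predictor table with one missing value (made-up numbers)
X_toy <- data.frame(
  a = c(1, 2, NA, 4, 10),
  b = c(10, 20, 30, 40, 50)
)

# Step 1: median imputation, column by column
impute_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}
X_imp <- as.data.frame(lapply(X_toy, impute_median))

# Steps 2 and 3: center each column to mean 0, then scale to sd 1
X_std <- as.data.frame(scale(X_imp, center = TRUE, scale = TRUE))

colMeans(X_std)    # approximately 0 for every column
sapply(X_std, sd)  # 1 for every column
```

In this sketch the imputation runs first so that the means and standard deviations used for centering and scaling are computed on complete columns.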

Example: preprocessing mtcars

# Load caret, then generate some data with missing values
library(caret)
data(mtcars)
set.seed(42)
mtcars[sample(1:nrow(mtcars), 10), "hp"] <- NA  # hp is now missing at random
Y <- mtcars$mpg
X <- mtcars[, 2:4]  # predictors: cyl, disp, hp
# Use linear model "recipe"
set.seed(42)
model <- train(
  X, Y, method = "glm",
  preProcess = c("medianImpute", "center", "scale")
)
print(min(model$results$RMSE))
3.612713

Example: preprocessing mtcars

# PCA before modeling
set.seed(42)
model <- train(
  X, Y, method = "glm",
  preProcess = c("medianImpute", "center", "scale", "pca")
)
min(model$results$RMSE)
3.402557
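
As a rough picture of what the "pca" step contributes, the base-R sketch below standardizes the same three predictors (without the injected missing values), rotates them into principal components, and keeps the fewest components that reach 95% of the variance. The 95% cutoff is meant to mirror the `thresh` default described in ?preProcess; the variable names are illustrative only, and this is not caret's own implementation.

```r
data(mtcars)
X_pca <- mtcars[, 2:4]  # cyl, disp, hp (complete data, no NAs)

# Center, scale, and rotate into principal components
pca <- prcomp(X_pca, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the components
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the fewest components that reach 95% of the variance
n_keep <- which(var_explained >= 0.95)[1]
X_reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

n_keep
head(X_reduced)
```

The retained components are uncorrelated with each other, which is one reason PCA can help a linear model when the original predictors are strongly correlated.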

Example: preprocessing mtcars

# Spatial sign transform
set.seed(42)
model <- train(
  X, Y, method = "glm",
  preProcess = c("medianImpute", "center", "scale", "spatialSign")
)
min(model$results$RMSE)
4.284904
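
The spatial sign transform projects each observation onto the unit sphere, which limits the influence of outlying rows. A minimal base-R sketch of the idea, assuming complete data and using illustrative variable names (this is not caret's internal code):

```r
data(mtcars)
X_ss <- scale(mtcars[, 2:4])  # center and scale first

# Divide each row by its Euclidean norm
row_norms <- sqrt(rowSums(X_ss^2))
X_sign <- X_ss / row_norms    # recycling divides row i by row_norms[i]

range(sqrt(rowSums(X_sign^2)))  # every row now has length 1
```

Because every row ends up at the same distance from the origin, only the direction of each observation is kept. That can discard useful signal, which may be why the RMSE above is worse on this particular data.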

Preprocessing cheat sheet

  • Start with median imputation
  • Try KNN imputation if data are not missing at random
  • For linear models ...
    • Center and scale
    • Try PCA and spatial sign
  • Tree-based models don't need much preprocessing

Let’s practice!

