Median imputation

Machine Learning with caret in R

Max Kuhn

Software Engineer at RStudio and creator of caret

Dealing with missing values

  • Most models require numeric inputs and can’t handle missing data
  • Common approach: remove rows with missing data
    • Can bias the data
    • Can produce over-confident models
  • Better strategy: median imputation!
    • Replace missing values with medians
    • Works well if data are missing at random (MAR)
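
To make the idea concrete, here is a minimal base R sketch of median imputation done by hand. The vector `x` is made-up illustrative data, not taken from the course, and this is only one straightforward way to write it.

```r
# Toy vector with missing values (illustrative data, not from the slides)
x <- c(110, 93, NA, 175, 105, NA, 245)

# Median of the observed values only; na.rm = TRUE skips the NAs
med <- median(x, na.rm = TRUE)

# Replace each missing entry with that median, keep observed values as-is
x_imputed <- ifelse(is.na(x), med, x)

print(med)
print(x_imputed)
```

Every `NA` receives the same value, so the imputed column keeps its median but has less spread than the fully observed data would have.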

Example: mtcars

# Generate some data with missing values
data(mtcars)
set.seed(42)
mtcars[sample(1:nrow(mtcars), 10), "hp"] <- NA
# Split target from predictors
Y <- mtcars$mpg
X <- mtcars[, 2:4]
# Try to fit a caret model
library(caret)
model <- train(X, Y)
Error in train.default(X, Y) : Stopping 
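
The error message itself says little, so it can help to check where the missing values are before fitting. The sketch below rebuilds the same data and counts `NA`s per predictor. The names `na_per_column` and `n_complete` are introduced here for illustration only.

```r
# Recreate the example data with 10 missing hp values
data(mtcars)
set.seed(42)
mtcars[sample(1:nrow(mtcars), 10), "hp"] <- NA
X <- mtcars[, 2:4]

# Number of missing values in each predictor column
na_per_column <- colSums(is.na(X))
print(na_per_column)

# Number of rows that would survive if incomplete rows were dropped
n_complete <- sum(complete.cases(X))
print(n_complete)
```

Only `hp` contains missing values, and dropping incomplete rows would discard 10 of the 32 cars, which is the row-removal approach the earlier slide warns about.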

A simple solution

# Now fit with median imputation
model <- train(X, Y, preProcess = "medianImpute")
print(model)
Random Forest 

32 samples
 3 predictor

Pre-processing: median imputation (3) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 32, 32, 32, 32, 32, 32, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared 
  2     2.617096  0.8234652
  3     2.670550  0.8164535

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2. 
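
To see what the `"medianImpute"` step does to the predictors, the same transformation can be applied outside of `train()` with caret's `preProcess()` and `predict()`. This is a sketch for inspection only: as far as I know, `train()` re-estimates the pre-processing within each resample, so the medians it uses may differ slightly from the full-data median shown here. The names `pp`, `X_imputed`, and `hp_median` are introduced for illustration.

```r
library(caret)

# Recreate the example data with 10 missing hp values
data(mtcars)
set.seed(42)
mtcars[sample(1:nrow(mtcars), 10), "hp"] <- NA
X <- mtcars[, 2:4]

# Estimate the imputation values (column medians) from X
pp <- preProcess(X, method = "medianImpute")

# Apply them: NAs in each column are replaced by that column's median
X_imputed <- predict(pp, X)

# Compare against the median computed by hand from the observed hp values
hp_median <- median(X$hp, na.rm = TRUE)
print(anyNA(X_imputed))
print(all(X_imputed$hp[is.na(X$hp)] == hp_median))
```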

Let’s practice!