KNN imputation

Machine Learning with caret in R

Zach Mayer

Data Scientist at DataRobot and co-author of caret

Dealing with missing values

  • Median imputation is fast, but…
  • Can produce incorrect results if data are missing not at random
  • k-nearest neighbors (KNN) imputation
  • Imputes based on "similar" non-missing rows
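
The idea of imputing from "similar" non-missing rows can be sketched in base R. This is a minimal illustration, not caret's implementation: the helper `knn_impute_column()` is a hypothetical name, it assumes the predictor columns themselves have no missing values, and it fills each gap with the mean of the k nearest complete rows.

```r
# Hypothetical helper (not part of caret): fill NAs in one column using
# the mean of that column over the k nearest rows where it is observed.
knn_impute_column <- function(data, target, k = 5) {
  predictors <- setdiff(names(data), target)
  # Standardize predictors so no single column dominates the distance
  scaled <- scale(data[, predictors, drop = FALSE])
  missing_rows  <- which(is.na(data[[target]]))
  complete_rows <- which(!is.na(data[[target]]))
  n_neighbors <- min(k, length(complete_rows))
  for (i in missing_rows) {
    # Euclidean distance from row i to every row with an observed target
    diffs <- sweep(scaled[complete_rows, , drop = FALSE], 2, scaled[i, ])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- complete_rows[order(dists)[seq_len(n_neighbors)]]
    data[i, target] <- mean(data[nearest, target])
  }
  data
}

# Work on a fresh copy so the built-in dataset is left untouched
cars <- datasets::mtcars[, c("cyl", "disp", "hp")]
cars[cars$disp < 140, "hp"] <- NA
imputed <- knn_impute_column(cars, "hp", k = 5)
print(imputed[is.na(cars$hp), ])
```

Each small car receives the average horsepower of the cars closest to it in cyl and disp, rather than one global value.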

Example: missing not at random

  • Pretend smaller cars don’t report horsepower
  • Median imputation is incorrect in this case: it assigns small cars a medium-to-large horsepower
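# Load caret, which provides train()
library(caret)

# Generate data with missing values: hide hp for small-displacement cars
mtcars[mtcars$disp < 140, "hp"] <- NA
Y <- mtcars$mpg      # outcome: miles per gallon
X <- mtcars[, 2:4]   # predictors: cyl, disp, hp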

# Use median imputation
model <- train(X, Y, method = "glm", preProcess = "medianImpute")
print(min(model$results$RMSE))
3.612713
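
To see why the median is a poor fill value here, one can compare it with the horsepower values that were hidden. This is a small base R check on a fresh copy of the data; the variable names are illustrative only.

```r
# Fresh copy of the data, before any values are set to NA
cars <- datasets::mtcars

# Rows whose hp is hidden in the example above
small <- cars$disp < 140

# Median imputation fills every gap with the median of the observed values
median_fill <- median(cars$hp[!small])

# The true (hidden) horsepower of the small cars
true_small_hp <- cars$hp[small]

print(median_fill)
print(summary(true_small_hp))
```

The fill value comes only from the larger cars, so it sits above the typical horsepower of the small cars it is standing in for.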

Example: missing not at random

  • KNN imputation is better
  • Uses cars with similar disp / cyl to impute
  • Yields a slightly more accurate (but slower) model
# Use KNN imputation
set.seed(42)
model <- train(
  X, Y, method = "glm", preProcess = "knnImpute"
)
print(min(model$results$RMSE))
3.558881

Compare to 3.61 for median imputation

Let’s practice!
