Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer
Use machine learning models to predict missing values!
This course uses the missForest package, which is based on randomForest.
nhanes %>% is.na() %>% colSums()
Age Gender Weight Height Diabetes TotChol Pulse PhysActive
  0      0      9      8        1      85    32         26
library(missForest)
imp_res <- missForest(nhanes)
nhanes_imp <- imp_res$ximp
nhanes_imp %>% is.na() %>% colSums()
Age Gender Weight Height Diabetes TotChol Pulse PhysActive
  0      0      0      0        0       0     0          0
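The per-column count of missing values used above can also be computed in base R without the pipe. The following is a minimal sketch on a hypothetical toy data frame (the name toy and its columns are made up for illustration and are not the course's nhanes data):

```r
# Hypothetical toy data (assumption: not the course's nhanes dataset)
toy <- data.frame(
  age    = c(34, 51, 27, 45),
  weight = c(70.5, NA, 82.1, NA),
  smoker = factor(c("no", "yes", NA, "no"))
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Running the same line on an imputed data frame is a quick check that no missing values remain.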
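missForest()
provides an out-of-bag (OOB) imputation error estimate:
Normalized root mean squared error (NRMSE) for continuous variables.
Proportion of falsely classified entries (PFC) for categorical variables.
In both cases, a value close to 0 indicates good performance, and a value around 1 indicates a poor result.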
imp_res <- missForest(nhanes)
imp_res$OOBerror
NRMSE PFC
0.147687025 0.003676471
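To make the two error measures concrete, here is a base-R sketch of how they could be computed when the true values are known. It assumes the usual definitions (squared error normalized by the variance of the true values, and the share of wrongly imputed categories). The helper names nrmse and pfc and the toy vectors are hypothetical. The OOB error reported by missForest() is an estimate of these quantities obtained without access to the true values.

```r
# NRMSE for continuous values: mean squared error scaled by the
# variance of the true values, then square-rooted
nrmse <- function(imputed, truth) {
  sqrt(mean((imputed - truth)^2) / var(truth))
}

# PFC for categorical values: share of imputed entries that are wrong
pfc <- function(imputed, truth) {
  mean(as.character(imputed) != as.character(truth))
}

# Hypothetical true values and imputations for entries that were missing
true_num <- c(10, 12, 14, 16)
imp_num  <- c(11, 12, 13, 16)
true_cat <- c("a", "b", "a", "b")
imp_cat  <- c("a", "b", "b", "b")

nrmse_val <- nrmse(imp_num, true_num)
pfc_val   <- pfc(imp_cat, true_cat)
print(c(NRMSE = nrmse_val, PFC = pfc_val))
```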
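With variablewise = TRUE, missForest() reports the OOB error separately for each variable: mean squared error (MSE) for continuous variables and proportion of falsely classified entries (PFC) for categorical ones.
Values close to 0 still indicate good performance, but MSE is on the squared scale of each variable, so it is not bounded by 1.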
imp_res <- missForest(nhanes, variablewise = TRUE)
imp_res$OOBerror
    MSE     PFC       MSE      MSE     PFC     MSE       MSE     MSE
0.00000 0.00000 285.79563 40.42142 0.00735 0.53444 129.03609 0.17576
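The variable-wise output is easier to read when each error is paired with its variable. The sketch below does this using the numbers printed above, assuming the errors follow the column order of the data (Age, Gender, ...). The name oob_table is made up for illustration. With a real result, the variable names could be taken from names(nhanes) instead of being typed by hand.

```r
# Variable names in column order, and the values printed above
vars   <- c("Age", "Gender", "Weight", "Height",
            "Diabetes", "TotChol", "Pulse", "PhysActive")
metric <- c("MSE", "PFC", "MSE", "MSE", "PFC", "MSE", "MSE", "MSE")
err    <- c(0.00000, 0.00000, 285.79563, 40.42142,
            0.00735, 0.53444, 129.03609, 0.17576)

# Assumption: errors are reported in the same order as the columns
oob_table <- data.frame(variable = vars, metric = metric, error = err)
print(oob_table)
```

Because MSE depends on the scale of each variable, the MSE rows are not directly comparable with each other or with the PFC rows.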
Growing multiple random forests can be time-consuming.
Idea: sacrifice some accuracy and reduce the forest size to decrease computation time.
Reduce the number of trees grown in each forest (ntree argument).
Reduce the number of variables randomly sampled at each split (mtry argument).
The effect on computation time differs:
Reducing ntree has a linear effect.
Reducing mtry increases speed more when there are many variables.
Default settings:
start_time <- Sys.time()
imp_res <- missForest(nhanes)
end_time <- Sys.time()
print(imp_res$OOBerror)
print(end_time - start_time)
NRMSE PFC
0.147687025 0.003676471
Time difference of 5.496582 secs
Reduced forests:
start_time <- Sys.time()
imp_res <- missForest(nhanes,
                      ntree = 10,
                      mtry = 2)
end_time <- Sys.time()
print(imp_res$OOBerror)
print(end_time - start_time)
NRMSE PFC
0.162420139 0.007425743
Time difference of 0.516367 secs
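The same speed versus accuracy comparison can be tried without the course data. The sketch below is an illustration under assumptions: it uses the built-in iris data as a stand-in for nhanes, removes 10% of the entries with prodNA() from the missForest package, and times both runs with system.time(). The reduced settings (ntree = 10, mtry = 1) are arbitrary choices for this small dataset, and timings will vary by machine.

```r
library(missForest)  # provides missForest() and prodNA()

set.seed(42)
# Stand-in data (assumption): iris with 10% of entries removed at random
iris_mis <- prodNA(iris, noNA = 0.1)

# Default forests (ntree = 100 trees per forest)
time_default <- system.time(imp_default <- missForest(iris_mis))

# Reduced forests: fewer trees and fewer candidate variables per split
time_small <- system.time(
  imp_small <- missForest(iris_mis, ntree = 10, mtry = 1)
)

# Compare the OOB error estimates and the elapsed time in seconds
print(imp_default$OOBerror)
print(imp_small$OOBerror)
print(c(default = unname(time_default["elapsed"]),
        reduced = unname(time_small["elapsed"])))
```

On a dataset this small the time saved is modest; the trade-off matters more as the number of rows and variables grows.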