Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer



The model used to impute each variable depends on that variable's type.
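As a rough illustration of that idea, the sketch below picks a regression family from a column's class. The helper name suggest_model and the specific pairing (linear for continuous, logistic for binary, multinomial logistic for other categorical, Poisson for counts) are assumptions made for illustration; this section does not spell the mapping out.

```r
# Hypothetical helper: map a column's type to a commonly used model family.
suggest_model <- function(x) {
  if (is.logical(x) || (is.factor(x) && nlevels(x) == 2)) {
    "logistic regression"              # binary outcome
  } else if (is.factor(x) || is.character(x)) {
    "multinomial logistic regression"  # other categorical columns
  } else if (is.integer(x)) {
    "Poisson regression"               # integer columns treated as counts
  } else if (is.numeric(x)) {
    "linear regression"                # continuous outcome
  } else {
    "no default"
  }
}

# Made-up data frame with one column of each kind
toy <- data.frame(
  height = c(170.5, NA, 182.1, 165.0),
  smoker = factor(c("yes", "no", NA, "yes")),
  visits = c(1L, 0L, 3L, NA),
  blood  = factor(c("A", "B", "O", "A"))
)
suggested <- sapply(toy, suggest_model)
print(suggested)
```

Treating every integer column as a count is a heuristic; a variable such as age in whole years would usually still be modeled as continuous.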
Impute Height and Weight in nhanes with a linear model:
library(simputation)
# Impute both variables in one call, using the remaining columns as predictors
nhanes_imp <- impute_lm(nhanes, Height + Weight ~ .)
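Check whether they were imputed:
library(dplyr)  # provides the %>% pipe
nhanes_imp %>%
  is.na() %>%
  colSums()
Age Gender Weight Height Diabetes TotChol Pulse PhysActive
  0      0     32     30        1      85    32         26
Weight and Height still contain missing values: impute_lm() cannot predict a value for a row whose predictors are themselves missing, so those cells stay NA.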
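The same pattern can be reproduced on a toy data set with base R. The sketch below is not how simputation is implemented; it only shows that a fitted linear model has no prediction for a row whose predictor is missing. The data frame toy and its values are made up.

```r
toy <- data.frame(
  height = c(150, 160, 170, 180, NA, NA),
  weight = c(50, 60, 70, 80, 65, NA)
)

# lm() drops incomplete rows when fitting (R's default na.action is na.omit)
fit <- lm(height ~ weight, data = toy)

# predict() returns NA wherever the predictor is missing
pred <- predict(fit, newdata = toy)

# Fill only the missing heights with the model's predictions:
# row 5 gets a value, row 6 stays NA because its weight is missing too
filled <- ifelse(is.na(toy$height), pred, toy$height)
print(is.na(filled))
```

Rows like the sixth one are why the missing values need starting values before model-based imputation can reach them.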
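Initialize the missing values with hotdeck and save the missing locations:
library(VIM)
# hotdeck() fills the NAs and adds logical <variable>_imp columns flagging the imputed cells
nhanes_imp <- hotdeck(nhanes)
missing_height <- nhanes_imp$Height_imp
missing_weight <- nhanes_imp$Weight_imp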
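To make the two steps concrete (fill the gaps with donor values, remember where the gaps were), here is a minimal base R sketch of a random hot-deck on a single vector. It is a simplification and not VIM's algorithm; simple_hotdeck and the example values are hypothetical.

```r
set.seed(42)  # make the random donor draw reproducible

# Hypothetical helper: fill NAs in a vector with randomly drawn observed values.
simple_hotdeck <- function(x) {
  miss <- is.na(x)
  donors <- x[!miss]
  # Index with sample.int() so a single donor is not misread as a range
  x[miss] <- donors[sample.int(length(donors), sum(miss), replace = TRUE)]
  x
}

height <- c(172, NA, 165, 180, NA, 158)
missing_height <- is.na(height)        # remember where the gaps were
height_init <- simple_hotdeck(height)  # starting values for later iterations
print(height_init)
```

Keeping missing_height separately matters because, after initialization, the data no longer reveal which cells were originally empty.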
Iterate over Height and Weight 5 times, imputing at the originally missing locations:
for (i in 1:5) {
  # Reset the originally missing heights, then re-impute them from the current data
  nhanes_imp$Height[missing_height] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Height ~ Age + Gender + Weight)
  # Do the same for Weight, now using the freshly imputed Height
  nhanes_imp$Weight[missing_weight] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Weight ~ Age + Gender + Height)
}
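For readers without simputation installed, the loop above can be approximated in base R. This is a sketch under stated assumptions: the data are made up, the initialization uses column means as a stand-in for hot-deck, and reimpute is a hypothetical helper rather than anything simputation provides.

```r
# Toy data: correlated variables, each with some gaps (values are made up)
toy <- data.frame(
  age    = c(20, 30, 40, 50, 60, 70, 25, 35),
  height = c(160, 165, NA, 175, 180, NA, 162, 168),
  weight = c(55, NA, 70, 75, NA, 85, 57, 64)
)
miss_h <- is.na(toy$height)
miss_w <- is.na(toy$weight)

# Crude initialization with column means (a stand-in for hot-deck)
toy$height[miss_h] <- mean(toy$height, na.rm = TRUE)
toy$weight[miss_w] <- mean(toy$weight, na.rm = TRUE)

# Hypothetical helper: blank the originally missing cells, refit, and refill them
reimpute <- function(data, target, predictors, was_missing) {
  data[[target]][was_missing] <- NA
  fit <- lm(reformulate(predictors, response = target), data = data)
  data[[target]][was_missing] <- predict(fit, newdata = data[was_missing, , drop = FALSE])
  data
}

for (i in 1:5) {
  toy <- reimpute(toy, "height", c("age", "weight"), miss_h)
  toy <- reimpute(toy, "weight", c("age", "height"), miss_w)
}
print(toy)
```

Because each pass only rewrites the originally missing cells, the observed values are never altered, while the imputed ones are refined using the latest estimates of the other variable.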
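Track how much the imputed values change between iterations to detect convergence:
# mapc() is not defined in this section; it is assumed to be a user-defined helper
# returning the mean absolute percentage change between two vectors.
diff_height <- c()
diff_weight <- c()
for (i in 1:5) {
  prev_iter <- nhanes_imp
  nhanes_imp$Height[missing_height] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Height ~ Age + Gender + Weight)
  nhanes_imp$Weight[missing_weight] <- NA
  nhanes_imp <- impute_lm(nhanes_imp, Weight ~ Age + Gender + Height)
  diff_height <- c(diff_height, mapc(prev_iter$Height, nhanes_imp$Height))
  diff_weight <- c(diff_weight, mapc(prev_iter$Weight, nhanes_imp$Weight))
}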
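The loop relies on mapc(), whose definition is not shown here. One plausible definition, assuming the name stands for mean absolute percentage change, is sketched below; the tolerance tol and the example vectors are hypothetical.

```r
# Assumed definition: mean absolute percentage change from a to b.
# Entries where a is 0 would yield Inf or NaN, so this suits strictly
# positive measurements such as height and weight.
mapc <- function(a, b) {
  mean(abs(b - a) / abs(a), na.rm = TRUE)
}

prev <- c(100, 200, 50)
curr <- c(110, 180, 50)
change <- mapc(prev, curr)
print(change)

# A simple stopping rule: treat the imputation as converged below a tolerance
tol <- 0.01  # hypothetical threshold
converged <- change < tol
print(converged)
```

With a helper like this, diff_height and diff_weight hold one value per iteration; if the procedure is converging, those values should shrink toward zero as the iterations proceed.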
