Handling Missing Data with Imputations in R
Michal Oleszak
Machine Learning Engineer
For each observation with missing values:
The distance between two observations a and b:
Euclidean distance for n numeric variables:
$\sqrt{\sum_{i=1}^{n} (a_i - b_i)^{2}}$
Manhattan distance for f factor variables:
$\sum_{i=1}^{f} |a_i - b_i|$
Hamming distance for c categorical variables:
$\sum_{i=1}^{c} I(a_i \neq b_i)$
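The three distance formulas above can be checked by hand on a pair of toy observations. This is only a sketch in base R: the vectors `a_num`, `b_num`, `a_fac`, `b_fac`, `a_cat` and `b_cat` are made-up values, and the factor variables are assumed to be coded as integer levels. It is not how VIM computes distances internally.

```r
# Two toy observations a and b; all values are invented for illustration.

# Euclidean distance over the numeric variables
a_num <- c(1, 2, 3)
b_num <- c(2, 4, 5)
euclidean <- sqrt(sum((a_num - b_num)^2))

# Manhattan distance over factor variables (assumed coded as integer levels)
a_fac <- c(1, 3)
b_fac <- c(2, 1)
manhattan <- sum(abs(a_fac - b_fac))

# Hamming distance over categorical variables: count of mismatches
a_cat <- c("male", "yes", "blue")
b_cat <- c("male", "no", "red")
hamming <- sum(a_cat != b_cat)

print(c(euclidean = euclidean, manhattan = manhattan, hamming = hamming))
```

Each measure sums a per-variable difference, so the three can be combined into a single distance between observations with mixed variable types.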
library(VIM)

# Impute TotChol and Pulse from each observation's 5 nearest neighbors
nhanes_imp <- kNN(nhanes, k = 5, variable = c("TotChol", "Pulse"))
head(nhanes_imp)
  Age Gender Weight Height Diabetes TotChol Pulse PhysActive TotChol_imp Pulse_imp
1  16   male   73.2  172.0    FALSE    3.00    76       TRUE       FALSE     FALSE
2  17   male   72.3  176.0    FALSE    2.61    74       TRUE       FALSE     FALSE
3  12   male   57.7  158.9    FALSE    4.27    80       TRUE       FALSE     FALSE
4  16   male   88.9  183.3    FALSE    3.62    58       TRUE       FALSE     FALSE
5  13 female   45.1  157.6    FALSE    2.66    92       TRUE       FALSE     FALSE
6  16 female   48.7  158.4    FALSE    4.32    58      FALSE       FALSE     FALSE
library(dplyr)  # provides the %>% pipe

# Aggregate the 5 neighbors with a distance-weighted mean
nhanes_imp <- nhanes %>%
  kNN(variable = c("TotChol", "Pulse"),
      k = 5,
      numFun = weighted.mean,
      weightDist = TRUE)
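To see why a weighted mean can differ from a plain mean of the neighbors, here is a small base R sketch. The neighbor values and distances are invented, and the inverse-distance weights are an assumption made for illustration; the exact weighting that `kNN()` applies when `weightDist = TRUE` may differ.

```r
# Values of the variable to impute, taken from 3 hypothetical neighbors
neighbor_values <- c(4, 5, 6)

# Hypothetical distances from the incomplete observation to each neighbor
neighbor_dist <- c(0.1, 0.2, 0.5)

# Assumed weighting scheme: closer neighbors get larger weights
w <- 1 / neighbor_dist

plain_mean <- mean(neighbor_values)
weighted_imputation <- weighted.mean(neighbor_values, w)

print(c(plain = plain_mean, weighted = weighted_imputation))
```

Because the closest neighbor has the smallest value here, the weighted result is pulled below the plain mean.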
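# Order variable names from fewest to most missing values
vars_by_NAs <- nhanes %>%
  is.na() %>%
  colSums() %>%
  sort(decreasing = FALSE) %>%
  names()

# Reorder the columns (select() and all_of() come from dplyr), then impute
nhanes_imp <- nhanes %>%
  select(all_of(vars_by_NAs)) %>%
  kNN(k = 5)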
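The ordering step can be tried on a tiny data frame without any packages. This sketch uses a made-up data frame `toy` with columns `a`, `b` and `c`; it only illustrates how the column order is derived, not the imputation itself.

```r
# Toy data: a has 2 NAs, b has none, c has 1
toy <- data.frame(
  a = c(1, NA, NA),
  b = c(1, 2, 3),
  c = c(NA, 2, 3)
)

# Count NAs per column, sort ascending, keep the names
na_counts <- colSums(is.na(toy))
vars_by_NAs <- names(sort(na_counts, decreasing = FALSE))

# Reorder the columns so the most complete variables come first
toy_sorted <- toy[, vars_by_NAs]
print(names(toy_sorted))
```

Passing `toy_sorted` to `kNN()` would then present the variables in this order, from most to least complete.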