k-Nearest-Neighbors imputation

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

k-Nearest-Neighbors imputation

A table with three columns: A, B and C. There is one missing value in column A.

A table with three columns: A, B and C. There is one missing value in column A. Three of the rows not containing the missing value are highlighted in color.

A table with three columns: A, B and C. Three of the rows not containing the missing value are highlighted in color. The previously missing value in A has been replaced with the mean of the numbers in the same column in the highlighted rows.

For each observation with missing values:

Find the k other observations (donors, or neighbors) that are most similar to that observation.
Replace each missing value with an aggregate of the k donors' values: the mean or median for numeric variables, the mode for categorical ones.
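
The two steps above can be sketched in base R. This is a minimal illustration on made-up data, not how any particular package implements it: the helper `impute_knn_mean()` is hypothetical, it assumes all other columns are numeric and complete, and it uses unscaled Euclidean distance with the mean as the aggregate.

```r
# Toy data: column A has one missing value (row 4).
df <- data.frame(
  A = c(1.0, 2.0, 3.0, NA, 10.0),
  B = c(1.0, 2.0, 3.0, 2.5, 10.0),
  C = c(2.0, 3.0, 4.0, 3.0, 12.0)
)

# Hypothetical helper: impute one numeric column from its k nearest donors.
impute_knn_mean <- function(data, target, k = 3) {
  predictors <- setdiff(names(data), target)
  donors <- which(!is.na(data[[target]]))   # rows that can donate a value
  for (i in which(is.na(data[[target]]))) {
    # Step 1: Euclidean distance from row i to every donor row.
    diffs <- sweep(as.matrix(data[donors, predictors]), 2,
                   unlist(data[i, predictors]))
    dists <- sqrt(rowSums(diffs^2))
    nearest <- donors[order(dists)[seq_len(k)]]
    # Step 2: replace the missing value with the donors' mean.
    data[i, target] <- mean(data[nearest, target])
  }
  data
}

df_imp <- impute_knn_mean(df, target = "A", k = 3)
df_imp$A[4]
```

Rows 1 to 3 are the closest donors here, so the missing value becomes the mean of 1, 2 and 3. Row 5 is far away and does not contribute.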

Distance measures

The distance between two observations a and b:

Euclidean distance for n numeric variables:

$\sqrt{\sum_{i=1}^{n} (a_i - b_i)^{2}}$

Manhattan distance for f ordered factor variables (levels coded as integer ranks):

$\sum_{i=1}^{f} |a_i - b_i|$

Hamming distance for c categorical variables:

$\sum_{i=1}^{c} I(a_i \neq b_i)$

A coordinate system with two points connected with a straight line.

A coordinate system with two points connected with two perpendicular lines, as though the points were two opposite corners of a rectangle.
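
The three distance measures can be checked by hand on small made-up vectors. The sketch below assumes that ordered factor levels are encoded as integer ranks; all variable names are illustrative.

```r
# Two observations described by three numeric variables.
a_num <- c(1, 2, 3)
b_num <- c(4, 6, 3)
euclidean <- sqrt(sum((a_num - b_num)^2))   # sqrt(9 + 16 + 0)

# Two observations described by two ordered factors, coded as ranks.
a_ord <- c(1, 3)
b_ord <- c(2, 1)
manhattan <- sum(abs(a_ord - b_ord))        # 1 + 2

# Two observations described by three categorical variables.
a_cat <- c("male", "yes", "blue")
b_cat <- c("male", "no", "green")
hamming <- sum(a_cat != b_cat)              # number of mismatches

c(euclidean = euclidean, manhattan = manhattan, hamming = hamming)
```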

Gower distance

A mock data frame containing three types of variables, each highlighted with a different color: numeric, factor and categorical variables.
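
Gower distance combines the measures above so that mixed-type data can be compared: each variable contributes a dissimilarity between 0 and 1, and the contributions are averaged. The sketch below is one common formulation written from that definition, not the code any package uses. The helper `gower_pair()` and the toy data are hypothetical, and treating ordered factors as range-scaled ranks is an assumption.

```r
# Toy data with one variable of each type.
df <- data.frame(
  height = c(160, 170, 180),
  grade  = factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  gender = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Gower distance between rows i and j.
gower_pair <- function(data, i, j) {
  per_variable <- vapply(names(data), function(v) {
    x <- data[[v]]
    if (is.ordered(x)) x <- as.integer(x)    # ordered factor -> ranks
    if (is.numeric(x)) {
      abs(x[i] - x[j]) / diff(range(x))      # range-scaled absolute difference
    } else {
      as.numeric(x[i] != x[j])               # 0 if equal, 1 if different
    }
  }, numeric(1))
  mean(per_variable)                         # average over all variables
}

gower_pair(df, 1, 3)
gower_pair(df, 1, 2)
```

Rows 1 and 3 differ fully in height (1), by half the rank range in grade (0.5) and not at all in gender (0), which averages to 0.5.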

kNN imputation in practice

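library(VIM)

# Impute only TotChol and Pulse, using the 5 nearest donors for each missing value
nhanes_imp <- kNN(nhanes, k = 5, variable = c("TotChol", "Pulse"))

# The added *_imp columns flag which values were imputed
head(nhanes_imp)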

  Age Gender Weight Height Diabetes TotChol Pulse PhysActive TotChol_imp Pulse_imp
1  16   male   73.2  172.0    FALSE    3.00    76       TRUE       FALSE     FALSE
2  17   male   72.3  176.0    FALSE    2.61    74       TRUE       FALSE     FALSE
3  12   male   57.7  158.9    FALSE    4.27    80       TRUE       FALSE     FALSE
4  16   male   88.9  183.3    FALSE    3.62    58       TRUE       FALSE     FALSE
5  13 female   45.1  157.6    FALSE    2.66    92       TRUE       FALSE     FALSE
6  16 female   48.7  158.4    FALSE    4.32    58      FALSE       FALSE     FALSE

Weighting donors

Of the k neighbors chosen for an observation, some are more similar to it than others.
We might want to put more weight on the closer neighbors when aggregating their values.
To do so, aggregate the neighbors with a weighted mean, using the inverse of the distance to each neighbor as its weight.
This is only possible when imputing numeric variables.

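library(dplyr)  # provides the %>% pipe

nhanes_imp <- nhanes %>% 
  kNN(variable = c("TotChol", "Pulse"),
      k = 5,
      numFun = weighted.mean,  # aggregate numeric donors with a weighted mean
      weightDist = TRUE)       # derive the weights from the donor distances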
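
The effect of inverse-distance weighting can be seen on a small made-up example. The donor values and distances below are illustrative, and the exact weights VIM computes internally may differ in detail.

```r
# Three donors for one missing value, with their distances to the target row.
donor_values <- c(4.0, 5.0, 6.0)
donor_dists  <- c(0.5, 1.0, 2.0)

# Closer donors get larger weights: 2, 1 and 0.5.
weights <- 1 / donor_dists

simple_mean   <- mean(donor_values)
weighted_mean <- weighted.mean(donor_values, w = weights)

c(simple = simple_mean, weighted = weighted_mean)
```

The weighted mean is pulled below the simple mean of 5 because the closest donor holds the smallest value.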

Sorting variables

The kNN algorithm loops over the variables, imputing them one by one.
Each time, the distances between observations are recalculated.
If the first variable had a lot of missing values, the distance calculation for the second variable will be based on many imputed values.
To limit this, sort the variables in ascending order by their number of missing values before running kNN.

Sorting variables in practice

vars_by_NAs <- nhanes %>% 
  is.na() %>%
  colSums() %>%
  sort(decreasing = FALSE) %>% 
  names()

nhanes_imp <- nhanes %>% 
  select(all_of(vars_by_NAs)) %>%  # all_of() selects columns named in a character vector
  kNN(k = 5)
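
To see what the sorting step produces, here is the same idea in base R on a made-up data frame. The `toy` data and its column names are illustrative only.

```r
# x has three missing values, z has one, y has none.
toy <- data.frame(
  x = c(NA, NA, NA, 4),
  y = c(1, 2, 3, 4),
  z = c(1, NA, 3, 4)
)

# Count missing values per column, then order the names ascending by count.
na_counts <- colSums(is.na(toy))
vars_by_NAs <- names(sort(na_counts))
vars_by_NAs

# Reorder the columns so the most complete variables come first.
toy_sorted <- toy[, vars_by_NAs]
names(toy_sorted)
```

The columns come out as y, z, x, so the variable with the most missing values is imputed last.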

Let's practice kNN imputation!