k-Nearest-Neighbors imputation

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

A table with three columns: A, B and C. Three of the rows not containing the missing value are highlighted in color. The previously missing value in A has been replaced with the mean of the numbers in the same column in the highlighted rows.

For each observation with missing values:

  1. Find the k other observations (donors, neighbors) that are most similar to that observation.
  2. Replace the missing values with an aggregate of the k donors' values (mean, median, or mode).
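
The two steps above can be sketched in base R on a toy table. This is only an illustration, not how any particular package implements it: the data, the helper name `impute_knn_mean`, and the choices of plain Euclidean distance and the mean as the aggregate are all assumptions made for the example. It also assumes the other columns are fully observed and on comparable scales.

```r
# Toy data: one missing value in column A (hypothetical numbers).
df <- data.frame(
  A = c(1, 2, NA, 3, 10),
  B = c(1.0, 2.1, 2.0, 3.9, 9.0),
  C = c(5, 6, 6, 8, 20)
)

# Hypothetical helper: impute one numeric column with the mean of
# the k nearest donors.
impute_knn_mean <- function(data, target, k = 3) {
  predictors <- setdiff(names(data), target)
  # Donors are the rows where the target is observed (fixed up front,
  # so imputed values are never reused as donors).
  donors <- which(!is.na(data[[target]]))
  for (i in which(is.na(data[[target]]))) {
    # Step 1: Euclidean distance from row i to every donor row,
    # computed on the predictor columns.
    diffs <- sweep(as.matrix(data[donors, predictors, drop = FALSE]), 2,
                   unlist(data[i, predictors]))
    d <- sqrt(rowSums(diffs^2))
    nearest <- donors[order(d)[seq_len(k)]]
    # Step 2: aggregate the donors' values (here: the mean).
    data[i, target] <- mean(data[[target]][nearest])
  }
  data
}

df_imp <- impute_knn_mean(df, target = "A", k = 3)
df_imp$A[3]  # mean of A in rows 2, 1 and 4: (2 + 1 + 3) / 3 = 2
```

Row 5 is far from the incomplete row in both B and C, so it is not among the three donors and its extreme value of A does not affect the result.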

Distance measures

The distance between two observations a and b:

Euclidean distance for n numeric variables:

$\sqrt{\sum_{i=1}^{n} (a_i - b_i)^{2}}$

Manhattan distance for f ordered factor variables (levels coded as integer ranks):

$\sum_{i=1}^{f} |a_i - b_i|$

Hamming distance for c categorical variables:

$\sum_{i=1}^{c} I(a_i \neq b_i)$

A coordinate system with two points connected with a straight line.

A coordinate system with two points connected with two perpendicular lines, as though the points were two opposite corners of a rectangle.
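
As a quick check of the three formulas, they can be written as one-line R functions. The function names and the example vectors are made up for illustration.

```r
# Direct translations of the three distance formulas.
euclidean <- function(a, b) sqrt(sum((a - b)^2))
manhattan <- function(a, b) sum(abs(a - b))
hamming   <- function(a, b) sum(a != b)

# Two points in the plane: the straight line vs. the two perpendicular legs.
d_euc <- euclidean(c(1, 2), c(4, 6))  # sqrt(3^2 + 4^2) = 5
d_man <- manhattan(c(1, 2), c(4, 6))  # 3 + 4 = 7

# Two observations described by three categorical variables:
# they differ in the second and third one.
d_ham <- hamming(c("male", "yes", "urban"), c("male", "no", "rural"))  # 2
```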


Gower distance

A mocked data frame containing three types of variables, each highlighted with a different color: numeric, factor and categorical variables. Each variable type has an arrow pointing to the corresponding distance measure: Euclidean, Manhattan and Hamming distance, respectively. The three distance measures point to an ellipse labeled with the Gower distance, which is a combination of the three.
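
To make the idea of combining per-type distances concrete, here is a small worked sketch for two observations. It assumes one common formulation of the Gower distance, in which each variable's contribution is scaled to the range 0 to 1 and the contributions are averaged; the exact scaling and weighting used inside a given package may differ. All values, ranges and variable names are hypothetical. With a single numeric variable, the Euclidean distance reduces to the absolute difference used below.

```r
# Two hypothetical observations with one variable of each type.
# activity holds the rank of an ordered factor (1 = low, 2 = medium, 3 = high).
a <- list(weight = 70, activity = 1L, gender = "male")
b <- list(weight = 80, activity = 3L, gender = "female")

# Assumed ranges (max - min) of the variables over the full data set.
weight_range   <- 40
activity_range <- 2   # ranks run from 1 to 3

d_numeric     <- abs(a$weight - b$weight) / weight_range        # 10 / 40 = 0.25
d_ordinal     <- abs(a$activity - b$activity) / activity_range  # 2 / 2   = 1
d_categorical <- as.numeric(a$gender != b$gender)               # differ  = 1

# Combine the three scaled contributions with equal weights.
gower <- mean(c(d_numeric, d_ordinal, d_categorical))
gower  # (0.25 + 1 + 1) / 3 = 0.75
```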


kNN imputation in practice

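library(VIM)
# Impute only TotChol and Pulse, each from its 5 nearest neighbors
nhanes_imp <- kNN(nhanes, k = 5, variable = c("TotChol", "Pulse"))
# The added *_imp columns flag which values were imputed
head(nhanes_imp)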
  Age Gender Weight Height Diabetes TotChol Pulse PhysActive TotChol_imp Pulse_imp
1  16   male   73.2  172.0    FALSE    3.00    76       TRUE       FALSE     FALSE
2  17   male   72.3  176.0    FALSE    2.61    74       TRUE       FALSE     FALSE
3  12   male   57.7  158.9    FALSE    4.27    80       TRUE       FALSE     FALSE
4  16   male   88.9  183.3    FALSE    3.62    58       TRUE       FALSE     FALSE
5  13 female   45.1  157.6    FALSE    2.66    92       TRUE       FALSE     FALSE
6  16 female   48.7  158.4    FALSE    4.32    58      FALSE       FALSE     FALSE

Weighting donors

  • Out of the k chosen neighbors for an observation, some are more similar to it than others.
  • We might want to put more weight on closer neighbors when aggregating their values.
  • Aggregate neighbors with a weighted mean, with weights given by the inverted distances to each neighbor.
  • This is only possible for imputing numeric variables.
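library(dplyr)  # provides the %>% pipe
library(VIM)

# Weight each donor by its distance when aggregating numeric values
nhanes_imp <- nhanes %>% 
  kNN(variable = c("TotChol", "Pulse"),
      k = 5,
      numFun = weighted.mean,
      weightDist = TRUE)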
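
The effect of distance weighting can be seen on a few made-up numbers. The sketch below uses plain inverse distances as weights, as described in the bullets above; the donor values and distances are hypothetical, and the exact weights a package derives from the distances may differ.

```r
# Hypothetical values of the variable to impute for k = 3 donors,
# and each donor's distance to the incomplete observation.
donor_values <- c(4.0, 5.0, 6.0)
donor_dist   <- c(0.1, 0.2, 0.4)

# Inverse distances as weights: closer donors count more.
w <- 1 / donor_dist                               # 10, 5, 2.5

imputed_weighted <- weighted.mean(donor_values, w)  # 80 / 17.5, about 4.57
imputed_plain    <- mean(donor_values)              # 5
```

The weighted result is pulled toward 4.0, the value of the closest donor, whereas the plain mean treats all three donors equally.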

Sorting variables

  • The kNN algorithm loops over variables, imputing them one by one.
  • Each time, the distances between observations are recalculated.
  • If the first variable had a lot of missing values, then the distance calculation for the second variable will be based on many imputed values.
  • It is therefore good practice to sort the variables in ascending order by the number of missing values before running kNN.

Sorting variables in practice

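library(dplyr)
library(VIM)

# Variable names ordered from fewest to most missing values
vars_by_NAs <- nhanes %>% 
  is.na() %>%
  colSums() %>%
  sort(decreasing = FALSE) %>% 
  names()
# Reorder the columns, then impute; all_of() selects by a character vector
nhanes_imp <- nhanes %>% 
  select(all_of(vars_by_NAs)) %>% 
  kNN(k = 5)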
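
The ordering step can be checked in isolation on a small made-up data frame, using base R only. The data frame `toy` and its columns are hypothetical.

```r
# Toy data: x has 3 missing values, y has none, z has 1.
toy <- data.frame(
  x = c(NA, NA, NA, 4),
  y = c(1, 2, 3, 4),
  z = c(1, NA, 3, 4)
)

# Count missing values per column and sort the counts in ascending order.
na_counts   <- colSums(is.na(toy))
vars_sorted <- names(sort(na_counts))
vars_sorted  # "y" "z" "x"

# Reorder the columns so the most complete variables come first.
toy_sorted <- toy[, vars_sorted]
```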

Let's practice kNN imputation!
