Hot-deck imputation

Handling Missing Data with Imputations in R

Michal Oleszak

Machine Learning Engineer

Hot-deck's history

  • Dates back to the 1950s, when data were stored on punched cards.
  • Browsing through the data back and forth was very slow.
  • U.S. Census Bureau came up with a method requiring only one pass through the data.

A picture of an old punched card used for programming computers.

1 Image source: https://en.wikipedia.org/wiki/Punched_card#/media/File:Used_Punchcard_(5151286161).jpg

Hot-deck imputation

  • For each variable, replace every missing value with the last observed value.
  • Hot-deck refers to the deck of punched cards actually being processed.

Cons

  • Requires data to be MCAR.
  • Vanilla hot-deck can destroy relations between variables.

Pros

  • Fast (only one pass through data).
  • Imputed data are not constant.
  • Simple tricks prevent breaking relations.
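
The one-pass idea above can be sketched in base R. This is only an illustration of the "carry the last observed value forward" rule, not the implementation used by any package; the helper name hotdeck_sketch and the toy weights are made up for this example.

```r
# Hypothetical helper: sequential hot-deck on a single vector.
# It walks through the data once and remembers the last observed value.
hotdeck_sketch <- function(x) {
  last_seen <- NA
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      x[i] <- last_seen   # stays NA if nothing has been observed yet
    } else {
      last_seen <- x[i]   # update the "hot" donor value
    }
  }
  x
}

# Toy data (invented for illustration)
weight <- c(73.2, NA, 57.7, NA, NA, 48.7)
imputed <- hotdeck_sketch(weight)
print(imputed)
```

Each missing entry takes the closest preceding observed value, so the result depends on row order. That is why the tricks on the following slides reorder or group the data first.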

Hot-deck imputation in practice

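library(VIM)  # provides hotdeck()

# Impute Height and Weight; the *_imp columns flag which values were imputed
nhanes_imp <- hotdeck(nhanes, variable = c("Height", "Weight"))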
head(nhanes_imp)
  Age Gender Weight Height Diabetes TotChol Pulse PhysActive Height_imp Weight_imp
1  16   male   73.2  172.0    FALSE    3.00    76       TRUE      FALSE      FALSE
2  17   male   72.3  176.0    FALSE    2.61    74       TRUE      FALSE      FALSE
3  12   male   57.7  158.9    FALSE    4.27    80       TRUE      FALSE      FALSE
4  16   male   88.9  183.3    FALSE    3.62    58       TRUE      FALSE      FALSE
5  13 female   45.1  157.6    FALSE    2.66    92       TRUE      FALSE      FALSE
6  16 female   48.7  180.7    FALSE    4.32    58      FALSE       TRUE      FALSE

Imputing within domains

A table with two columns: PhysActive and Weight. There is one missing value in Weight. Since the rows are in random order, a row with PhysActive=FALSE precedes the one with PhysActive=TRUE and the missing Weight. This shows how hot-deck feeds forward the weight value from a non-active person to an active person.

A table with two columns: PhysActive and Weight. There is one missing value in Weight. Since the data are grouped by PhysActive, hot-deck feeds forward the value of an active person to another active person.

nhanes_imp <- hotdeck(
    nhanes, 
    variable = "Weight", 
    domain_var = "PhysActive"
)
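
To make the effect of domains concrete, here is a minimal base-R sketch on invented data. The helper locf and the toy table are assumptions for illustration only; they mimic the feed-forward rule and do not reproduce how hotdeck() works internally.

```r
# Hypothetical helper: last observation carried forward, vectorised.
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the most recent observed value
  c(NA, x[!is.na(x)])[idx + 1]   # leading NAs stay NA
}

# Toy data (invented): row 3 is an active person with a missing weight
toy <- data.frame(
  PhysActive = c(TRUE, FALSE, TRUE, FALSE),
  Weight     = c(70.1, 95.0, NA, 90.3)
)

# Without domains: the donor is the previous row, a non-active person
plain <- locf(toy$Weight)

# Within domains: ave() applies locf separately for each PhysActive group
within_domain <- ave(toy$Weight, toy$PhysActive, FUN = locf)

print(plain)
print(within_domain)
```

In the plain version the missing weight is filled from the non-active person in row 2, while the within-domain version takes it from the active person in row 1.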

Sorting by correlated variables

A table with two variables: Height and Weight. There is one missing value in Weight. Since the rows are in a random order, a row with a large height precedes the one with a small height and the missing weight. This shows how hot-deck feeds forward the weight value from a tall person to a short person.

A table with two variables: Height and Weight. There is one missing value in Weight. Since the rows are sorted by Height, hot-deck feeds forward the weight value from a short person to another short person.

nhanes_imp <- hotdeck(
    nhanes, 
    variable = "Weight",
    ord_var = "Height"
)
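
The sorting trick can be sketched the same way in base R. The helper locf and the toy heights and weights are invented for illustration; this shows the idea of ordering by a correlated variable before feeding values forward, and is not a description of hotdeck()'s internals.

```r
# Hypothetical helper: last observation carried forward, vectorised.
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the most recent observed value
  c(NA, x[!is.na(x)])[idx + 1]   # leading NAs stay NA
}

# Toy data (invented): row 2 is a short person with a missing weight
toy <- data.frame(
  Height = c(185.0, 160.2, 158.9, 183.3),
  Weight = c(88.9, NA, 57.7, 90.1)
)

# Unsorted: the donor is the tall person in row 1
unsorted <- locf(toy$Weight)

# Sorted by Height: the donor is the person closest below in height
ord <- order(toy$Height)
sorted <- numeric(nrow(toy))
sorted[ord] <- locf(toy$Weight[ord])   # impute in sorted order, restore row order

print(unsorted)
print(sorted)
```

After sorting, the short person receives the weight of another short person (57.7) instead of the tall person's weight (88.9).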

Let's practice hot-deck-imputing!
