Imputing missing values

Manipulating Time Series Data in R

Harrison Brown

Graduate Researcher in Geography

Regular and irregular time series

Regular time series:

  • No missing or NA values
  • Evenly spaced observations
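
Both conditions can be checked directly. The sketch below assumes a zoo series; the object `z` and its dates are made up for illustration and do not come from the course data.

```r
library(zoo)

# Hypothetical daily series: one NA value and one skipped date (2017-01-04)
dates <- as.Date(c("2017-01-01", "2017-01-02", "2017-01-03", "2017-01-05"))
z <- zoo(c(2, NA, 2, 4), order.by = dates)

# Condition 1: no missing values
has_na <- anyNA(coredata(z))

# Condition 2: even spacing, i.e. all gaps between timestamps are equal
gaps <- as.numeric(diff(index(z)))
evenly_spaced <- length(unique(gaps)) == 1

has_na         # TRUE: the series contains an NA
evenly_spaced  # FALSE: the gaps are 1, 1 and 2 days
```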

Real-world issues:

  • Sensor and equipment failure
  • Weather conditions
  • ...

Aggregation:

  • Resamples data to lower temporal resolution
  • Reduces information
  • e.g. monthly sum of daily values
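
A minimal sketch of the monthly-sum example, assuming the daily data are stored as a zoo series. The series `daily` is hypothetical, and `aggregate()` with `as.yearmon` is one of several ways to do this.

```r
library(zoo)

# Hypothetical daily series: the value 1 on every day of January and February 2017
days <- seq(as.Date("2017-01-01"), as.Date("2017-02-28"), by = "day")
daily <- zoo(rep(1, length(days)), order.by = days)

# Group the daily values by year-month and sum within each group
monthly <- aggregate(daily, as.yearmon, sum)

monthly  # two values: 31 for January 2017 and 28 for February 2017
```

The 59 daily values collapse into 2 monthly values, which is the loss of information noted above.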

Imputation:

  • Fills in missing values
  • Different methods for determining values

Imputation

Graph of a portion of the Mauna Loa dataset. This version of the graph has missing points in the data, which leads to 'holes' or gaps in the line on the plot. These holes represent NA values in the data.

This graph is a zoomed-in version of the previous plot of the Mauna Loa dataset. The gaps in the line plot are clearer.

Imputing values with zoo

The na.*() functions from zoo:

  • zoo::na.fill()

  • zoo::na.locf()

  • zoo::na.approx()
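
To compare the three functions side by side, here is a small sketch on a made-up series (`z` is hypothetical, not course data) with two interior NA values.

```r
library(zoo)

# Hypothetical series with two interior NA values
z <- zoo(c(1, NA, NA, 4, 5),
         order.by = as.Date("2017-01-01") + 0:4)

filled <- na.fill(z, fill = 0)  # constant fill:            1 0 0 4 5
locf   <- na.locf(z)            # last obs carried forward: 1 1 1 4 5
interp <- na.approx(z)          # linear interpolation:     1 2 3 4 5
```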

Determining missing values

observations
2017-01-01 NA
2017-01-02  2
2017-01-03  2
2017-01-04  2
2017-01-05  4
2017-01-06  2
2017-01-07 NA
2017-01-08  1
2017-01-09  2
2017-01-10  2
...
sum(is.na(observations))
[1] 23
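
Beyond counting, it is often useful to know where the NA values fall. The sketch below rebuilds only the ten rows printed above (the rest of the series is elided, so `obs` is a stand-in, not the full `observations` object).

```r
library(zoo)

# First ten rows of the printed series; the remaining rows are not shown in the source
obs <- zoo(c(NA, 2, 2, 2, 4, 2, NA, 1, 2, 2),
           order.by = as.Date("2017-01-01") + 0:9)

is_missing   <- is.na(coredata(obs))
n_missing    <- sum(is_missing)        # number of NA values: 2
prop_missing <- mean(is_missing)       # share of NA values: 0.2
missing_days <- index(obs)[is_missing] # dates with NA: 2017-01-01 and 2017-01-07
```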

na.fill

observations
2017-01-01 NA
2017-01-02  2
2017-01-03  2
2017-01-04  2
2017-01-05  4
...
table(observations, useNA = 'ifany')
   1    2    3    4    5    6 <NA> 
  43   29   17   13    1    2   23

A graph of a fictional time series where the y-axis represents the number of 'observations' each day. There are gaps between observations, which are caused by missing, NA values in the dataset.

library(zoo)      # na.fill()
library(ggplot2)  # autoplot()

# Replace every NA with the constant 0
observations_fill <-
  na.fill(object = observations,
          fill = 0)

table(observations_fill)
 0  1  2  3  4  5  6 
23 43 29 17 13  1  2
autoplot(observations_fill)

This graph is a 'filled-in' version of the 'Daily observations' graph - the missing, NA values have been replaced with zeros by using the Constant Fill imputation method.
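
A constant fill changes summary statistics, so zero is probably only a sensible choice when a missing record really means "nothing observed". A small sketch using the ten rows of `observations` printed earlier (`obs` is a stand-in for the full series):

```r
library(zoo)

# Ten printed rows only; the full series is not reproduced here
obs <- zoo(c(NA, 2, 2, 2, 4, 2, NA, 1, 2, 2),
           order.by = as.Date("2017-01-01") + 0:9)
obs_fill <- na.fill(obs, fill = 0)

mean_observed <- mean(coredata(obs), na.rm = TRUE)  # 17 / 8  = 2.125
mean_filled   <- mean(coredata(obs_fill))           # 17 / 10 = 1.7
```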

na.locf

autoplot(scores)
Warning message:
Removed 12 row(s) containing
missing values (geom_path).

Graph of the fictional 'Monthly Test Scores' time series. The line of the graph ends abruptly, indicating that after a certain point, there are missing, NA values in the dataset. The graph depicts fictional test scores from 2005 to 2008, and there is a general upward trend in the data. After 2007, the values in the dataset are missing.

scores_locf <- na.locf(scores)
autoplot(scores_locf)

'Filled-in' version of the 'Monthly Test Scores' graph, using LOCF, or Last Observation Carried Forward. After 2007, the values in the time series are replaced with the most recent non-NA value.
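
The direction of the fill matters. The sketch below uses a made-up series (`scores_toy` is hypothetical) to contrast the default forward fill with `fromLast = TRUE`, which carries the next observation backward instead.

```r
library(zoo)

# Hypothetical monthly scores with one interior gap and a trailing gap
months <- seq(as.Date("2007-08-01"), by = "month", length.out = 5)
scores_toy <- zoo(c(70, NA, 75, NA, NA), order.by = months)

# Forward fill: each NA takes the most recent earlier value
forward <- na.locf(scores_toy)                  # 70 70 75 75 75

# Backward fill: each NA takes the next later value.
# na.rm = FALSE keeps the trailing NAs, which have no later value to borrow.
backward <- na.locf(scores_toy, fromLast = TRUE,
                    na.rm = FALSE)              # 70 75 75 NA NA
```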

Linear interpolation

na.approx()

Slide 1 in an 'animation' showing the conceptual process of Linear Interpolation. The image depicts a conceptual graph, with x and y axes, and values plotted on a line. There is a large 'gap' between a set of values on the left and a set of values on the right, where the values of the data are NA.

Slide 2 in an 'animation' showing the conceptual process of Linear Interpolation. The gap between the two halves of the data is bridged by a dotted red line, indicating that linear interpolation connects the two closest non-NA values on either side of the missing data.

Slide 3 in an 'animation' showing the conceptual process of Linear Interpolation. The red, dotted line is replaced by a solid black line, indicating that the missing values have been replaced by linear interpolation.
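
The straight line in the animation corresponds to the formula y = y0 + (y1 - y0) * (x - x0) / (x1 - x0), where (x0, y0) and (x1, y1) are the nearest non-NA points on either side of the gap. The sketch below works through it on a made-up, unevenly spaced series (`z` is hypothetical) and compares the result with na.approx().

```r
library(zoo)

# Hypothetical series: known values on 1 Jan and 5 Jan, NA on 2 Jan
z <- zoo(c(10, NA, 50),
         order.by = as.Date(c("2017-01-01", "2017-01-02", "2017-01-05")))

# Manual interpolation against the time index (days since the epoch)
x <- as.numeric(index(z))
manual <- 10 + (50 - 10) * (x[2] - x[1]) / (x[3] - x[1])  # 10 + 40 * 1/4 = 20

# na.approx() interpolates against the index by default, so it agrees
interpolated <- as.numeric(coredata(na.approx(z))[2])     # 20
```

Because the missing day sits one quarter of the way between the two known days, it gets one quarter of the change in value, not the midpoint.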

na.approx

maunaloa_approx <-
  na.approx(maunaloa_missing)

autoplot(maunaloa_approx) +
  labs(
    x = "Index",
    y = "CO2 Concentration",
    title = "Approximated Data Points"
  )

Plot of the Mauna Loa time series, where the missing NA values have been filled in with linear interpolation. The former gaps are almost impossible to spot, which suggests that linear interpolation suits a smooth series with short gaps like this one; a smooth-looking line alone does not prove that the imputed values are accurate.
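
One way to gauge how well interpolation recovers values is to hide points whose true values are known and compare. The sketch below uses R's built-in `co2` dataset (monthly Mauna Loa CO2) as a stand-in for the course's `maunaloa_missing` object, which is not available here; the choice of hidden months is arbitrary.

```r
library(zoo)

# Built-in monthly Mauna Loa CO2 series from the datasets package
truth <- as.numeric(co2)

# Hide four consecutive months, then impute them
hidden <- 100:103
masked <- truth
masked[hidden] <- NA
imputed <- na.approx(masked)

# Largest absolute error (in ppm) on the hidden months
max_error <- max(abs(imputed[hidden] - truth[hidden]))
max_error
```

Repeating this for gaps of different lengths and positions would give a fuller picture than a single masked window.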

Let's practice!
