Handling missing data

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Missing data (an example)

 ... |       name       | score |               inspection_type                | ... 
 ----+------------------+-------+----------------------------------------------+-----
 ... | ...              | ...   | ...                                          | ...
 ... | SCHNIPPERS       | 27    | Cycle Inspection / Initial Inspection        | ...
 ... | ATOMIC WINGS     |       | Administrative Miscellaneous / Re-inspection | ...
 ... | WING LING        | 44    | Cycle Inspection / Initial Inspection        | ...
 ... | JUAN VALDEZ CAFE | 24    | Cycle Inspection / Initial Inspection        | ...
 ... | FULTON GRAND     | 22    | Cycle Inspection / Initial Inspection        | ...
 ... | ...              | ...   | ...                                          | ...

Representations for missing values:

  • NULL (general)
  • '' - empty string (used for string columns)
Cleaning Data in PostgreSQL Databases

Causes of missing data

What causes missing data? Black puzzle with missing piece

  • An image showing human brain in a head silhouette human error
  • An image of gears systematic issues
Cleaning Data in PostgreSQL Databases

Types of missing data

Three images of the different types of data with each category and abbreviation listed

Cleaning Data in PostgreSQL Databases

Types of missing data

Three images of the different types of data with each category and abbreviation listed with the description of Missing Completely at Random added

Cleaning Data in PostgreSQL Databases

Types of missing data

Three images of the different types of data with each category and abbreviation listed with the description of Missing at Random added

Cleaning Data in PostgreSQL Databases

Types of missing data

 ... |       name       | score |               inspection_type                | ... 
 ----+------------------+-------+----------------------------------------------+-----
 ... | ...              | ...   | ...                                          | ...
 ... | SCHNIPPERS       | 27    | Cycle Inspection / Initial Inspection        | ...
 ... | ATOMIC WINGS     |       | Administrative Miscellaneous / Re-inspection | ...
 ... | WING LING        | 44    | Cycle Inspection / Initial Inspection        | ...
 ... | JUAN VALDEZ CAFE | 24    | Cycle Inspection / Initial Inspection        | ...
 ... | FULTON GRAND     | 22    | Cycle Inspection / Initial Inspection        | ...
 ... | ...              | ...   | ...                                          | ...
Cleaning Data in PostgreSQL Databases

Types of missing data

Three images of the different types of data with each category and abbreviation listed with the description of Missing Not at Random added

Cleaning Data in PostgreSQL Databases

Identifying missing data

SELECT
  *
FROM
  restaurant_inspection
WHERE
  score IS NULL;
SELECT
  COUNT(*)
FROM
  restaurant_inspection
WHERE
  score IS NULL;
Cleaning Data in PostgreSQL Databases

Identifying missing data

SELECT
  inspection_type,
  COUNT(*) as count
FROM
  restaurant_inspection
WHERE
  score IS NULL
GROUP BY
  inspection_type
ORDER BY
  count DESC;
                  inspection_type                  | count 
 --------------------------------------------------+-------
 Administrative Miscellaneous / Initial Inspection |   104
 Smoke-Free Air Act / Initial Inspection           |    29
 Calorie Posting / Initial Inspection              |    22
 Administrative Miscellaneous / Re-inspection      |    22
 Trans Fat / Initial Inspection                    |    18
 Smoke-Free Air Act / Re-inspection                |     7
 Trans Fat / Re-inspection                         |     3
Cleaning Data in PostgreSQL Databases

Rectifying missing data

  • Best option: locate and add missing values

    • May not be feasible
    • May not be worthwhile
  • Provide a value (average, median, etc)

  • Exclude records

Cleaning Data in PostgreSQL Databases

Replacing missing values with COALESCE()

COALESCE(arg1, [arg2, ...])

SELECT
  name,
  COALESCE(score, -1),
  inspection_type
FROM
  restaurant_inspection;
Cleaning Data in PostgreSQL Databases

Replacing missing values with COALESCE()

 ... |       name       | score |               inspection_type                | ... 
 ----+------------------+-------+----------------------------------------------+-----
 ... | ...              | ...   | ...                                          | ...
 ... | SCHNIPPERS       | 27    | Cycle Inspection / Initial Inspection        | ...
 ... | ATOMIC WINGS     | -1    | Administrative Miscellaneous / Re-inspection | ...
 ... | WING LING        | 44    | Cycle Inspection / Initial Inspection        | ...
 ... | JUAN VALDEZ CAFE | 24    | Cycle Inspection / Initial Inspection        | ...
 ... | FULTON GRAND     | 22    | Cycle Inspection / Initial Inspection        | ...
 ... | ...              | ...   | ...                                          | ...
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...