Handling duplicated data

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Duplicate data

A small decorative image representing data

  • Database should not store duplicate records
  • Wastes storage resources
  • Potentially distorts analysis
Cleaning Data in PostgreSQL Databases

Detecting duplicated data

 camis    |              name               |   boro    | inspection_date | ... 
 ---------+---------------------------------+-----------+-----------------+-----
 ...      | ...                             | ...       | ...             | ...
 41659848 | LA BRISA DEL CIBAO              | Queens    | 01/30/2018      | ...
 40961447 | MESON SEVILLA RESTAURANT        | Manhattan | 03/19/2019      | ...
 50063071 | WA BAR                          | Manhattan | 05/23/2018      | ...
 50034992 | EMPANADAS MONUMENTAL            | Manhattan | 06/21/2019      | ...
 50095871 | ALPHONSO'S PIZZERIA & TRATTORIA | Manhattan | 01/16/2020      | ...
 ...      | ...                             | ...       | ...             | ...
Cleaning Data in PostgreSQL Databases

Detecting duplicated data

SELECT
  camis
FROM
  restaurant_inspection
GROUP BY
  camis
HAVING
  COUNT(*) > 1;
579
Cleaning Data in PostgreSQL Databases

Detecting duplicated data

SELECT
  camis,
  name,
  boro
FROM
  restaurant_inspection
GROUP BY
  camis, name, boro
HAVING
  COUNT(*) > 1;
579
SELECT
  camis,
  name,
  boro,
  inspection_date
FROM
  restaurant_inspection
GROUP BY
  camis, name, boro, inspection_date
HAVING
  COUNT(*) > 1;
83
Cleaning Data in PostgreSQL Databases

Detecting duplicated data

SELECT
  camis,
  name,
  boro,
  inspection_date,
  violation_code
FROM
  restaurant_inspection
GROUP BY
  camis, name, boro, inspection_date, violation_code
HAVING
  COUNT(*) > 1;
0
Cleaning Data in PostgreSQL Databases

Detecting duplicated data

 camis    |           name           |   boro    | inspection_date | violation_code | ... 
 ---------+--------------------------+-----------+-----------------+----------------+-----
 ...      | ...                      | ...       | ...             | ...            | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | ...
 40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019      | 10F            | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | ...
 ...      | ...                      | ...       | ...             | ...            | ...

Cleaning Data in PostgreSQL Databases

The ROW_NUMBER() function

ROW_NUMBER() OVER()

ROW_NUMBER() OVER(
  PARTITION BY 
      col1, col2, ...
  ORDER BY 
      colA, colB, ...
)
 camis    |           name           |   boro    | inspection_date | violation_code | row_number | ... 
 ---------+--------------------------+-----------+-----------------+----------------+------------+----
 ...      | ...                      | ...       | ...             | ...            | ...        | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 1          | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 2          | ...
 40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019      | 10F            | 1          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 1          | ...
 ...      | ...                      | ...       | ...             | ...            | ...        |...

Cleaning Data in PostgreSQL Databases

Enumerating duplicate rows

SELECT
  camis,
  name,
  boro,
  inspection_date,
  violation_code,
  ROW_NUMBER() OVER(
      PARTITION BY 
        camis, 
        name, 
        boro, 
        inspection_date, 
        violation_code
  ) - 1 AS duplicate
FROM
  restaurant_inspection;

A table of results with two groups of duplicated data for the camis, name, boro, inspection_date, violation_code, and duplicate columns

Cleaning Data in PostgreSQL Databases

Enumerating duplicate rows

SELECT
  camis, name, boro, inspection_date, violation_code,
  ROW_NUMBER() OVER(
      PARTITION BY camis, name, boro, inspection_date, violation_code
  ) - 1 AS duplicate
FROM
  restaurant_inspection;
 camis    |           name           |   boro    | inspection_date | violation_code | duplicate  | ...
 ---------+--------------------------+-----------+-----------------+----------------+------------+----
  ...     | ...                      | ...       | ...             | ...            | ...        | ...
 40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019      | 10F            | 0          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 0          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 1          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 2          | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 0          | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 1          | ...
  ...     | ...                      | ...       | ...             | ...            | ...        | ...
Cleaning Data in PostgreSQL Databases

Resolving impartial duplicates

Impartial duplicate - column values are duplicated with ambiguity where values differ

 camis   |      name       | inspection_date | violation_code | score | ... 
 --------+-----------------+-----------------+----------------+-------+-----
 ...      | ...             | ...            | ...            | ...   | ... 
 50038736 | DON NICO'S      | 03/29/2018     | 09B            | 26    | ...
 50038736 | DON NICO'S      | 03/29/2018     | 09B            | 18    | ...
 50033304 | ASTORIA PIZZA   | 12/18/2019     | 02B            | 16    | ...
 50081658 | IRVING FARMS    | 12/13/2018     | 06F            | 9     | ...
 50033733 | ICHIBANTEI      | 02/12/2019     | 10B            | 12    | ...
 ...      | ...             | ...            | ...            | ...   | ...
Cleaning Data in PostgreSQL Databases

Resolving impartial duplicates

Compute replacement from aggregate function (AVERAGE(), MIN(), MAX(), etc.)

SELECT 
    camis, 
    name, 
    inspection_date, 
    violation_code, 
    AVG(score) AS score
FROM 
    restaurant_inspection 
GROUP BY 
    camis, 
    name, 
    inspection_date, 
    violation_code 
HAVING 
    COUNT(*) > 1;
Cleaning Data in PostgreSQL Databases

Resolving impartial duplicates

 camis    |      name       | inspection_date | violation_code | score | ... 
 ---------+-----------------+-----------------+----------------+-------+-----
 ...      | ...             | ...             | ...            | ...   | ... 
 50038736 | DON NICO'S      | 03/29/2018      | 09B            | 22.0  | ...
 ...      | ...             | ...             | ...            | ...   | ... 
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...