Cleaning Data in PostgreSQL Databases
Darryl Reeves, Ph.D.
Industry Assistant Professor, New York University
camis | name | boro | inspection_date | ...
---------+---------------------------------+-----------+-----------------+-----
... | ... | ... | ... | ...
41659848 | LA BRISA DEL CIBAO | Queens | 01/30/2018 | ...
40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019 | ...
50063071 | WA BAR | Manhattan | 05/23/2018 | ...
50034992 | EMPANADAS MONUMENTAL | Manhattan | 06/21/2019 | ...
50095871 | ALPHONSO'S PIZZERIA & TRATTORIA | Manhattan | 01/16/2020 | ...
... | ... | ... | ... | ...
SELECT
camis
FROM
restaurant_inspection
GROUP BY
camis
HAVING
COUNT(*) > 1;
579
SELECT
camis,
name,
boro
FROM
restaurant_inspection
GROUP BY
camis, name, boro
HAVING
COUNT(*) > 1;
579
SELECT
camis,
name,
boro,
inspection_date
FROM
restaurant_inspection
GROUP BY
camis, name, boro, inspection_date
HAVING
COUNT(*) > 1;
83
SELECT
camis,
name,
boro,
inspection_date,
violation_code
FROM
restaurant_inspection
GROUP BY
camis, name, boro, inspection_date, violation_code
HAVING
COUNT(*) > 1;
0
camis | name | boro | inspection_date | violation_code | ...
---------+--------------------------+-----------+-----------------+----------------+-----
... | ... | ... | ... | ... | ...
41659848 | LA BRISA DEL CIBAO | Queens | 01/30/2018 | 04L | ...
40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019 | 10F | ...
41630358 | FAY DA BAKERY | Queens | 03/07/2019 | 06E | ...
41659848 | LA BRISA DEL CIBAO | Queens | 01/30/2018 | 04L | ...
... | ... | ... | ... | ... | ...
ROW_NUMBER() OVER()
ROW_NUMBER() OVER(
PARTITION BY
col1, col2, ...
ORDER BY
colA, colB, ...
)
camis | name | boro | inspection_date | violation_code | row_number | ...
---------+--------------------------+-----------+-----------------+----------------+------------+----
... | ... | ... | ... | ... | ... | ...
41659848 | LA BRISA DEL CIBAO | Queens | 01/30/2018 | 04L | 1 | ...
41659848 | LA BRISA DEL CIBAO | Queens | 01/30/2018 | 04L | 2 | ...
40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019 | 10F | 1 | ...
41630358 | FAY DA BAKERY | Queens | 03/07/2019 | 06E | 1 | ...
... | ... | ... | ... | ... | ... |...
SELECT
camis,
name,
boro,
inspection_date,
violation_code,
ROW_NUMBER() OVER(
PARTITION BY
camis,
name,
boro,
inspection_date,
violation_code
) - 1 AS duplicate
FROM
restaurant_inspection;
SELECT
camis, name, boro, inspection_date, violation_code,
ROW_NUMBER() OVER(
PARTITION BY camis, name, boro, inspection_date, violation_code
) - 1 AS duplicate
FROM
restaurant_inspection;
camis | name | boro | inspection_date | violation_code | duplicate | ...
---------+--------------------------+-----------+-----------------+----------------+------------+----
... | ... | ... | ... | ... | ... | ...
40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019 | 10F | 0 | ...
41630358 | FAY DA BAKERY | Queens | 03/07/2019 | 06E | 0 | ...
41630358 | FAY DA BAKERY | Queens | 03/07/2019 | 06E | 1 | ...
41630358 | FAY DA BAKERY | Queens | 03/07/2019 | 06E | 2 | ...
41659848 | LA BRISA DEL CIBAO | Queens | 01/30/2018 | 04L | 0 | ...
41659848 | LA BRISA DEL CIBAO | Queens | 01/30/2018 | 04L | 1 | ...
... | ... | ... | ... | ... | ... | ...
Impartial duplicate - column values are duplicated with ambiguity where values differ
camis | name | inspection_date | violation_code | score | ...
--------+-----------------+-----------------+----------------+-------+-----
... | ... | ... | ... | ... | ...
50038736 | DON NICO'S | 03/29/2018 | 09B | 26 | ...
50038736 | DON NICO'S | 03/29/2018 | 09B | 18 | ...
50033304 | ASTORIA PIZZA | 12/18/2019 | 02B | 16 | ...
50081658 | IRVING FARMS | 12/13/2018 | 06F | 9 | ...
50033733 | ICHIBANTEI | 02/12/2019 | 10B | 12 | ...
... | ... | ... | ... | ... | ...
Compute replacement from aggregate function (AVERAGE()
, MIN()
, MAX()
, etc.)
SELECT
camis,
name,
inspection_date,
violation_code,
AVG(score) AS score
FROM
restaurant_inspection
GROUP BY
camis,
name,
inspection_date,
violation_code
HAVING
COUNT(*) > 1;
camis | name | inspection_date | violation_code | score | ...
---------+-----------------+-----------------+----------------+-------+-----
... | ... | ... | ... | ... | ...
50038736 | DON NICO'S | 03/29/2018 | 09B | 22.0 | ...
... | ... | ... | ... | ... | ...
Cleaning Data in PostgreSQL Databases