Omgaan met dubbele data

Data opschonen in PostgreSQL-databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Dubbele data

Een klein decoratief pictogram dat data weergeeft

  • Database moet geen dubbele records opslaan
  • Verspilt opslagruimte
  • Vertekent analyses
Data opschonen in PostgreSQL-databases

Dubbele data detecteren

 camis    |              name               |   boro    | inspection_date | ... 
 ---------+---------------------------------+-----------+-----------------+-----
 ...      | ...                             | ...       | ...             | ...
 41659848 | LA BRISA DEL CIBAO              | Queens    | 01/30/2018      | ...
 40961447 | MESON SEVILLA RESTAURANT        | Manhattan | 03/19/2019      | ...
 50063071 | WA BAR                          | Manhattan | 05/23/2018      | ...
 50034992 | EMPANADAS MONUMENTAL            | Manhattan | 06/21/2019      | ...
 50095871 | ALPHONSO'S PIZZERIA & TRATTORIA | Manhattan | 01/16/2020      | ...
 ...      | ...                             | ...       | ...             | ...
Data opschonen in PostgreSQL-databases

Dubbele data detecteren

SELECT
  camis
FROM
  restaurant_inspection
GROUP BY
  camis
HAVING
  COUNT(*) > 1;
579
Data opschonen in PostgreSQL-databases

Dubbele data detecteren

SELECT
  camis,
  name,
  boro
FROM
  restaurant_inspection
GROUP BY
  camis, name, boro
HAVING
  COUNT(*) > 1;
579
SELECT
  camis,
  name,
  boro,
  inspection_date
FROM
  restaurant_inspection
GROUP BY
  camis, name, boro, inspection_date
HAVING
  COUNT(*) > 1;
83
Data opschonen in PostgreSQL-databases

Dubbele data detecteren

SELECT
  camis,
  name,
  boro,
  inspection_date,
  violation_code
FROM
  restaurant_inspection
GROUP BY
  camis, name, boro, inspection_date, violation_code
HAVING
  COUNT(*) > 1;
0
Data opschonen in PostgreSQL-databases

Dubbele data detecteren

 camis    |           name           |   boro    | inspection_date | violation_code | ... 
 ---------+--------------------------+-----------+-----------------+----------------+-----
 ...      | ...                      | ...       | ...             | ...            | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | ...
 40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019      | 10F            | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | ...
 ...      | ...                      | ...       | ...             | ...            | ...

Data opschonen in PostgreSQL-databases

De functie ROW_NUMBER()

ROW_NUMBER() OVER()

ROW_NUMBER() OVER(
  PARTITION BY 
      col1, col2, ...
  ORDER BY 
      colA, colB, ...
)
 camis    |           name           |   boro    | inspection_date | violation_code | row_number | ... 
 ---------+--------------------------+-----------+-----------------+----------------+------------+----
 ...      | ...                      | ...       | ...             | ...            | ...        | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 1          | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 2          | ...
 40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019      | 10F            | 1          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 1          | ...
 ...      | ...                      | ...       | ...             | ...            | ...        |...

Data opschonen in PostgreSQL-databases

Dubbelen nummeren

SELECT
  camis,
  name,
  boro,
  inspection_date,
  violation_code,
  ROW_NUMBER() OVER(
      PARTITION BY 
        camis, 
        name, 
        boro, 
        inspection_date, 
        violation_code
  ) - 1 AS duplicate
FROM
  restaurant_inspection;

Een tabel met resultaten met twee groepen dubbele data voor de kolommen camis, name, boro, inspection_date, violation_code en duplicate

Data opschonen in PostgreSQL-databases

Dubbelen nummeren

SELECT
  camis, name, boro, inspection_date, violation_code,
  ROW_NUMBER() OVER(
      PARTITION BY camis, name, boro, inspection_date, violation_code
  ) - 1 AS duplicate
FROM
  restaurant_inspection;
 camis    |           name           |   boro    | inspection_date | violation_code | duplicate  | ...
 ---------+--------------------------+-----------+-----------------+----------------+------------+----
  ...     | ...                      | ...       | ...             | ...            | ...        | ...
 40961447 | MESON SEVILLA RESTAURANT | Manhattan | 03/19/2019      | 10F            | 0          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 0          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 1          | ...
 41630358 | FAY DA BAKERY            | Queens    | 03/07/2019      | 06E            | 2          | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 0          | ...
 41659848 | LA BRISA DEL CIBAO       | Queens    | 01/30/2018      | 04L            | 1          | ...
  ...     | ...                      | ...       | ...             | ...            | ...        | ...
Data opschonen in PostgreSQL-databases

Impliciete dubbelen oplossen

Impliciete dubbelen - kolomwaarden zijn dubbel met onduidelijkheid door afwijkende waarden

 camis   |      name       | inspection_date | violation_code | score | ... 
 --------+-----------------+-----------------+----------------+-------+-----
 ...      | ...             | ...            | ...            | ...   | ... 
 50038736 | DON NICO'S      | 03/29/2018     | 09B            | 26    | ...
 50038736 | DON NICO'S      | 03/29/2018     | 09B            | 18    | ...
 50033304 | ASTORIA PIZZA   | 12/18/2019     | 02B            | 16    | ...
 50081658 | IRVING FARMS    | 12/13/2018     | 06F            | 9     | ...
 50033733 | ICHIBANTEI      | 02/12/2019     | 10B            | 12    | ...
 ...      | ...             | ...            | ...            | ...   | ...
Data opschonen in PostgreSQL-databases

Impliciete dubbelen oplossen

Vervanging berekenen met aggregaatfunctie (AVERAGE(), MIN(), MAX(), etc.)

SELECT 
    camis, 
    name, 
    inspection_date, 
    violation_code, 
    AVG(score) AS score
FROM 
    restaurant_inspection 
GROUP BY 
    camis, 
    name, 
    inspection_date, 
    violation_code 
HAVING 
    COUNT(*) > 1;
Data opschonen in PostgreSQL-databases

Impliciete dubbelen oplossen

 camis    |      name       | inspection_date | violation_code | score | ... 
 ---------+-----------------+-----------------+----------------+-------+-----
 ...      | ...             | ...             | ...            | ...   | ... 
 50038736 | DON NICO'S      | 03/29/2018      | 09B            | 22.0  | ...
 ...      | ...             | ...             | ...            | ...   | ... 
Data opschonen in PostgreSQL-databases

Laten we oefenen!

Data opschonen in PostgreSQL-databases

Preparing Video For Download...