Inconsistente data opsporen

Data opschonen in PostgreSQL-databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Inconsistente data

  • Bepaalde regels voor restaurantinspecties
  • score is het aantal overtredingen 2_4_hazard
    camis    |              name               |  score | ... 
    ---------+---------------------------------+--------+-----
    ...      | ...                             | ...    | ...                                   
    41659848 | LA BRISA DEL CIBAO              | 20     | ...
    40961447 | MESON SEVILLA RESTAURANT        | 50     | ...
    50063071 | WA BAR                          | 15     | ...
    ...      | ...                             | ...    | ...
    
  • A (0 t/m 13), B (14 t/m 27), C (28+)
  • Scenario’s voor grades:
    • A bij eerste inspectie
    • Herinspectie met A, B of C
1 https://www1.nyc.gov/assets/doh/downloads/pdf/rii/restaurant-grading-faq.pdf
Data opschonen in PostgreSQL-databases

Regels checken met SQL

  • Onderlinge afhankelijkheid kan inconsistentie geven
  • Regels kun je in SQL coderen
  • A bij score van 0 t/m 13
SELECT 
    camis, 
    grade, 
    grade_date, 
    score, 
    inspection_type 
FROM 
    restaurant_inspection 
WHERE 
    grade = 'A' AND 
    score NOT BETWEEN 0 AND 13;
0
Data opschonen in PostgreSQL-databases

Regels checken met SQL

  • B bij score van 14 t/m 27
SELECT 
    camis, 
    grade, 
    grade_date, 
    score, 
    inspection_type 
FROM 
    restaurant_inspection 
WHERE 
    grade = 'B' AND 
    score NOT BETWEEN 14 AND 27;
 camis    | grade | grade_date | score |         inspection_type          
 ---------+-------+------------+-------+----------------------------------
 50034653 | B     | 12/06/2019 | -1    | Cycle Inspection / Re-inspection
Data opschonen in PostgreSQL-databases

Regels checken met SQL

SELECT 
    camis, grade, grade_date, score, inspection_type FROM 
    restaurant_inspection 
WHERE 
    (grade = 'A' OR grade = 'B' OR grade = 'C') AND
    inspection_type LIKE '%Reopening%';
 camis    | grade | grade_date | score |                 inspection_type                 | ... 
 ---------+-------+------------+-------+-------------------------------------------------+-----
 ...      | ...   | ...        | ...   | ...                                             | ...
 50005784 | C     | 05/29/2019 | 14    | Cycle Inspection / Reopening Inspection         | ...
 50091190 | C     | 07/12/2019 | 7     | Pre-permit (Operational) / Reopening Inspection | ...
 40395023 | C     | 09/13/2019 | 8     | Cycle Inspection / Reopening Inspection         | ...
 50037770 | C     | 10/26/2018 | 11    | Cycle Inspection / Reopening Inspection         | ...
 50036406 | C     | 07/10/2018 | 20    | Cycle Inspection / Reopening Inspection         | ...
 ...      | ...   | ...        | ...   | ...                                             | ...
1 https://www1.nyc.gov/assets/doh/downloads/pdf/rii/restaurant-grading-faq.pdf
Data opschonen in PostgreSQL-databases

Inzichten bij datacleaning

  • Verschillende aanpakken
  • Zorgvuldige afweging nodig
  • Domeinkennis is cruciaal
    • Welke waarden zijn geldig
    • Redenen voor duplicaten
    • Passende invulwaarden
Data opschonen in PostgreSQL-databases

Laten we oefenen!

Data opschonen in PostgreSQL-databases

Preparing Video For Download...