Detecting inconsistent data

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Inconsistent data

  • Certain restaurant inspection rules
  • score corresponds to number of violations 2_4_hazard
    camis    |              name               |  score | ... 
    ---------+---------------------------------+--------+-----
    ...      | ...                             | ...    | ...                                   
    41659848 | LA BRISA DEL CIBAO              | 20     | ...
    40961447 | MESON SEVILLA RESTAURANT        | 50     | ...
    50063071 | WA BAR                          | 15     | ...
    ...      | ...                             | ...    | ...
    
  • A (0 to 13), B (14 to 27), C (28+)
  • Scenarios for grades:
    • A on initial inspection
    • Re-inspection with A, B, or C
1 https://www1.nyc.gov/assets/doh/downloads/pdf/rii/restaurant-grading-faq.pdf
Cleaning Data in PostgreSQL Databases

Checking rules with SQL

  • Interdependent can introduce inconsistency
  • Rules can be encoded in SQL
  • A given for score from 0 to 13
SELECT 
    camis, 
    grade, 
    grade_date, 
    score, 
    inspection_type 
FROM 
    restaurant_inspection 
WHERE 
    grade = 'A' AND 
    score NOT BETWEEN 0 AND 13;
0
Cleaning Data in PostgreSQL Databases

Checking rules with SQL

  • B given for score from 14 to 27
SELECT 
    camis, 
    grade, 
    grade_date, 
    score, 
    inspection_type 
FROM 
    restaurant_inspection 
WHERE 
    grade = 'B' AND 
    score NOT BETWEEN 14 AND 27;
 camis    | grade | grade_date | score |         inspection_type          
 ---------+-------+------------+-------+----------------------------------
 50034653 | B     | 12/06/2019 | -1    | Cycle Inspection / Re-inspection
Cleaning Data in PostgreSQL Databases

Checking rules with SQL

SELECT 
    camis, grade, grade_date, score, inspection_type FROM 
    restaurant_inspection 
WHERE 
    (grade = 'A' OR grade = 'B' OR grade = 'C') AND
    inspection_type LIKE '%Reopening%';
 camis    | grade | grade_date | score |                 inspection_type                 | ... 
 ---------+-------+------------+-------+-------------------------------------------------+-----
 ...      | ...   | ...        | ...   | ...                                             | ...
 50005784 | C     | 05/29/2019 | 14    | Cycle Inspection / Reopening Inspection         | ...
 50091190 | C     | 07/12/2019 | 7     | Pre-permit (Operational) / Reopening Inspection | ...
 40395023 | C     | 09/13/2019 | 8     | Cycle Inspection / Reopening Inspection         | ...
 50037770 | C     | 10/26/2018 | 11    | Cycle Inspection / Reopening Inspection         | ...
 50036406 | C     | 07/10/2018 | 20    | Cycle Inspection / Reopening Inspection         | ...
 ...      | ...   | ...        | ...   | ...                                             | ...
1 https://www1.nyc.gov/assets/doh/downloads/pdf/rii/restaurant-grading-faq.pdf
Cleaning Data in PostgreSQL Databases

Data cleaning insights

  • Diversity of approaches
  • Careful thought required
  • Domain knowledge is key
    • Which values are valid
    • Reasons for duplication
    • Appropriate fill-in values
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...