Detecting invalid values

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Assistant Professor, Long Island University - Brooklyn

Invalid data values

  camis   |              name               | inspection_date | score |            inspection_type            | ... 
 ---------+---------------------------------+-----------------+-------+---------------------------------------+-----
 ...      | ...                             | ...             | ...   | ...                                   | ...
 41659848 | LA BRISA DEL CIBAO              | 01/30/2018      | 20    | Cycle Inspection / Initial Inspection | ...
 40961447 | MESON SEVILLA RESTAURANT        | 03/19/2019      | 50    | Cycle Inspection / Initial Inspection | ...
 50063071 | WA BAR                          | 05/23/2018      | 15    | Cycle Inspection / Initial Inspection | ...
 50034992 | EMPANADAS MONUMENTAL            | 06/21/2019      | 17    | Cycle Inspection / Re-inspection      | ...
 50095871 | ALPHONSO'S PIZZERIA & TRATTORIA | 01/16/2020      | 10    | Cycle Inspection / Initial Inspection | ...
 ...      | ...                             | ...             | ...   | ...                                   | ...
  camis   |             name             | inspection_date | score |                 inspection_type                 | ... 
 ---------+------------------------------+-----------------+-------+-------------------------------------------------+-----
 ...      | ...                          | ...             | ...   | ...                                             | ...
 41104041 | THE SPARROW TAVERN           | 09/17/2019      | 13    | Cycle Inspection / Initial Inspection           | ...
 50016937 | BURGER KING                  | 09/14/2018      | 12    | Cycle Inspection / Re-inspection                | ...
 50066469 | DARBAR'S CHICKEN & RIBS      | 08/07/2017      | 11    | Pre-permit (Operational) / Reopening Inspection | ...
 41195691 | F & J PINE RESTAURANT        | 05/02/2019      | 26    | Cycle Inspection / Initial Inspection           | ...
 50015706 | EL RINCONCITO DE LOS SABORES | 12/18/2019      | A     | Cycle Inspection / Initial Inspection           | ...
 ...      | ...                          | ...             | ...   | ...                                             | ...
Cleaning Data in PostgreSQL Databases

Handling invalid data with pattern matching

SELECT
  camis,
  name,
  inspection_date,
  score
FROM
  restaurant_inspection
WHERE
  score NOT SIMILAR TO '\d+';
Cleaning Data in PostgreSQL Databases

Handling invalid data with pattern matching

  • Query only restricts non-digit characters
  • No restriction on length of value
SELECT
  camis,
  name,
  inspection_date,
  score
FROM
  restaurant_inspection
WHERE
  score NOT SIMILAR TO '\d{1}' AND
  score NOT SIMILAR TO '\d{2}' AND
  score NOT SIMILAR TO '\d{3}';
Cleaning Data in PostgreSQL Databases

Using type constraints

  • Column contains integer values
  • Column should not allow non-integers
ALTER TABLE restaurant_inspection
ALTER COLUMN score TYPE SMALLINT USING score::smallint;
  • SMALLINT: values from -32,768 to 32,767
  • USING clause specifies conversion of previous values
Cleaning Data in PostgreSQL Databases

Review: Basics of Regular Expressions

Metacharacter Usage Example RE Example Match
\d matches a digit (0-9) \d\d\d '345'
? matches 0 or 1 of previous character x\d? 'x5'
+ matches one or more of previous character \d+ '10'
* matches any character 0 or more times \d* '3081'
[] matches any character inside of the brackets [a-z] 'f'
Cleaning Data in PostgreSQL Databases

Type constraints enable range constraints

ALTER TABLE restaurant_inspection
ALTER COLUMN score TYPE SMALLINT USING score::smallint;

SELECT
  camis,
  name,
  inspection_date,
  score
FROM
  restaurant_inspection
WHERE
  score < 0;
Cleaning Data in PostgreSQL Databases

Type constraints enable range constraints

ALTER TABLE restaurant_inspection
ALTER COLUMN score TYPE SMALLINT USING score::smallint;

SELECT
  camis,
  name,
  inspection_date,
  score
FROM
  restaurant_inspection
WHERE
  score <= -1;
Cleaning Data in PostgreSQL Databases

Type constraints enable range constraints

ALTER TABLE restaurant_inspection
ALTER COLUMN score TYPE SMALLINT USING score::smallint;

SELECT
  camis,
  name,
  inspection_date,
  score
FROM
  restaurant_inspection
WHERE
  score < 0 OR
  score > 100;
Cleaning Data in PostgreSQL Databases

Type constraints enable range constraints

ALTER TABLE restaurant_inspection
ALTER COLUMN score TYPE SMALLINT USING score::smallint;

SELECT
  camis,
  name,
  inspection_date,
  score
FROM
  restaurant_inspection
WHERE
  score < 0 OR
  score >= 101;
Cleaning Data in PostgreSQL Databases

The BETWEEN operator

SELECT
  camis, name, inspection_date, score
FROM
  restaurant_inspection
WHERE
  score NOT BETWEEN 0 AND 100;
  camis   |          name           | inspection_date | score 
 ---------+-------------------------+-----------------+-------
 ...      | ...                     | ...             |   ...
 41702543 | TROPICAL GRILL          | 05/14/2018      |   109
 50074058 | PAD THAI                | 08/01/2018      |   101
 50085349 | DON CHILE MEXICAN GRILL | 12/04/2018      |   124
 50092932 | ENERGY JUICE BAR        | 06/24/2019      |   102
 41702543 | TROPICAL GRILL          | 05/14/2018      |   109
 50034653 | KAI FAN ASIAN CUISINE   | 12/06/2019      |    -1
 ...      | ...                     | ...             |   ...
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...