Pattern matching

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Identifying patterns: an example

 camis    |              name               | inspection_date | score | nta  | ... 
 ---------+---------------------------------+-----------------+-------+------+-----
 ...      | ...                             | ...             | ...   | ...  | ...
 41659848 | LA BRISA DEL CIBAO              | 01/30/2018      | 20    | QN26 | ...
 40961447 | MESON SEVILLA RESTAURANT        | 03/19/2019      | 50    | MN15 | ...
 50063071 | WA BAR                          | 05/23/2018      | 15    | MN17 | ...
 50034992 | EMPANADAS MONUMENTAL            | 06/21/2019      | 17    | MN35 | ...
 50095871 | ALPHONSO'S PIZZERIA & TRATTORIA | 01/16/2020      | 10    | MN28 | ...
 41104041 | THE SPARROW TAVERN              | 09/17/2019      | 13    | QN72 | ...
 50016937 | BURGER KING                     | 09/14/2018      | 12    | QN55 | ...
 50066469 | DARBAR'S CHICKEN & RIBS         | 08/07/2017      | 11    | QN55 | ...
 41195691 | F & J PINE RESTAURANT           | 05/02/2019      | 26    | BX49 | ...
 50015706 | EL RINCONCITO DE LOS SABORES    | 12/18/2019      | 20    | QN35 | ...
 ...      | ...                             | ...             | ..    | ...  | ...
1 https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page
Cleaning Data in PostgreSQL Databases

Identifying patterns: an example

Valid Prefixes to NTA Code

  • MN - Manhattan
  • BK - Brooklyn
  • BX - Bronx
  • QN - Queens
  • SI - Staten Island
Cleaning Data in PostgreSQL Databases

Identifying patterns: an example

Table results with a red box highlighting an invalid NTA code

Cleaning Data in PostgreSQL Databases

The LIKE operator

SELECT * FROM restaurant_inspection WHERE nta LIKE 'QN544';
SELECT * FROM restaurant_inspection WHERE nta = 'QN544';
Cleaning Data in PostgreSQL Databases

The LIKE Operator

Pattern Matching Characters

  • % - matches any sequence of zero or more characters

  • _ (underscore) - matches a single character

SELECT 
* 
FROM 
restaurant_inspection 
WHERE nta LIKE 'QN%';

 

 

 

SELECT 
* 
FROM 
restaurant_inspection 
WHERE nta LIKE 'QN%' 
AND nta NOT LIKE 'QN__';
Cleaning Data in PostgreSQL Databases

Regular expressions (REs)

  • Pattern matching with LIKE is limited
  • More specific patterns can be useful
  • Regular Expressions (REs) enable more expressive pattern matching
Cleaning Data in PostgreSQL Databases

The SIMILAR TO operator

  • Provides additional pattern matching functionality
    SELECT 
      camis, name, inspection_date, score, nta 
    FROM 
      restaurant_inspection 
    WHERE nta SIMILAR TO 'QN%' AND nta NOT SIMILAR TO 'QN__';
    
Cleaning Data in PostgreSQL Databases

Basics of REs

Metacharacter Usage Example RE Example Match
\d matches a digit (0-9) \d\d\d '345'
? matches 0 or 1 of previous character x\d? 'x5'
+ matches one or more of previous character \d+ '10'
* matches any character 0 or more times \d* '3081'
[] matches any character inside of the brackets [a-z] 'f'
Cleaning Data in PostgreSQL Databases

Using REs with SIMILAR TO

SELECT 
    camis, name, inspection_date, score, nta 
FROM 
    restaurant_inspection 
WHERE nta SIMILAR TO 'QN%' AND nta NOT SIMILAR TO 'QN__';
SELECT 
    camis, name, inspection_date, score, nta 
FROM 
    restaurant_inspection 
WHERE 
    nta NOT SIMILAR TO '[A-Z][A-Z]\d\d';
 camis    |        name        | inspection_date | score | nta  
 ---------+--------------------+-----------------+-------+------
 41659848 | LA BRISA DEL CIBAO | 01/30/2018      | 20    | Q26
 41104041 | THE SPARROW TAVERN | 09/17/2019      | 13    | QN723
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...