Matching similar strings

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Similar strings (example)

  • Delivery address: 121 Fontainebleau Drive
  • Fountainbleau, Fontainbleu, Fontainblue
Cleaning Data in PostgreSQL Databases

The Soundex algorithm

  • Words represented by sound
  • Encodes words using 4 characters
  • Fountainbleau, Fontainbleu, FontainblueF535
Cleaning Data in PostgreSQL Databases

SOUNDEX() in PostgreSQL

  • Available through fuzzystrmatch module
CREATE EXTENSION fuzzystrmatch;

SOUNDEX(input_string) → 4 character code

SELECT
  SOUNDEX('Fountainbleau') as sd1,
  SOUNDEX('Fontainebleau') as sd2,
  SOUNDEX('Fontaineblue') as sd3;
 sd1  | sd2  | sd3  
 -----+------+------
 F535 | F535 | F535
Cleaning Data in PostgreSQL Databases

The DIFFERENCE() function

DIFFERENCE(string1, string2) → 0, 1, 2, 3, or 4

SELECT SOUNDEX('pair') AS sd_pair, SOUNDEX('pear') AS sd_pear;
 sd_pair | sd_pear 
 --------+---------
 P600    | P600
SELECT DIFFERENCE('pair', 'pear') AS diff;
 diff 
 ----
    4
Cleaning Data in PostgreSQL Databases

The DIFFERENCE() function

SELECT SOUNDEX('bow') AS sd_bow, SOUNDEX('bough') AS sd_bough;
 sd_bout | sd_bought 
 --------+-----------
 B300    | B230
SELECT DIFFERENCE('bout', 'bought') AS diff
 diff 
 ----
    2
Cleaning Data in PostgreSQL Databases

Using DIFFERENCE()

 name                   |   boro    | building |            street            
 -----------------------+-----------+----------+------------------------------
 ATOMIC WINGS           | Manhattan | 2090     | FREDERICK DOUGLASS BOULEVARD
 BARAKA BUFFET          | Manhattan | 2546     | FREDERICK DOUGLASS BOULEVARD
 CHOCOLAT               | Manhattan | 2217     | FREDERICK DOUGLASS BOULEVARD
 ESO                    | Manhattan | 2906     | FREDERICK DOUGLASS BOULEVARD
 HOP HOUSE HARLEM       | Manhattan | 2224     | FREDERICK DOUGLASS BOULEVARD
 HOT POT UNDER DE' TREE | Manhattan | 2839     | FREDERICK DOUGLAS BOULEVARD
 LIDO                   | Manhattan | 2168     | FREDERICK DOUGLAS BOULEVARD
 MESS HALL              | Manhattan | 2194     | FRDRCK DGLS BLVD
 VINATERIA              | Manhattan | 2211     | FREDERICK DOUGLAS BOULEVARD
Cleaning Data in PostgreSQL Databases

Using DIFFERENCE()

name                    |   boro    | building |            street            | sd_street 
 -----------------------+-----------+----------+------------------------------+-----------
 ATOMIC WINGS           | Manhattan | 2090     | FREDERICK DOUGLASS BOULEVARD | F636
 BARAKA BUFFET          | Manhattan | 2546     | FREDERICK DOUGLASS BOULEVARD | F636
 CHOCOLAT               | Manhattan | 2217     | FREDERICK DOUGLASS BOULEVARD | F636
 ESO                    | Manhattan | 2906     | FREDERICK DOUGLASS BOULEVARD | F636
 HOP HOUSE HARLEM       | Manhattan | 2224     | FREDERICK DOUGLASS BOULEVARD | F636
 HOT POT UNDER DE' TREE | Manhattan | 2839     | FREDERICK DOUGLAS BOULEVARD  | F636
 LIDO                   | Manhattan | 2168     | FREDERICK DOUGLAS BOULEVARD  | F636
 MESS HALL              | Manhattan | 2194     | FRDRCK DGLS BLVD             | F636
 VINATERIA              | Manhattan | 2211     | FREDERICK DOUGLAS BOULEVARD  | F636
SELECT
  name, boro, building, street
FROM
  restaurant_inspections
WHERE
  DIFFERENCE(street, 'Frederick Douglass Boulevard') = 4;
Cleaning Data in PostgreSQL Databases

Updating the recordings

UPDATE 
    table_name
SET 
    column_name = value
WHERE
    condition
UPDATE 
    restaurant_inspection 
SET 
    street = 'Frederick Douglass Boulevard' 
WHERE 
    DIFFERENCE(street, 'Frederick Douglass Boulevard') = 4;
UPDATE 10
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...