Cleaning Data in PostgreSQL Databases
Darryl Reeves, Ph.D.
Industry Assistant Professor, New York University
121 Fontainebleau Drive
Fountainbleau
, Fontainbleu
, Fontainblue
Fountainbleau
, Fontainbleu
, Fontainblue
→ F535
fuzzystrmatch
moduleCREATE EXTENSION fuzzystrmatch;
SOUNDEX(input_string)
→ 4 character code
SELECT
SOUNDEX('Fountainbleau') as sd1,
SOUNDEX('Fontainebleau') as sd2,
SOUNDEX('Fontaineblue') as sd3;
sd1 | sd2 | sd3
-----+------+------
F535 | F535 | F535
DIFFERENCE(string1, string2)
→ 0, 1, 2, 3, or 4
SELECT SOUNDEX('pair') AS sd_pair, SOUNDEX('pear') AS sd_pear;
sd_pair | sd_pear
--------+---------
P600 | P600
SELECT DIFFERENCE('pair', 'pear') AS diff;
diff
----
4
SELECT SOUNDEX('bow') AS sd_bow, SOUNDEX('bough') AS sd_bough;
sd_bout | sd_bought
--------+-----------
B300 | B230
SELECT DIFFERENCE('bout', 'bought') AS diff
diff
----
2
name | boro | building | street
-----------------------+-----------+----------+------------------------------
ATOMIC WINGS | Manhattan | 2090 | FREDERICK DOUGLASS BOULEVARD
BARAKA BUFFET | Manhattan | 2546 | FREDERICK DOUGLASS BOULEVARD
CHOCOLAT | Manhattan | 2217 | FREDERICK DOUGLASS BOULEVARD
ESO | Manhattan | 2906 | FREDERICK DOUGLASS BOULEVARD
HOP HOUSE HARLEM | Manhattan | 2224 | FREDERICK DOUGLASS BOULEVARD
HOT POT UNDER DE' TREE | Manhattan | 2839 | FREDERICK DOUGLAS BOULEVARD
LIDO | Manhattan | 2168 | FREDERICK DOUGLAS BOULEVARD
MESS HALL | Manhattan | 2194 | FRDRCK DGLS BLVD
VINATERIA | Manhattan | 2211 | FREDERICK DOUGLAS BOULEVARD
name | boro | building | street | sd_street
-----------------------+-----------+----------+------------------------------+-----------
ATOMIC WINGS | Manhattan | 2090 | FREDERICK DOUGLASS BOULEVARD | F636
BARAKA BUFFET | Manhattan | 2546 | FREDERICK DOUGLASS BOULEVARD | F636
CHOCOLAT | Manhattan | 2217 | FREDERICK DOUGLASS BOULEVARD | F636
ESO | Manhattan | 2906 | FREDERICK DOUGLASS BOULEVARD | F636
HOP HOUSE HARLEM | Manhattan | 2224 | FREDERICK DOUGLASS BOULEVARD | F636
HOT POT UNDER DE' TREE | Manhattan | 2839 | FREDERICK DOUGLAS BOULEVARD | F636
LIDO | Manhattan | 2168 | FREDERICK DOUGLAS BOULEVARD | F636
MESS HALL | Manhattan | 2194 | FRDRCK DGLS BLVD | F636
VINATERIA | Manhattan | 2211 | FREDERICK DOUGLAS BOULEVARD | F636
SELECT
name, boro, building, street
FROM
restaurant_inspections
WHERE
DIFFERENCE(street, 'Frederick Douglass Boulevard') = 4;
UPDATE
table_name
SET
column_name = value
WHERE
condition
UPDATE
restaurant_inspection
SET
street = 'Frederick Douglass Boulevard'
WHERE
DIFFERENCE(street, 'Frederick Douglass Boulevard') = 4;
UPDATE 10
Cleaning Data in PostgreSQL Databases