Introduction to data cleaning

Cleaning Data in PostgreSQL Databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Why is cleaning data important?

  • Data is messy
  • Before it can be analyzed, often needs cleaning
  • Helpful to utilize column type constraints
  • Course focuses on when defensive approaches not available/applicable
Cleaning Data in PostgreSQL Databases

Cleaning string data

  • String data: abundant, flexible, and often messy
 name                            | grade |            inspection_type              | census_tract | ... 
 --------------------------------+-------+-----------------------------------------+--------------+-----
 ...                             | ...   | ...                                     | ...          | ...
 EMPANADAS MONUMENTAL            | B     | Cycle Inspection  /  Re-inspection      | 26900        | ...
 ALPHONSO'S PIZZERIA & TRATTORIA | A     | Cycle Inspection  /  Initial Inspection | 202          | ...
 THE SPARROW TAVERN              | A     | Cycle Inspection  /  Initial Inspection | 12500        | ...
 BURGER KING                     | A     | Cycle Inspection  /  Re-inspection      | 86400        | ...
 ASTORIA PIZZA                   | B     | Cycle Inspection  /  Re-inspection      | 6300         | ...
 ...                             | ...   | ...                                     | ...          | ...
Cleaning Data in PostgreSQL Databases

Cleaning string data

 name                            | grade |            inspection_type              | census_tract | ... 
 --------------------------------+-------+-----------------------------------------+--------------+-----
 ...                             | ...   | ...                                     | ...          | ...
 EMPANADAS MONUMENTAL            | B     | Cycle Inspection  /  Re-inspection      | 26900        | ...
 ALPHONSO'S PIZZERIA & TRATTORIA | A     | Cycle Inspection  /  Initial Inspection | 202          | ...
 THE SPARROW TAVERN              | A     | Cycle Inspection  /  Initial Inspection | 12500        | ...
 BURGER KING                     | A     | Cycle Inspection  /  Re-inspection      | 86400        | ...
 ASTORIA PIZZA                   | B     | Cycle Inspection  /  Re-inspection      | 6300         | ...
 ...                             | ...   | ...                                     | ...          | ...
  1. Restrict capitalization in name
  2. Remove extra divider space in inspection_type
  3. Make census_tract values have a uniform length
Cleaning Data in PostgreSQL Databases

Cleaning string data

 name                            | grade |            inspection_type              | census_tract | ... 
 --------------------------------+-------+-----------------------------------------+--------------+-----
 ...                             | ...   | ...                                     | ...          | ...
 EMPANADAS MONUMENTAL            | B     | Cycle Inspection  /  Re-inspection      | 26900        | ...
 ALPHONSO'S PIZZERIA & TRATTORIA | A     | Cycle Inspection  /  Initial Inspection | 202          | ...
 THE SPARROW TAVERN              | A     | Cycle Inspection  /  Initial Inspection | 12500        | ...
 BURGER KING                     | A     | Cycle Inspection  /  Re-inspection      | 86400        | ...
 ASTORIA PIZZA                   | B     | Cycle Inspection  /  Re-inspection      | 6300         | ...
 ...                             | ...   | ...                                     | ...          | ...
 name                            | grade |           inspection_type             | census_tract | ... 
 --------------------------------+-------+---------------------------------------+--------------+-----
 ...                             | ...   | ...                                   | ...          | ...
 Empanadas Monumental            | B     | Cycle Inspection / Re-inspection      | 026900       | ...
 Alphonso'S Pizzeria & Trattoria | A     | Cycle Inspection / Initial Inspection | 000202       | ...
 The Sparrow Tavern              | A     | Cycle Inspection / Initial Inspection | 012500       | ...
 Burger King                     | A     | Cycle Inspection / Re-inspection      | 086400       | ...
 Astoria Pizza                   | B     | Cycle Inspection / Re-inspection      | 006300       | ...
 ...                             | ...   | ...                                   | ...          | ...
Cleaning Data in PostgreSQL Databases

Using the INITCAP() function

INITCAP(input_string) - fixing capitalization

SELECT INITCAP('HELLO FRIEND!');
Hello Friend!
Cleaning Data in PostgreSQL Databases

Using the REPLACE() function

REPLACE(input_string, to_replace, replacement) - replacing one text value with another

SELECT REPLACE('180 Main Street', 'Street', 'St');
180 Main St
Cleaning Data in PostgreSQL Databases

Using the LPAD() function

LPAD(input_string, length [, fill_value]) - prepending text values to a string

SELECT LPAD('123', 7, 'X');
XXXX123
Cleaning Data in PostgreSQL Databases

Building the string cleaning query

SELECT
  INITCAP(name) as name,
  grade,
  REPLACE(inspection_type, '  /  ', ' / ') as inspection_type, 
  LPAD(census_tract, 6, '0') as census_tract
FROM
  restaurant_inspection;
 name                            | grade |           inspection_type             | census_tract | ... 
 --------------------------------+-------+---------------------------------------+--------------+-----
 ...                             | ...   | ...                                   | ...          | ...
 Empanadas Monumental            | B     | Cycle Inspection / Re-inspection      | 026900       | ...
 Alphonso'S Pizzeria & Trattoria | A     | Cycle Inspection / Initial Inspection | 000202       | ...
 The Sparrow Tavern              | A     | Cycle Inspection / Initial Inspection | 012500       | ...
 Burger King                     | A     | Cycle Inspection / Re-inspection      | 086400       | ...
 Astoria Pizza                   | B     | Cycle Inspection / Re-inspection      | 006300       | ...
 ...                             | ...   | ...                                   | ...          | ...
Cleaning Data in PostgreSQL Databases

Let's practice!

Cleaning Data in PostgreSQL Databases

Preparing Video For Download...