Introductie tot opschonen van data

Data opschonen in PostgreSQL-databases

Darryl Reeves, Ph.D.

Industry Assistant Professor, New York University

Waarom data opschonen?

  • Data is rommelig
  • Voor analyse is vaak opschonen nodig
  • Kolomtypebeperkingen helpen
  • Focus op gevallen zonder/naast defensieve aanpakken
Data opschonen in PostgreSQL-databases

Strings opschonen

  • Stringdata: veelvoorkomend, flexibel en vaak rommelig
 name                            | grade |            inspection_type              | census_tract | ... 
 --------------------------------+-------+-----------------------------------------+--------------+-----
 ...                             | ...   | ...                                     | ...          | ...
 EMPANADAS MONUMENTAL            | B     | Cycle Inspection  /  Re-inspection      | 26900        | ...
 ALPHONSO'S PIZZERIA & TRATTORIA | A     | Cycle Inspection  /  Initial Inspection | 202          | ...
 THE SPARROW TAVERN              | A     | Cycle Inspection  /  Initial Inspection | 12500        | ...
 BURGER KING                     | A     | Cycle Inspection  /  Re-inspection      | 86400        | ...
 ASTORIA PIZZA                   | B     | Cycle Inspection  /  Re-inspection      | 6300         | ...
 ...                             | ...   | ...                                     | ...          | ...
Data opschonen in PostgreSQL-databases

Strings opschonen

 name                            | grade |            inspection_type              | census_tract | ... 
 --------------------------------+-------+-----------------------------------------+--------------+-----
 ...                             | ...   | ...                                     | ...          | ...
 EMPANADAS MONUMENTAL            | B     | Cycle Inspection  /  Re-inspection      | 26900        | ...
 ALPHONSO'S PIZZERIA & TRATTORIA | A     | Cycle Inspection  /  Initial Inspection | 202          | ...
 THE SPARROW TAVERN              | A     | Cycle Inspection  /  Initial Inspection | 12500        | ...
 BURGER KING                     | A     | Cycle Inspection  /  Re-inspection      | 86400        | ...
 ASTORIA PIZZA                   | B     | Cycle Inspection  /  Re-inspection      | 6300         | ...
 ...                             | ...   | ...                                     | ...          | ...
  1. Beperk kapitalisatie in name
  2. Verwijder extra spaties rond de scheiding in inspection_type
  3. Maak census_tract overal even lang
Data opschonen in PostgreSQL-databases

Strings opschonen

 name                            | grade |            inspection_type              | census_tract | ... 
 --------------------------------+-------+-----------------------------------------+--------------+-----
 ...                             | ...   | ...                                     | ...          | ...
 EMPANADAS MONUMENTAL            | B     | Cycle Inspection  /  Re-inspection      | 26900        | ...
 ALPHONSO'S PIZZERIA & TRATTORIA | A     | Cycle Inspection  /  Initial Inspection | 202          | ...
 THE SPARROW TAVERN              | A     | Cycle Inspection  /  Initial Inspection | 12500        | ...
 BURGER KING                     | A     | Cycle Inspection  /  Re-inspection      | 86400        | ...
 ASTORIA PIZZA                   | B     | Cycle Inspection  /  Re-inspection      | 6300         | ...
 ...                             | ...   | ...                                     | ...          | ...
 name                            | grade |           inspection_type             | census_tract | ... 
 --------------------------------+-------+---------------------------------------+--------------+-----
 ...                             | ...   | ...                                   | ...          | ...
 Empanadas Monumental            | B     | Cycle Inspection / Re-inspection      | 026900       | ...
 Alphonso'S Pizzeria & Trattoria | A     | Cycle Inspection / Initial Inspection | 000202       | ...
 The Sparrow Tavern              | A     | Cycle Inspection / Initial Inspection | 012500       | ...
 Burger King                     | A     | Cycle Inspection / Re-inspection      | 086400       | ...
 Astoria Pizza                   | B     | Cycle Inspection / Re-inspection      | 006300       | ...
 ...                             | ...   | ...                                   | ...          | ...
Data opschonen in PostgreSQL-databases

De INITCAP()-functie gebruiken

INITCAP(input_string) - corrigeert hoofdletters

SELECT INITCAP('HELLO FRIEND!');
Hello Friend!
Data opschonen in PostgreSQL-databases

De REPLACE()-functie gebruiken

REPLACE(input_string, to_replace, replacement) - vervangt tekst door andere tekst

SELECT REPLACE('180 Main Street', 'Street', 'St');
180 Main St
Data opschonen in PostgreSQL-databases

De LPAD()-functie gebruiken

LPAD(input_string, length [, fill_value]) - voegt vooraf tekens toe aan een string

SELECT LPAD('123', 7, 'X');
XXXX123
Data opschonen in PostgreSQL-databases

De query voor stringopschoning bouwen

SELECT
  INITCAP(name) as name,
  grade,
  REPLACE(inspection_type, '  /  ', ' / ') as inspection_type, 
  LPAD(census_tract, 6, '0') as census_tract
FROM
  restaurant_inspection;
 name                            | grade |           inspection_type             | census_tract | ... 
 --------------------------------+-------+---------------------------------------+--------------+-----
 ...                             | ...   | ...                                   | ...          | ...
 Empanadas Monumental            | B     | Cycle Inspection / Re-inspection      | 026900       | ...
 Alphonso'S Pizzeria & Trattoria | A     | Cycle Inspection / Initial Inspection | 000202       | ...
 The Sparrow Tavern              | A     | Cycle Inspection / Initial Inspection | 012500       | ...
 Burger King                     | A     | Cycle Inspection / Re-inspection      | 086400       | ...
 Astoria Pizza                   | B     | Cycle Inspection / Re-inspection      | 006300       | ...
 ...                             | ...   | ...                                   | ...          | ...
Data opschonen in PostgreSQL-databases

Laten we oefenen!

Data opschonen in PostgreSQL-databases

Preparing Video For Download...