Data cleaning

Concetti di Data Warehousing

Aaren Stubberfield

Data Scientist

Video agenda

  • Data format revision
  • Address parsing
  • Data validation
  • De-duplication
Concetti di Data Warehousing

Data format cleaning

  • Update values to an expected format
    • Dates
    • Names of options
    • Capitalization
  • Ensures output is in a consistent format

Taxi data example

Two tables combined into one

Concetti di Data Warehousing

Address parsing

  • Dividing a street address into its components
  • Can use tools to validate addresses
Address
1234 S Normal St, Cleveland, OH 44102
Address City State Zip
1234 S Normal St Cleveland OH 44102
Concetti di Data Warehousing

Data validation

  • Range check
    • Is the value within the expected range?
    • Example: A person's age
  • Type check
    • Is the value the proper data type?
    • Example: Storing age as string vs number

Table of Age with the first row listed as 300, which is flagged as not valid.

Table with three columns with Age as the first column, listed as a data type of string, which is flagged as not valid.

Concetti di Data Warehousing

Duplicate row elimination

  • This process gets rid of duplicate entries

image of two tables with two columns to be appended together

the final appended table without the duplicate row

Concetti di Data Warehousing

Data governance

Regulatory compliance illustration

Concetti di Data Warehousing

Let's practice!

Concetti di Data Warehousing

Preparing Video For Download...