Data cleaning

Data Warehousing Concepts

Aaren Stubberfield

Data Scientist

Video agenda

  • Data format revision
  • Address parsing
  • Data validation
  • De-duplication
Data Warehousing Concepts

Data format cleaning

  • Update values to an expected format
    • Dates
    • Names of options
    • Capitalization
  • Ensures output is in a consistent format

Taxi data example

Two tables combined into one

Data Warehousing Concepts

Address parsing

  • Dividing a street address into its components
  • Can use tools to validate addresses
Address
1234 S Normal St, Cleveland, OH 44102
Address City State Zip
1234 S Normal St Cleveland OH 44102
Data Warehousing Concepts

Data validation

  • Range check
    • Is the value within the expected range?
    • Example: A person's age
  • Type check
    • Is the value the proper data type?
    • Example: Storing age as string vs number

Table of Age with the first row listed as 300, which is flagged as not valid.

Table with three columns with Age as the first column, listed as a data type of string, which is flagged as not valid.

Data Warehousing Concepts

Duplicate row elimination

  • This process gets rid of duplicate entries

image of two tables with two columns to be appended together

the final appended table without the duplicate row

Data Warehousing Concepts

Data governance

Regulatory compliance illustration

Data Warehousing Concepts

Let's practice!

Data Warehousing Concepts

Preparing Video For Download...