Data cleaning
Data Warehousing Concepts
Aaren Stubberfield
Data Scientist
Video agenda
Data format cleaning
Address parsing
Data validation
De-duplication
Data format cleaning
Update values to an expected format
Dates
Names of options
Capitalization
Ensures output is in a consistent format
Taxi data example
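The format cleaning described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the list of accepted date formats, the payment-option mapping, and the function names clean_date and clean_option are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical list of date formats seen in the raw data (an assumption).
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

# Hypothetical mapping from raw option spellings to one canonical name.
PAYMENT_OPTIONS = {
    "cc": "Credit card",
    "credit": "Credit card",
    "credit card": "Credit card",
    "csh": "Cash",
    "cash": "Cash",
}


def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


def clean_option(raw):
    """Map an option name to its canonical spelling and capitalization."""
    key = raw.strip().lower()
    return PAYMENT_OPTIONS.get(key, "Unknown")
```

Returning None or "Unknown" instead of raising keeps bad values visible, so they can be counted and reviewed rather than silently dropped.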
Address parsing
Dividing a street address into its components
Can use tools to validate addresses
Address
1234 S Normal St, Cleveland, OH 44102
Address            City       State  Zip
1234 S Normal St   Cleveland  OH     44102
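As a rough sketch, the split shown above can be done with a regular expression. The pattern below is an assumption that only covers US addresses written as "street, city, ST 12345"; real address data usually needs a dedicated parsing or validation tool. The names ADDRESS_PATTERN and parse_address are hypothetical.

```python
import re

# Assumed layout: "<street>, <city>, <2-letter state> <5-digit ZIP>".
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<address>[^,]+?)\s*,"
    r"\s*(?P<city>[^,]+?)\s*,"
    r"\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})\s*$"
)


def parse_address(raw):
    """Split an address string into its components, or return None."""
    match = ADDRESS_PATTERN.match(raw)
    if match is None:
        return None  # layout not recognized; flag for manual review
    parts = match.groupdict()
    parts["state"] = parts["state"].upper()
    return parts


parsed = parse_address("1234 S Normal St, Cleveland, OH 44102")
```

The ZIP code stays a string so that leading zeros are preserved.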
Data validation
Range check
Is the value within the expected range?
Example: A person's age
Type check
Is the value the proper data type?
Example: Storing age as string vs number
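The two checks above might look like this for the age example. The bounds 0 to 120 and the function names are assumptions chosen for illustration; a real warehouse would take its limits from business rules.

```python
def check_type(value, expected_type):
    """Type check: is the value the proper data type?"""
    # bool is a subclass of int in Python, so reject it for non-bool checks.
    if isinstance(value, bool) and expected_type is not bool:
        return False
    return isinstance(value, expected_type)


def check_range(value, low, high):
    """Range check: is the value within the expected range (inclusive)?"""
    return low <= value <= high


def validate_age(value, low=0, high=120):
    """Return a list of failed checks; an empty list means the value passed."""
    if not check_type(value, int):
        return ["type"]  # range check is meaningless on the wrong type
    if not check_range(value, low, high):
        return ["range"]
    return []
```

The type check runs first because comparing a string such as "34" against numeric bounds would raise an error instead of reporting a problem.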
Duplicate row elimination
This process gets rid of duplicate entries
Data governance
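One simple way to eliminate duplicate rows is to remember which rows have already been seen and keep only the first occurrence. This is a sketch under assumptions: rows are dictionaries with hashable values, and the function name drop_duplicate_rows and the sample trip records are hypothetical.

```python
def drop_duplicate_rows(rows, key_fields=None):
    """Keep the first occurrence of each row, preserving input order.

    If key_fields is given, two rows count as duplicates when those
    fields match; otherwise all fields are compared.
    """
    seen = set()
    unique = []
    for row in rows:
        fields = key_fields if key_fields is not None else sorted(row)
        key = tuple((field, row.get(field)) for field in fields)
        if key in seen:
            continue  # duplicate entry, skip it
        seen.add(key)
        unique.append(row)
    return unique


# Hypothetical sample records.
trips = [
    {"trip_id": 1, "fare": 12.5},
    {"trip_id": 2, "fare": 8.0},
    {"trip_id": 1, "fare": 12.5},
    {"trip_id": 2, "fare": 9.0},
]
exact = drop_duplicate_rows(trips)
by_id = drop_duplicate_rows(trips, key_fields=["trip_id"])
```

Comparing all fields removes only exact copies, while comparing on a key such as trip_id also removes conflicting versions of the same record, so the choice of key is a policy decision.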
Let's practice!