Data validation

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Definition

Validation is:

  • Verifying that a dataset complies with the expected format:
      • Number of rows / columns
      • Data types
      • Complex validation rules

Validating via joins

  • Compares data against known values
  • Easy to find data in a given set
  • Comparatively fast
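
# Data to validate, and the reference set of known-good companies
parsed_df = spark.read.parquet('parsed_data.parquet')
company_df = spark.read.parquet('companies.parquet')
# Inner join (the default): keeps only rows whose company appears in company_df
verified_df = parsed_df.join(company_df, parsed_df.company == company_df.company)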

Because this is an inner join, any row whose company is not in company_df is removed automatically!

Complex rule validation

Using Spark components to validate logic:

  • Calculations
  • Verifying against external source
  • Likely uses a UDF to modify / verify the DataFrame

Let's practice!
