Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Validation is:
parsed_df = spark.read.parquet('parsed_data.parquet')
company_df = spark.read.parquet('companies.parquet')
verified_df = parsed_df.join(company_df, parsed_df.company == company_df.company)
This inner join automatically removes any rows whose company is not in company_df
Using Spark components to validate logic: