Intro to data cleaning with Apache Spark

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What is Data Cleaning?

Data Cleaning: Preparing raw data for use in data processing pipelines.

Possible tasks in data cleaning (each sketched in the code below):

  • Reformatting or replacing text
  • Performing calculations
  • Removing garbage or incomplete data
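
A minimal sketch of these tasks in PySpark; the SparkSession setup, sample rows, and column names are assumptions for illustration:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('cleaning-tasks').getOrCreate()

# Hypothetical raw data: messy text plus an incomplete row
df = spark.createDataFrame(
    [(' smith, john ', 37), ('Wilson, A.', 59), (None, 215)],
    ['name', 'age']
)

cleaned = (
    df
    # Reformatting text: trim whitespace and normalize capitalization
    .withColumn('name', F.initcap(F.trim(F.col('name'))))
    # Performing calculations: convert age in years to months
    .withColumn('age_months', F.col('age') * 12)
    # Removing garbage or incomplete data: drop rows missing a name
    .dropna(subset=['name'])
)
cleaned.show()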

Why perform data cleaning with Spark?

Problems with typical data systems:

  • Performance limits as data volumes grow
  • Difficulty organizing complex data flows

Advantages of Spark:

  • Scales to very large quantities of data
  • Powerful framework for data handling and transformation

Data cleaning example

Raw data:

name          age (years)   city
Smith, John   37            Dallas
Wilson, A.    59            Chicago
null          215

Cleaned data:

last name   first name   age (months)   state
Smith       John         444            TX
Wilson      A.           708            IL
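
A hedged sketch of one way to express this transformation in PySpark; the sample DataFrame, the city-to-state lookup, and the column names are assumptions, not the course's exact code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame matching the raw table above
raw_df = spark.createDataFrame(
    [('Smith, John', 37, 'Dallas'), ('Wilson, A.', 59, 'Chicago'), (None, 215, None)],
    ['name', 'age', 'city']
)

# Hypothetical lookup mapping each city to its state abbreviation
city_to_state = F.create_map(
    F.lit('Dallas'), F.lit('TX'),
    F.lit('Chicago'), F.lit('IL')
)

cleaned_df = (
    raw_df
    # Remove the garbage row (null name, missing city)
    .dropna(subset=['name', 'city'])
    # Split 'Smith, John' into separate last and first name columns
    .withColumn('last_name', F.split('name', ', ').getItem(0))
    .withColumn('first_name', F.split('name', ', ').getItem(1))
    # Convert age in years to age in months
    .withColumn('age_months', F.col('age') * 12)
    # Replace each city with its state abbreviation
    .withColumn('state', city_to_state[F.col('city')])
    .select('last_name', 'first_name', 'age_months', 'state')
)
cleaned_df.show()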

Spark Schemas

  • Define the format of a DataFrame
  • May contain various data types:
    • Strings, dates, integers, arrays
  • Can filter garbage data during import
  • Improves read performance

Example Spark Schema

Import types and define the schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
peopleSchema = StructType([
  # Define the name field
  StructField('name', StringType(), True),
  # Add the age field
  StructField('age', IntegerType(), True),
  # Add the city field
  StructField('city', StringType(), True)  
])

Read CSV file containing data

people_df = spark.read.format('csv').load(path='rawdata.csv', schema=peopleSchema)
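
One concrete way to use the schema to filter garbage data during import: Spark's CSV reader accepts a mode option, and DROPMALFORMED discards rows that fail to parse against the supplied schema. A minimal sketch, reusing peopleSchema from above:

people_df = (
    spark.read
    .format('csv')
    .schema(peopleSchema)
    # Drop rows whose fields cannot be parsed into the schema's types
    .option('mode', 'DROPMALFORMED')
    .load('rawdata.csv')
)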

Let's practice!

