Intro to data cleaning with Apache Spark

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What is Data Cleaning?

Data Cleaning: Preparing raw data for use in data processing pipelines.

Possible tasks in data cleaning (each sketched in the code below):

  • Reformatting or replacing text
  • Performing calculations
  • Removing garbage or incomplete data
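
A minimal sketch of these tasks in PySpark; the SparkSession setup, sample rows, and column names are assumptions for illustration:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('cleaning-tasks').getOrCreate()

# Hypothetical raw data: messy text plus an incomplete row
df = spark.createDataFrame(
    [(' smith, john ', 37), ('Wilson, A.', 59), (None, 215)],
    ['name', 'age']
)

cleaned = (
    df
    # Reformatting text: trim whitespace and normalize capitalization
    .withColumn('name', F.initcap(F.trim(F.col('name'))))
    # Performing calculations: convert age in years to months
    .withColumn('age_months', F.col('age') * 12)
    # Removing garbage or incomplete data: drop rows missing a name
    .dropna(subset=['name'])
)
cleaned.show()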

Why perform data cleaning with Spark?

Problems with typical data systems:

  • Performance limits as data volumes grow
  • Difficulty organizing complex data flows

Advantages of Spark:

  • Scales to very large quantities of data
  • Powerful framework for data handling and transformation

Data cleaning example

Raw data:

name          age (years)   city
Smith, John   37            Dallas
Wilson, A.    59            Chicago
null          215

Cleaned data:

last name   first name   age (months)   state
Smith       John         444            TX
Wilson      A.           708            IL
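
A hedged sketch of one way to express this transformation in PySpark; the sample DataFrame, the city-to-state lookup, and the column names are assumptions, not the course's exact code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame matching the raw table above
raw_df = spark.createDataFrame(
    [('Smith, John', 37, 'Dallas'), ('Wilson, A.', 59, 'Chicago'), (None, 215, None)],
    ['name', 'age', 'city']
)

# Hypothetical lookup mapping each city to its state abbreviation
city_to_state = F.create_map(
    F.lit('Dallas'), F.lit('TX'),
    F.lit('Chicago'), F.lit('IL')
)

cleaned_df = (
    raw_df
    # Remove the garbage row (null name, missing city)
    .dropna(subset=['name', 'city'])
    # Split 'Smith, John' into separate last and first name columns
    .withColumn('last_name', F.split('name', ', ').getItem(0))
    .withColumn('first_name', F.split('name', ', ').getItem(1))
    # Convert age in years to age in months
    .withColumn('age_months', F.col('age') * 12)
    # Replace each city with its state abbreviation
    .withColumn('state', city_to_state[F.col('city')])
    .select('last_name', 'first_name', 'age_months', 'state')
)
cleaned_df.show()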

Spark Schemas

  • Define the format of a DataFrame
  • May contain various data types:
    • Strings, dates, integers, arrays
  • Can filter garbage data during import
  • Improves read performance

Example Spark Schema

Import types and define the schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
peopleSchema = StructType([
  # Define the name field
  StructField('name', StringType(), True),
  # Add the age field
  StructField('age', IntegerType(), True),
  # Add the city field
  StructField('city', StringType(), True)  
])

Read CSV file containing data

people_df = spark.read.format('csv').load(path='rawdata.csv', schema=peopleSchema)
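
One concrete way to use the schema to filter garbage data during import: Spark's CSV reader accepts a mode option, and DROPMALFORMED discards rows that fail to parse against the supplied schema. A minimal sketch, reusing peopleSchema from above:

people_df = (
    spark.read
    .format('csv')
    .schema(peopleSchema)
    # Drop rows whose fields cannot be parsed into the schema's types
    .option('mode', 'DROPMALFORMED')
    .load('rawdata.csv')
)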

Let's practice!

