Introduction to Data Pipelines

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What is a data pipeline?

  • A set of steps to process data from source(s) to final output
  • Can consist of any number of steps or components
  • Can span many systems
  • We will focus on data pipelines within Spark

What does a data pipeline look like?

  • Input(s)
    • CSV, JSON, web services, databases
  • Transformations
  • .withColumn(), .filter(), .drop()
  • Output(s)
    • CSV, Parquet, database
  • Validation
  • Analysis
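
A minimal sketch of the input, transformation, and output stages as PySpark code (the file and column names here are hypothetical, chosen only for illustration):

    # Input: read a CSV file into a DataFrame (hypothetical file name)
    df = spark.read.format('csv') \
        .option('header', True) \
        .option('inferSchema', True) \
        .load('sales.csv')

    # Transformations: derive, filter, and drop columns (hypothetical columns)
    df = df.withColumn('total', df.price * df.quantity)
    df = df.filter(df.quantity > 0)
    df = df.drop('internal_note')

    # Output: write the result as Parquet
    df.write.parquet('sales_clean.parquet')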

Pipeline details

  • Not formally defined in Spark
  • Typically consists of the normal Spark code required for the task:
    from pyspark.sql.types import StructType, StructField, StringType
    from pyspark.sql.functions import monotonically_increasing_id

    # Input: define the schema and read the source file
    schema = StructType([
        StructField('name', StringType(), False),
        StructField('age', StringType(), False)
    ])
    df = spark.read.format('csv').schema(schema).load('datafile')

    # Transformation: add a unique ID column
    df = df.withColumn('id', monotonically_increasing_id())
    ...

    # Output: write the results in multiple formats
    df.write.parquet('outdata.parquet')
    df.write.json('outdata.json')
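
The validation step isn't shown above; a minimal sketch of what it might look like, assuming the Parquet output and the generated 'id' column from the code above:

    # Validation sketch: re-read the output and check basic expectations
    out_df = spark.read.parquet('outdata.parquet')
    assert out_df.count() == df.count()   # no rows lost on write
    assert 'id' in out_df.columns         # generated column is present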
    

Let's Practice!
