Introduction to Data Pipelines

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What is a data pipeline?

  • A set of steps to process data from source(s) to final output
  • Can consist of any number of steps or components
  • Can span many systems
  • We will focus on data pipelines within Spark

What does a data pipeline look like?

  • Input(s)
    • CSV, JSON, web services, databases
  • Transformations
  • .withColumn(), .filter(), .drop()
  • Output(s)
    • CSV, Parquet, database
  • Validation
  • Analysis
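
A minimal sketch of the input, transformation, and output stages as PySpark code (the file and column names here are hypothetical, chosen only for illustration):

    # Input: read a CSV file into a DataFrame (hypothetical file name)
    df = spark.read.format('csv') \
        .option('header', True) \
        .option('inferSchema', True) \
        .load('sales.csv')

    # Transformations: derive, filter, and drop columns (hypothetical columns)
    df = df.withColumn('total', df.price * df.quantity)
    df = df.filter(df.quantity > 0)
    df = df.drop('internal_note')

    # Output: write the result as Parquet
    df.write.parquet('sales_clean.parquet')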

Pipeline details

  • Not formally defined in Spark
  • Typically consists of the normal Spark code required for the task:
    from pyspark.sql.types import StructType, StructField, StringType
    from pyspark.sql.functions import monotonically_increasing_id

    # Input: define the schema and read the source file
    schema = StructType([
        StructField('name', StringType(), False),
        StructField('age', StringType(), False)
    ])
    df = spark.read.format('csv').schema(schema).load('datafile')

    # Transformation: add a unique ID column
    df = df.withColumn('id', monotonically_increasing_id())
    ...

    # Output: write the results in multiple formats
    df.write.parquet('outdata.parquet')
    df.write.json('outdata.json')
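
The validation step isn't shown above; a minimal sketch of what it might look like, assuming the Parquet output and the generated 'id' column from the code above:

    # Validation sketch: re-read the output and check basic expectations
    out_df = spark.read.parquet('outdata.parquet')
    assert out_df.count() == df.count()   # no rows lost on write
    assert 'id' in out_df.columns         # generated column is present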
    

Let's Practice!
