Partitioning and lazy processing

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Partitioning

  • DataFrames are broken up into partitions
  • Partition size can vary
  • Each partition is handled independently (see the sketch below)
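A minimal sketch of inspecting and changing partitioning (assuming a local SparkSession; the DataFrame here is illustrative, not the course's data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A small illustrative DataFrame; in practice this would come from reading a file
    df = spark.range(0, 100000)

    # DataFrames are split into partitions that executors process independently
    print(df.rdd.getNumPartitions())

    # repartition() changes how many partitions the data is spread across
    df = df.repartition(8)
    print(df.rdd.getNumPartitions())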

Lazy processing

  • Transformations are lazy
    • .withColumn(...)
    • .select(...)
  • Nothing is actually done until an action is performed
    • .count()
    • .write(...)
  • Spark can re-order transformations for best performance
  • This can sometimes cause unexpected behavior (see the sketch below)
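A minimal sketch of this behavior, using an illustrative DataFrame rather than the course's data: the transformations only build an execution plan, and nothing runs until the action at the end.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data and column names
    df = spark.createDataFrame(
        [('Smith', 'John', 'TX'), ('Wilson', 'A.', 'IL'), ('Adams', 'Wendy', 'OR')],
        ['last_name', 'first_name', 'state'])

    # Transformations: each call only adds a step to the plan
    df = df.withColumn('state_lower', F.lower(df['state']))
    df = df.select('last_name', 'state_lower')

    # Action: only now does Spark execute the plan and produce output
    print(df.count())
    df.show()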

Adding IDs

Normal ID fields:

  • Common in relational databases
  • Usually an integer that is increasing, sequential, and unique
  • Not easy to generate in parallel (a sketch follows the table below)
  id   last name   first name   state
  0    Smith       John         TX
  1    Wilson      A.           IL
  2    Adams       Wendy        OR
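For context, a hedged sketch of a traditional sequential ID using Python's built-in sqlite3 module; the table and data are illustrative only, not part of the course material:

    import sqlite3

    # A throwaway in-memory relational database
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE people ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, '
        'last_name TEXT, first_name TEXT, state TEXT)')
    conn.executemany(
        'INSERT INTO people (last_name, first_name, state) VALUES (?, ?, ?)',
        [('Smith', 'John', 'TX'), ('Wilson', 'A.', 'IL'), ('Adams', 'Wendy', 'OR')])

    # The database assigns IDs 1, 2, 3: sequential and unique, but generated one at a time
    for row in conn.execute('SELECT * FROM people'):
        print(row)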

Monotonically increasing IDs

pyspark.sql.functions.monotonically_increasing_id()

  • A 64-bit integer that increases in value and is unique
  • Not necessarily sequential (gaps exist)
  • Generated completely in parallel (a usage sketch follows the table below)
  id          last name   first name   state
  0           Smith       John         TX
  134520871   Wilson      A.           IL
  675824594   Adams       Wendy        OR
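A minimal usage sketch, with an illustrative DataFrame standing in for the course's data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    people = spark.createDataFrame(
        [('Smith', 'John', 'TX'), ('Wilson', 'A.', 'IL'), ('Adams', 'Wendy', 'OR')],
        ['last_name', 'first_name', 'state'])

    # Each row gets a 64-bit ID: increasing and unique, but with gaps between partitions
    people = people.withColumn('id', F.monotonically_increasing_id())
    people.show()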

Notes

Remember, Spark is lazy!

  • IDs are occasionally assigned out of the expected order
  • If performing a join, the ID may be assigned after the join completes
  • Test your transformations to verify they behave as expected (see the sketch below)
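A hedged sketch of testing such a transformation around a join; the DataFrames and column names are hypothetical, not the course's exercise data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    flights = spark.createDataFrame([('DFW',), ('ORD',)], ['airport'])
    airports = spark.createDataFrame(
        [('DFW', 'Dallas'), ('ORD', 'Chicago')], ['airport', 'city'])

    # Because Spark is lazy, the ID column is not computed until an action runs,
    # so it may effectively be assigned after the join is planned
    flights = flights.withColumn('flight_id', F.monotonically_increasing_id())

    joined = flights.join(airports, on='airport')

    # Test the transformation: inspect the IDs to confirm they look as expected
    joined.show()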

Let's practice!

