Partitioning and lazy processing

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Partitioning

  • DataFrames are broken up into partitions
  • Partition size can vary
  • Each partition is handled independently (see the sketch below)
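A minimal sketch of inspecting and changing partitioning (assuming a local SparkSession; the DataFrame here is illustrative, not the course's data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A small illustrative DataFrame; in practice this would come from reading a file
    df = spark.range(0, 100000)

    # DataFrames are split into partitions that executors process independently
    print(df.rdd.getNumPartitions())

    # repartition() changes how many partitions the data is spread across
    df = df.repartition(8)
    print(df.rdd.getNumPartitions())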

Lazy processing

  • Transformations are lazy
    • .withColumn(...)
    • .select(...)
  • Nothing is actually done until an action is performed
    • .count()
    • .write(...)
  • Spark can re-order transformations for best performance
  • This can sometimes cause unexpected behavior (see the sketch below)
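A minimal sketch of this behavior, using an illustrative DataFrame rather than the course's data: the transformations only build an execution plan, and nothing runs until the action at the end.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data and column names
    df = spark.createDataFrame(
        [('Smith', 'John', 'TX'), ('Wilson', 'A.', 'IL'), ('Adams', 'Wendy', 'OR')],
        ['last_name', 'first_name', 'state'])

    # Transformations: each call only adds a step to the plan
    df = df.withColumn('state_lower', F.lower(df['state']))
    df = df.select('last_name', 'state_lower')

    # Action: only now does Spark execute the plan and produce output
    print(df.count())
    df.show()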

Adding IDs

Normal ID fields:

  • Common in relational databases
  • Usually an integer that is increasing, sequential, and unique
  • Not easy to generate in parallel (a sketch follows the table below)
  id   last name   first name   state
  0    Smith       John         TX
  1    Wilson      A.           IL
  2    Adams       Wendy        OR
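For context, a hedged sketch of a traditional sequential ID using Python's built-in sqlite3 module; the table and data are illustrative only, not part of the course material:

    import sqlite3

    # A throwaway in-memory relational database
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE people ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, '
        'last_name TEXT, first_name TEXT, state TEXT)')
    conn.executemany(
        'INSERT INTO people (last_name, first_name, state) VALUES (?, ?, ?)',
        [('Smith', 'John', 'TX'), ('Wilson', 'A.', 'IL'), ('Adams', 'Wendy', 'OR')])

    # The database assigns IDs 1, 2, 3: sequential and unique, but generated one at a time
    for row in conn.execute('SELECT * FROM people'):
        print(row)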

Monotonically increasing IDs

pyspark.sql.functions.monotonically_increasing_id()

  • A 64-bit integer that increases in value and is unique
  • Not necessarily sequential (gaps exist)
  • Generated completely in parallel (a usage sketch follows the table below)
  id          last name   first name   state
  0           Smith       John         TX
  134520871   Wilson      A.           IL
  675824594   Adams       Wendy        OR
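A minimal usage sketch, with an illustrative DataFrame standing in for the course's data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    people = spark.createDataFrame(
        [('Smith', 'John', 'TX'), ('Wilson', 'A.', 'IL'), ('Adams', 'Wendy', 'OR')],
        ['last_name', 'first_name', 'state'])

    # Each row gets a 64-bit ID: increasing and unique, but with gaps between partitions
    people = people.withColumn('id', F.monotonically_increasing_id())
    people.show()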

Notes

Remember, Spark is lazy!

  • IDs are occasionally assigned out of the expected order
  • If performing a join, the ID may be assigned after the join completes
  • Test your transformations to verify they behave as expected (see the sketch below)
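A hedged sketch of testing such a transformation around a join; the DataFrames and column names are hypothetical, not the course's exercise data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    flights = spark.createDataFrame([('DFW',), ('ORD',)], ['airport'])
    airports = spark.createDataFrame(
        [('DFW', 'Dallas'), ('ORD', 'Chicago')], ['airport', 'city'])

    # Because Spark is lazy, the ID column is not computed until an action runs,
    # so it may effectively be assigned after the join is planned
    flights = flights.withColumn('flight_id', F.monotonically_increasing_id())

    joined = flights.join(airports, on='airport')

    # Test the transformation: inspect the IDs to confirm they look as expected
    joined.show()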

Let's practice!

