Caching

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What is caching?

Caching in Spark:

  • Stores DataFrames in memory or on disk
  • Improves speed on later transformations / actions
  • Reduces resource usage

Disadvantages of caching

  • Very large data sets may not fit in memory
  • Local disk-based caching may not improve performance
  • Cached data may be evicted and unavailable when needed

Caching tips

When developing Spark tasks:

  • Cache only if you need it
  • Try caching DataFrames at various points and determine if your performance improves
  • Prefer caching to memory or fast SSD / NVMe storage
  • Cache to slow local disk if needed
  • Use intermediate files!
  • Stop caching objects when finished

Implementing caching

Call .cache() on the DataFrame before the first action

from pyspark.sql.functions import monotonically_increasing_id

voter_df = spark.read.csv('voter_data.txt.gz')
voter_df.cache().count()

voter_df = voter_df.withColumn('ID', monotonically_increasing_id())
voter_df = voter_df.cache()
voter_df.show()

More cache operations

Check .is_cached to determine cache status

print(voter_df.is_cached)
True

Call .unpersist() when finished with DataFrame

voter_df.unpersist()

Let's Practice!

