Caching

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

What is caching?

Caching in Spark:

  • Stores DataFrames in memory or on disk
  • Improves speed on later transformations / actions
  • Reduces resource usage

Disadvantages of caching

  • Very large data sets may not fit in memory
  • Local disk-based caching may not improve performance
  • Cached data may be evicted and unavailable when needed

Caching tips

When developing Spark tasks:

  • Cache only if you need it
  • Try caching DataFrames at various points and determine if your performance improves
  • Prefer caching to memory or fast SSD / NVMe storage
  • Cache to slow local disk if needed
  • Use intermediate files!
  • Stop caching objects when finished

Implementing caching

Call .cache() on the DataFrame before the first action

from pyspark.sql.functions import monotonically_increasing_id

voter_df = spark.read.csv('voter_data.txt.gz')
voter_df.cache().count()

voter_df = voter_df.withColumn('ID', monotonically_increasing_id())
voter_df = voter_df.cache()
voter_df.show()

More cache operations

Check .is_cached to determine cache status

print(voter_df.is_cached)
True

Call .unpersist() when finished with DataFrame

voter_df.unpersist()

Let's Practice!

