Caching

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

What is caching?

  • Keeping data in memory
  • Spark tends to evict cached data from memory aggressively

Eviction Policy

  • Least Recently Used (LRU)
  • Eviction happens independently on each worker
  • Depends on memory available to each worker

Caching a dataframe

To cache a dataframe:
df.cache()
To uncache it:
df.unpersist()

Determining whether a dataframe is cached

df.is_cached
False
df.cache()
df.is_cached
True

Uncaching a dataframe

df.unpersist()
df.is_cached
False

Storage level

df.unpersist()
df.cache()
df.storageLevel
StorageLevel(True, True, False, True, 1)

In the storage level above, the five positional fields are:

  1. useDisk = True
  2. useMemory = True
  3. useOffHeap = False
  4. deserialized = True
  5. replication = 1

Persisting a dataframe

The following are equivalent in Spark 2.1+:

  • df.persist()

  • df.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)

  • df.cache()


Caching a table

df.createOrReplaceTempView('df')
spark.catalog.isCached(tableName='df')
False
spark.catalog.cacheTable('df')
spark.catalog.isCached(tableName='df')
True

Uncaching a table

spark.catalog.uncacheTable('df')
spark.catalog.isCached(tableName='df')
False
spark.catalog.clearCache()

Tips

  • Caching is lazy
  • Only cache if more than one operation is to be performed
  • Unpersist when you no longer need the object
  • Cache selectively

Let's practice
