Caching

Introduction to Spark SQL in Python

Mark Plutowski

Data Scientist

What is caching?

Keeping data in memory
Spark tends to unload memory aggressively

Eviction policy

Least Recently Used (LRU)
Eviction happens independently on each worker
Depends on the memory available to each worker
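
The LRU idea can be illustrated with a small pure-Python sketch. This is not Spark's implementation: the `LRUCache` class and its `capacity` parameter are hypothetical names, used only to show which entry is dropped when space runs out.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache (illustration only, not Spark code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        # Reading an entry marks it as most recently used.
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        # Writing an entry also marks it as most recently used.
        self._items[key] = value
        self._items.move_to_end(key)
        # When over capacity, evict the least recently used entry.
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def keys(self):
        # Keys ordered from least to most recently used.
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now more recently used than "b"
cache.put("c", 3)   # over capacity: "b" is evicted
remaining = cache.keys()
print(remaining)
```

On a cluster, each worker would apply this kind of bookkeeping to its own memory, so the same dataframe can be evicted on one worker and still cached on another.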

Caching a dataframe

To cache a dataframe:

df.cache()

To uncache it:

df.unpersist()

Determining whether a dataframe is cached

df.is_cached

False

df.cache()
df.is_cached

True

Uncaching a dataframe

df.unpersist()
df.is_cached

False

Storage level

df.unpersist()
df.cache()
df.storageLevel

StorageLevel(True, True, False, True, 1)

In the storage level above, the following hold:

useDisk = True
useMemory = True
useOffHeap = False
deserialized = True
replication = 1

Persisting a dataframe

The following are equivalent in Spark 2.1+:

df.persist()
df.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)
df.cache() is the same as df.persist()

Caching a table

df.createOrReplaceTempView('df')
spark.catalog.isCached(tableName='df')

False

spark.catalog.cacheTable('df')
spark.catalog.isCached(tableName='df')

True

Uncaching a table

spark.catalog.uncacheTable('df')
spark.catalog.isCached(tableName='df')

False

spark.catalog.clearCache()

Tips

Caching is lazy
Only cache if more than one operation is to be performed
Unpersist when you no longer need the object
Cache selectively

Let's practice!

