Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Caching in Spark
When developing Spark tasks:
Call .cache() on the DataFrame before an action
from pyspark.sql.functions import monotonically_increasing_id

# .cache() is lazy; the .count() action materializes the cache
voter_df = spark.read.csv('voter_data.txt.gz')
voter_df.cache().count()

# Cache the transformed DataFrame before the next action
voter_df = voter_df.withColumn('ID', monotonically_increasing_id())
voter_df = voter_df.cache()
voter_df.show()
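To gauge what caching buys you, time the same action on the cached DataFrame and on a fresh, uncached read. This is a minimal sketch continuing the example above; the time.time() numbers are illustrative and will vary with your cluster and data size:

import time

start = time.time()
voter_df.count()  # served from the in-memory cache
print('Cached count: %.3fs' % (time.time() - start))

start = time.time()
spark.read.csv('voter_data.txt.gz').count()  # re-reads from disk
print('Uncached count: %.3fs' % (time.time() - start))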
Check .is_cached to determine the cache status
print(voter_df.is_cached)
# True
Call .unpersist() when finished with the DataFrame
voter_df.unpersist()
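Once the DataFrame is unpersisted, .is_cached reports False, confirming the cached data has been released (a quick check continuing the example above):

print(voter_df.is_cached)
# False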