Performance improvements

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Explaining the Spark execution plan

voter_df = df.select(df['VOTER NAME']).distinct()
voter_df.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[VOTER NAME#15], functions=[])
+- Exchange hashpartitioning(VOTER NAME#15, 200)
   +- *(1) HashAggregate(keys=[VOTER NAME#15], functions=[])
      +- *(1) FileScan csv [VOTER NAME#15] Batched: false, Format: CSV,
         Location: InMemoryFileIndex[file:/DallasCouncilVotes.csv.gz],
         PartitionFilters: [], PushedFilters: [],
         ReadSchema: struct<VOTER NAME:string>

Read the plan bottom-up: the CSV file is scanned, each partition is partially aggregated, the Exchange step shuffles rows between workers, and the final HashAggregate produces the distinct names.

What is shuffling?

Shuffling is the process of moving data between workers (often across machines) to complete a task. Shuffling:

  • Is handled transparently by Spark, hiding its complexity from the user
  • Can be slow to complete
  • Lowers the overall throughput of the cluster
  • Is often necessary, but should be minimized where possible (see the sketch below)
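A minimal sketch of where a shuffle appears, assuming a SparkSession named spark and the voting data from the previous slide: a plain .select() is a narrow transformation and shows no Exchange step, while .distinct() must compare rows across partitions and does.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
df = spark.read.csv("DallasCouncilVotes.csv.gz", header=True)

# Narrow transformation: each partition is processed independently, no Exchange
df.select(df['VOTER NAME']).explain()

# .distinct() compares rows across partitions, so the plan
# contains an Exchange (the shuffle) between two HashAggregates
df.select(df['VOTER NAME']).distinct().explain()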

How to limit shuffling?

  • Limit use of .repartition(num_partitions)
    • Use .coalesce(num_partitions) instead when reducing the partition count (see the sketch below)
  • Use care when calling .join(), as joins between large DataFrames trigger shuffles
  • Use .broadcast() to avoid shuffling small DataFrames in joins
  • You may not need to limit shuffling at all; optimize only when it is a real bottleneck
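A minimal sketch of the repartition vs. coalesce distinction, assuming the df defined earlier and a target of 4 partitions (an arbitrary example value):

# .repartition() always performs a full shuffle to redistribute the data
repartitioned = df.repartition(4)

# .coalesce() merges existing partitions without a full shuffle;
# it can only reduce the partition count, never increase it
coalesced = df.coalesce(4)

print(coalesced.rdd.getNumPartitions())  # 4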

Broadcasting

Broadcasting:

  • Provides a copy of an object to each worker
  • Prevents undue / excess communication between nodes
  • Can drastically speed up .join() operations

Use the broadcast(<DataFrame>) function from pyspark.sql.functions

from pyspark.sql.functions import broadcast

# df_2 (the smaller DataFrame) is copied to every worker, avoiding a shuffle of df_1
combined_df = df_1.join(broadcast(df_2))
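As a quick check (a sketch; the join key column name 'id' is hypothetical), the physical plan should now contain a BroadcastHashJoin rather than a shuffle-based SortMergeJoin:

combined_df = df_1.join(broadcast(df_2), on='id')  # 'id' is a hypothetical key
combined_df.explain()  # look for BroadcastHashJoin in the physical plan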

Let's practice!
