Performance improvements

Cleaning Data with PySpark

Mike Metzger

Data Engineering Consultant

Explaining the Spark execution plan

voter_df = df.select(df['VOTER NAME']).distinct()
voter_df.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[VOTER NAME#15], functions=[])
+- Exchange hashpartitioning(VOTER NAME#15, 200)
   +- *(1) HashAggregate(keys=[VOTER NAME#15], functions=[])
      +- *(1) FileScan csv [VOTER NAME#15] Batched: false, Format: CSV,
         Location: InMemoryFileIndex[file:/DallasCouncilVotes.csv.gz],
         PartitionFilters: [], PushedFilters: [],
         ReadSchema: struct<VOTER NAME:string>

Read the plan bottom-up: the CSV file is scanned, each partition is partially aggregated, the Exchange step shuffles rows between workers, and the final HashAggregate produces the distinct names.

What is shuffling?

Shuffling is the process of moving data between workers (often across machines) to complete a task. Shuffling:

  • Is handled transparently by Spark, hiding its complexity from the user
  • Can be slow to complete
  • Lowers the overall throughput of the cluster
  • Is often necessary, but should be minimized where possible (see the sketch below)
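A minimal sketch of where a shuffle appears, assuming a SparkSession named spark and the voting data from the previous slide: a plain .select() is a narrow transformation and shows no Exchange step, while .distinct() must compare rows across partitions and does.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
df = spark.read.csv("DallasCouncilVotes.csv.gz", header=True)

# Narrow transformation: each partition is processed independently, no Exchange
df.select(df['VOTER NAME']).explain()

# .distinct() compares rows across partitions, so the plan
# contains an Exchange (the shuffle) between two HashAggregates
df.select(df['VOTER NAME']).distinct().explain()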

How to limit shuffling?

  • Limit use of .repartition(num_partitions)
    • Use .coalesce(num_partitions) instead when reducing the partition count (see the sketch below)
  • Use care when calling .join(), as joins between large DataFrames trigger shuffles
  • Use .broadcast() to avoid shuffling small DataFrames in joins
  • You may not need to limit shuffling at all; optimize only when it is a real bottleneck
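A minimal sketch of the repartition vs. coalesce distinction, assuming the df defined earlier and a target of 4 partitions (an arbitrary example value):

# .repartition() always performs a full shuffle to redistribute the data
repartitioned = df.repartition(4)

# .coalesce() merges existing partitions without a full shuffle;
# it can only reduce the partition count, never increase it
coalesced = df.coalesce(4)

print(coalesced.rdd.getNumPartitions())  # 4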

Broadcasting

Broadcasting:

  • Provides a copy of an object to each worker
  • Prevents undue / excess communication between nodes
  • Can drastically speed up .join() operations

Use the broadcast(<DataFrame>) function from pyspark.sql.functions

from pyspark.sql.functions import broadcast

# df_2 (the smaller DataFrame) is copied to every worker, avoiding a shuffle of df_1
combined_df = df_1.join(broadcast(df_2))
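As a quick check (a sketch; the join key column name 'id' is hypothetical), the physical plan should now contain a BroadcastHashJoin rather than a shuffle-based SortMergeJoin:

combined_df = df_1.join(broadcast(df_2), on='id')  # 'id' is a hypothetical key
combined_df.explain()  # look for BroadcastHashJoin in the physical plan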

Let's practice!
