Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Calling .explain() on a DataFrame prints the physical plan Spark will use to compute it:

voter_df = df.select(df['VOTER NAME']).distinct()
voter_df.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[VOTER NAME#15], functions=[])
+- Exchange hashpartitioning(VOTER NAME#15, 200)
   +- *(1) HashAggregate(keys=[VOTER NAME#15], functions=[])
      +- *(1) FileScan csv [VOTER NAME#15] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/DallasCouncilVotes.csv.gz], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<VOTER NAME:string>
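For more detail, .explain() also accepts an extended flag, which prints the parsed, analyzed, and optimized logical plans in addition to the physical plan:

voter_df.explain(True)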
The Exchange step in the plan above is a shuffle. Shuffling refers to moving data between workers to complete a task; it is often necessary (as with .distinct() here) but expensive, so it is worth limiting where possible.
To limit shuffling:
- Limit use of .repartition(num_partitions); use .coalesce(num_partitions) instead, which merges existing partitions rather than performing a full shuffle (see the sketch after this list)
- Use care when calling .join()
- Use .broadcast() on the smaller side of a join
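As an illustration, a minimal sketch of the repartition/coalesce difference, reusing voter_df from the earlier example (the partition counts here are arbitrary assumptions):

# .repartition() performs a full shuffle to reach the requested partition count
repartitioned = voter_df.repartition(400)

# .coalesce() merges existing partitions without a full shuffle;
# it can only reduce the partition count, not increase it
coalesced = voter_df.coalesce(50)
print(coalesced.rdd.getNumPartitions())  # 50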
Broadcasting:
- Ships a full copy of a DataFrame to every worker, which can drastically speed up .join() operations against a small DataFrame
- Use the .broadcast(<DataFrame>) method
from pyspark.sql.functions import broadcast

# Broadcast the smaller DataFrame (df_2) so each worker holds a full copy
combined_df = df_1.join(broadcast(df_2))
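To verify the broadcast took effect, the plan can be inspected again; a broadcast join appears as a BroadcastHashJoin step in the physical plan rather than a shuffle-based join:

combined_df.explain()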