Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
voter_df = df.select(df['VOTER NAME']).distinct()
voter_df.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[VOTER NAME#15], functions=[])
+- Exchange hashpartitioning(VOTER NAME#15, 200)
+- *(1) HashAggregate(keys=[VOTER NAME#15], functions=[])
+- *(1) FileScan csv [VOTER NAME#15] Batched: false, Format: CSV, Location:
InMemoryFileIndex[file:/DallasCouncilVotes.csv.gz],
PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<VOTER NAME:string>
Shuffling is the movement of data between workers to complete a task
Operations that can trigger or limit shuffling:
- .repartition(num_partitions)
- .coalesce(num_partitions)
- .join()
- .broadcast()

Broadcasting:
Can drastically speed up .join() operations
Use .broadcast(<DataFrame>)
from pyspark.sql.functions import broadcast
combined_df = df_1.join(broadcast(df_2))