PySpark at scale

Introduction to PySpark

Benjamin Schmidt

Data Engineer

Leveraging scale

  • PySpark works effectively with gigabytes and terabytes of data
  • With PySpark, the goal is fast, efficient processing
  • Understanding how PySpark executes queries unlocks further efficiencies
  • Use broadcast() to copy a small DataFrame to every node in the cluster, avoiding a shuffle
from pyspark.sql.functions import broadcast

joined_df = large_df.join(broadcast(small_df),
                          on="key_column", how="inner")
joined_df.show()

Execution plans

# Using explain() to view the execution plan
df.filter(df.Age > 40).select("Name").explain()
== Physical Plan ==
*(1) Filter (isnotnull(Age) AND (Age > 40))
+- Scan ExistingRDD[Name:String, Age:Int]
1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.explain.html

Caching and persisting DataFrames

  • Caching: stores data in memory for faster access; suited to smaller datasets
  • Persisting: stores data at configurable storage levels (memory, disk, or both); suited to larger datasets
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

# Cache the DataFrame
df.cache()

# Perform multiple operations on the cached DataFrame
df.filter(df["column1"] > 50).show()
df.groupBy("column2").count().show()

Persisting DataFrames with different storage levels

# Persist the DataFrame with storage level
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)

# Perform transformations
result = df.groupBy("column3").agg({"column4": "sum"})
result.show()

# Unpersist after use
df.unpersist()

Optimizing PySpark

  • Work on small subsets: the more data an operation touches, the slower it runs; filter early and prefer narrow transformations like map() over shuffle-heavy ones like groupBy()
  • Broadcast joins: broadcasting a small DataFrame to every node lets the join run without shuffling the large DataFrame
  • Avoid repeated actions: rerunning actions on the same data costs time and compute without any benefit; cache results you reuse

Let's practice!

