PySpark at scale

Introduction to PySpark

Benjamin Schmidt

Data Engineer

Leveraging scale

  • PySpark works effectively with gigabytes and terabytes of data
  • With PySpark, the goal is fast, efficient processing
  • Understanding how PySpark executes queries unlocks further efficiencies
  • Use broadcast() to copy a small DataFrame to every node in the cluster, avoiding a shuffle
from pyspark.sql.functions import broadcast

joined_df = large_df.join(broadcast(small_df),
                          on="key_column", how="inner")
joined_df.show()

Execution plans

# Using explain() to view the execution plan
df.filter(df.Age > 40).select("Name").explain()
== Physical Plan ==
*(1) Filter (isnotnull(Age) AND (Age > 40))
+- Scan ExistingRDD[Name:String, Age:Int]
1 https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.explain.html

Caching and persisting DataFrames

  • Caching: stores data in memory for faster access; suited to smaller datasets
  • Persisting: stores data at configurable storage levels (memory, disk, or both); suited to larger datasets
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

# Cache the DataFrame
df.cache()

# Perform multiple operations on the cached DataFrame
df.filter(df["column1"] > 50).show()
df.groupBy("column2").count().show()

Persisting DataFrames with different storage levels

# Persist the DataFrame with storage level
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)

# Perform transformations
result = df.groupBy("column3").agg({"column4": "sum"})
result.show()

# Unpersist after use
df.unpersist()

Optimizing PySpark

  • Work on small subsets: the more data an operation touches, the slower it runs; filter early and prefer narrow transformations like map() over shuffle-heavy ones like groupBy()
  • Broadcast joins: broadcasting a small DataFrame to every node lets the join run without shuffling the large DataFrame
  • Avoid repeated actions: rerunning actions on the same data costs time and compute without any benefit; cache results you reuse

Let's practice!

