PySpark aggregations

Introduction to PySpark

Benjamin Schmidt

Data Engineer

PySpark SQL aggregations overview

  • Common SQL aggregations work with spark.sql()
    # SQL aggregation query
    spark.sql("""
      SELECT Department, SUM(Salary) AS Total_Salary, AVG(Salary) AS Average_Salary
      FROM employees
      GROUP BY Department
    """).show()
    

Combining DataFrame and SQL operations

# Filter salaries over 3000
filtered_df = df.filter(df.Salary > 3000)

# Register filtered DataFrame as a view
filtered_df.createOrReplaceTempView("filtered_employees")

# Aggregate using SQL on the filtered view
spark.sql("""
  SELECT Department, COUNT(*) AS Employee_Count
  FROM filtered_employees
  GROUP BY Department
""").show()
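
The same count can be expressed directly with the DataFrame API, skipping the temporary view entirely; a sketch using the filtered_df from above:

# Equivalent aggregation with the DataFrame API on the filtered data
filtered_df.groupBy("Department").count() \
    .withColumnRenamed("count", "Employee_Count") \
    .show()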

Handling data types in aggregations

# Example of type casting
data = [("HR", "3000"), ("IT", "4000"), ("Finance", "3500")]
columns = ["Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)

# Convert Salary column to integer
df = df.withColumn("Salary", df["Salary"].cast("int"))

# Perform aggregation
df.groupBy("Department").sum("Salary").show()
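
The cast can also be written with pyspark.sql.functions.col, and printSchema() is a quick way to confirm the types before aggregating; a brief sketch:

from pyspark.sql.functions import col

# Equivalent cast using col() instead of bracket indexing
df = df.withColumn("Salary", col("Salary").cast("int"))

# Verify that Salary is now an integer column
df.printSchema()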

RDDs for aggregations

# Convert the DataFrame to an RDD of (Department, Salary) pairs
rdd = df.rdd.map(lambda row: (row["Department"], row["Salary"]))

# Sum salaries per department with reduceByKey
rdd_aggregated = rdd.reduceByKey(lambda x, y: x + y)
print(rdd_aggregated.collect())
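
The same per-key pattern extends to averages by carrying a (sum, count) pair through the reduce; a sketch built on the rdd above:

# Average salary per department: pair each salary with a count of 1,
# sum both fields per key, then divide
sums_counts = rdd.mapValues(lambda salary: (salary, 1)) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = sums_counts.mapValues(lambda sc: sc[0] / sc[1])
print(averages.collect())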

Best practices for PySpark aggregations

  • Filter early: Reduce data size before performing aggregations
  • Handle data types: Ensure data is clean and correctly typed
  • Minimize shuffles: Wide operations like groupBy() move data across the entire cluster, so apply them to already-reduced data
  • Choose the right interface: Prefer DataFrames for most tasks due to their optimizations
  • Monitor performance: Use explain() to inspect the execution plan and optimize accordingly (see the sketch below)
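
For the last point, explain() can be called on any DataFrame to print the plan Spark will run; a minimal sketch:

# Inspect the execution plan for an aggregation before triggering it
df.groupBy("Department").sum("Salary").explain()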

Key takeaways

  • PySpark SQL Aggregations: Functions like SUM() and AVG() for summarizing data
  • DataFrames and SQL: Combining both approaches for flexible data manipulation
  • Handling Data Types: Addressing issues with type mismatches during aggregations
  • RDDs vs DataFrames: Understanding the trade-offs and choosing the right tool

Let's practice!
