PySpark aggregations

Introduction to PySpark

Benjamin Schmidt

Data Engineer

PySpark SQL aggregations overview

  • Common SQL aggregations work with spark.sql()
    # SQL aggregation query
    spark.sql("""
      SELECT Department, SUM(Salary) AS Total_Salary, AVG(Salary) AS Average_Salary
      FROM employees
      GROUP BY Department
    """).show()
    

Combining DataFrame and SQL operations

# Filter salaries over 3000
filtered_df = df.filter(df.Salary > 3000)

# Register filtered DataFrame as a view
filtered_df.createOrReplaceTempView("filtered_employees")

# Aggregate using SQL on the filtered view
spark.sql("""
  SELECT Department, COUNT(*) AS Employee_Count
  FROM filtered_employees
  GROUP BY Department
""").show()
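
The same count can be expressed directly with the DataFrame API, skipping the temporary view entirely; a sketch using the filtered_df from above:

# Equivalent aggregation with the DataFrame API on the filtered data
filtered_df.groupBy("Department").count() \
    .withColumnRenamed("count", "Employee_Count") \
    .show()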

Handling data types in aggregations

# Example of type casting
data = [("HR", "3000"), ("IT", "4000"), ("Finance", "3500")]
columns = ["Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)

# Convert Salary column to integer
df = df.withColumn("Salary", df["Salary"].cast("int"))

# Perform aggregation
df.groupBy("Department").sum("Salary").show()
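
The cast can also be written with pyspark.sql.functions.col, and printSchema() is a quick way to confirm the types before aggregating; a brief sketch:

from pyspark.sql.functions import col

# Equivalent cast using col() instead of bracket indexing
df = df.withColumn("Salary", col("Salary").cast("int"))

# Verify that Salary is now an integer column
df.printSchema()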

RDDs for aggregations

# Convert the DataFrame to an RDD of (Department, Salary) pairs
rdd = df.rdd.map(lambda row: (row["Department"], row["Salary"]))

# Sum salaries per department with reduceByKey
rdd_aggregated = rdd.reduceByKey(lambda x, y: x + y)
print(rdd_aggregated.collect())
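
The same per-key pattern extends to averages by carrying a (sum, count) pair through the reduce; a sketch built on the rdd above:

# Average salary per department: pair each salary with a count of 1,
# sum both fields per key, then divide
sums_counts = rdd.mapValues(lambda salary: (salary, 1)) \
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = sums_counts.mapValues(lambda sc: sc[0] / sc[1])
print(averages.collect())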

Best practices for PySpark aggregations

  • Filter early: Reduce data size before performing aggregations
  • Handle data types: Ensure data is clean and correctly typed
  • Minimize shuffles: Wide operations like groupBy() move data across the entire cluster, so apply them to already-reduced data
  • Choose the right interface: Prefer DataFrames for most tasks due to their optimizations
  • Monitor performance: Use explain() to inspect the execution plan and optimize accordingly (see the sketch below)
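
For the last point, explain() can be called on any DataFrame to print the plan Spark will run; a minimal sketch:

# Inspect the execution plan for an aggregation before triggering it
df.groupBy("Department").sum("Salary").explain()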

Key takeaways

  • PySpark SQL Aggregations: Functions like SUM() and AVG() for summarizing data
  • DataFrames and SQL: Combining both approaches for flexible data manipulation
  • Handling Data Types: Addressing issues with type mismatches during aggregations
  • RDDs vs DataFrames: Understanding the trade-offs and choosing the right tool

Let's practice!
