Data Aggregation

Transform and Analyze Data with Microsoft Fabric

Luis Silva

Solution Architect - Data & AI

When you should aggregate data?

Condense the number of rows in a dataset, using an aggregation function to produce summaries.
- Count
- Summarize
- Average
- Maximum
- Minimum

When you should aggregate data?

Condense the number of rows in a dataset, using an aggregation function to produce summaries.
- Count
- Summarize
- Average
- Maximum
- Minimum

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

Tools for implementing aggregation of data

Icons representing three tools: SQL, Spark and Dataflows

Aggregating data with SQL

Common SQL aggregation functions:
- SUM()
- COUNT()
- AVG()
- MIN()
- MAX()
Usually used in combination with GROUP BY
Statistical functions
- STDEV()
- VAR()

SELECT
  <unaggregated columns>,
  function(<aggregated column>)
FROM 
  <table>
GROUP BY 
  <unaggregated columns>;

Aggregating data with SQL

SELECT 
  [State], 
  COUNT([Order_ID]) AS [Num Orders], 
  SUM([Order_Amount]) AS [Total Amount]
FROM 
  [tbl_Orders]
GROUP BY 
  [State]

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

Aggregating data with Spark

Common PySpark aggregation functions:
- sum()
- count()
- avg()
- min() and max()
- first() and last()
Statistical functions
- stdev()
- variance()
Used with groupBy() and agg()

df.groupBy(<unaggregated columns>)
.agg(function(<aggregated column>))

Aggregating data with Spark

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

from pyspark.sql.functions import sum

df.groupBy("state").agg(count("order_id"), sum("order_amount")).show()

Aggregating data with Spark

Aggregation functions must be imported from pyspark.sql.functions by including a statement at the star of your code.

#----- Import one or multiple functions:
from pyspark.sql.functions import sum, avg, count, min, max

#----- Import all SQL functions:
from pyspark.sql.functions import * 

#----- Import all SQL functions with an alias:
import pyspark.sql.functions as F
# call sum: F.sum()

Aggregating data with Dataflows

Group by Transform
- Sum
- Average
- Median
- Min
- Max
- Percentile
- Count rows

Screenshot of the Group by dialog in Dataflows

Aggregating data with Dataflows

Screenshot of the Group by dialog in a Dataflow, specifying that the data should be grouped by state and with two aggregations, count of rows and sum of the order amount column

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

Let's practice!

Transform and Analyze Data with Microsoft Fabric