Data Aggregation

Transform and Analyze Data with Microsoft Fabric

Luis Silva

Solution Architect - Data & AI

When you should aggregate data?

  • Condense the number of rows in a dataset, using an aggregation function to produce summaries.
    • Count
    • Summarize
    • Average
    • Maximum
    • Minimum
Transform and Analyze Data with Microsoft Fabric

When you should aggregate data?

  • Condense the number of rows in a dataset, using an aggregation function to produce summaries.
    • Count
    • Summarize
    • Average
    • Maximum
    • Minimum

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

Transform and Analyze Data with Microsoft Fabric

Tools for implementing aggregation of data

 

 

Icons representing three tools: SQL, Spark and Dataflows

Transform and Analyze Data with Microsoft Fabric

Aggregating data with SQL

  • Common SQL aggregation functions:
    • SUM()
    • COUNT()
    • AVG()
    • MIN()
    • MAX()
  • Usually used in combination with GROUP BY
  • Statistical functions
    • STDEV()
    • VAR()
SELECT
  <unaggregated columns>,
  function(<aggregated column>)
FROM 
  <table>
GROUP BY 
  <unaggregated columns>;
Transform and Analyze Data with Microsoft Fabric

Aggregating data with SQL

SELECT 
  [State], 
  COUNT([Order_ID]) AS [Num Orders], 
  SUM([Order_Amount]) AS [Total Amount]
FROM 
  [tbl_Orders]
GROUP BY 
  [State]

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

Transform and Analyze Data with Microsoft Fabric

Aggregating data with Spark

  • Common PySpark aggregation functions:
    • sum()
    • count()
    • avg()
    • min() and max()
    • first() and last()
  • Statistical functions
    • stdev()
    • variance()
  • Used with groupBy() and agg()
df.groupBy(<unaggregated columns>)
.agg(function(<aggregated column>))

Transform and Analyze Data with Microsoft Fabric

Aggregating data with Spark

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

from pyspark.sql.functions import sum

df.groupBy("state").agg(count("order_id"), sum("order_amount")).show()
Transform and Analyze Data with Microsoft Fabric

Aggregating data with Spark

  • Aggregation functions must be imported from pyspark.sql.functions by including a statement at the star of your code.
#----- Import one or multiple functions:
from pyspark.sql.functions import sum, avg, count, min, max

#----- Import all SQL functions:
from pyspark.sql.functions import * 

#----- Import all SQL functions with an alias:
import pyspark.sql.functions as F
# call sum: F.sum()
Transform and Analyze Data with Microsoft Fabric

Aggregating data with Dataflows

  • Group by Transform
    • Sum
    • Average
    • Median
    • Min
    • Max
    • Percentile
    • Count rows

Screenshot of the Group by dialog in Dataflows

Transform and Analyze Data with Microsoft Fabric

Aggregating data with Dataflows

Screenshot of the Group by dialog in a Dataflow, specifying that the data should be grouped by state and with two aggregations, count of rows and sum of the order amount column

Table with sales orders per state aggregated into a new table with the total number of orders and total sales amount per state

Transform and Analyze Data with Microsoft Fabric

Let's practice!

Transform and Analyze Data with Microsoft Fabric

Preparing Video For Download...