Special aggregations in BigQuery

Introduction to BigQuery

Matthew Forrest

Field CTO

Introduction to special aggregates

Function	Category	Description
`APPROX_COUNT_DISTINCT`, `APPROX_QUANTILES`, `APPROX_TOP_COUNT`, `APPROX_TOP_SUM`	Approximate Aggregation	Provide estimates for certain calculations, reducing processing time and resource consumption.
`ARRAY_CONCAT_AGG`, `STRING_AGG`	Array and String Manipulation	Collect, concatenate, and manipulate arrays and strings.
`LOGICAL_AND`, `LOGICAL_OR`	Logical Operations	Evaluate logical AND and OR operations on a set of boolean expressions.

ARRAY_CONCAT_AGG

ARRAY_CONCAT_AGG() addresses the limitations of ARRAY_AGG()

SELECT
  order_id,
  ARRAY_CONCAT_AGG(order_items) AS all_items
FROM sales_data
GROUP BY order_id;

| order_id | all_items                          |
|----------|------------------------------------|
| 1        | [shoes, electronics]               |
| 2        | [electronics, household, clothing] |

STRING_AGG

STRING_AGG() concatenates strings into a single string

SELECT STRING_AGG(customer_id, ', ') AS all_customers
FROM sales_data
WHERE delivery_date 
BETWEEN CURRENT_TIMESTAMP() 
AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY) ;

| all_customers   |
|-----------------|
| 123, 456, 789   |

APPROX_COUNT_DISTINCT

APPROX_COUNT_DISTINCT() gives quick estimation of the count of distinct values

SELECT
  customer_id,
  APPROX_COUNT_DISTINCT(order_id) AS unique_orders
FROM sales_data
GROUP BY customer_id;

| customer_id | unique_orders |
|-------------|---------------|
| 1           | 5             |
| 2           | 2             |

APPROX_QUANTILES

APPROX_QUANTILES() gives approximate quantile values

SELECT
  category,
  APPROX_QUANTILES(value, 4) AS quartiles
FROM sales_data
GROUP BY category;

| category    | quartiles         |
|-------------|-------------------|
| shoes       | [25, 40, 55, 100] |
| electronics | [10, 40, 95, 300] |

APPROX_TOP_COUNT

APPROX_TOP_COUNT() identifies the top K elements based on their occurrence

SELECT
  category,
  APPROX_TOP_COUNT(customer_id, 3) AS customers
FROM sales_data
GROUP BY category;

| category    | customers   |
|-------------|-------------|
| shoes       | [1, 7, 19]  |
| electronics | [8, 19, 22] |

APPROX_TOP_SUM

APPROX_TOP_SUM(el, weight, K) finds the top K elements el based on weight

SELECT
  seller_id,
  APPROX_TOP_SUM(item_id, cost, 3) AS top_items
FROM sales_data
GROUP BY seller_id;

| seller_id | top_items                                    |
|-----------|----------------------------------------------|
| 1         | [[1024, 5000], [1567, 3000], [3489, 2000]]   |
| 2         | [[5647, 6500], [9867, 3500], [1074, 2100]]   |

LOGICAL_AND and LOGICAL_OR

SELECT
  customer_id,
  -- true if ALL are true
  LOGICAL_AND(order_status = 'shipped') AS all_shipped,
  -- true if at least one is true
  LOGICAL_OR(order_status = 'shipped') AS one_shipped
FROM sales_data
GROUP BY customer_id;

| customer_id | all_shipped | one_shipped |
|-------------|-------------|-------------|
| 1           | false       | true        |
| 2           | true        | true        |

Let's practice!

Introduction to BigQuery