Query optimization strategies

Introduction to BigQuery

Matt Forrest

Field CTO

Three rules of thumb

There are three main optimization rules:

Reduce the amount of data that needs to be processed
Optimize the query operations
Reduce the output size of your query

¹ https://cloud.google.com/bigquery/docs/best-practices-performance-compute#use-bi-engine

Reducing the amount of data

Avoid using SELECT *, and only select the columns of data we need
Filter the amount of data in our CTEs early and often
Filter our data using the WHERE clause early and often

Optimizing joins

Make sure we reduce the data we need using CTEs.
Join using an INT64 data type.

WITH filter_my_data AS (SELECT 
-- Filter data with 
-- WHERE in the CTE first
)
SELECT 
-- This query will run 
-- faster with less data
JOIN a USING (user_id)

Optimizing the WHERE clause

In BigQuery, use
- BOOL
- INT
- FLOAT
- DATE
Data types with WHERE, STRING, or BYTE are not optimal.

Not optimal

SELECT user_id, date_ordered
FROM dataset.table
WHERE product = 'shoes'

Optimal

SELECT user_id, date_ordered
FROM dataset.table
WHERE product_id = 1234

ORDER BY optimizations

ORDER BY should always be at the outermost (end) of our query
The only exception to this is using ORDER BY within a window clause

ORDER BY without optimization

Not optimal

WITH order_total AS (SELECT
user_id,           
sum(product_price) as order_sum
FROM orders
GROUP BY user_id   
-- Order by is not at the end of the query
ORDER BY last_purchase_date
)
SELECT order_total.order_sum, 
users.user_name
FROM dataset.users users
JOIN order_total USING (user_id);

ORDER BY with optimization

Optimal

WITH order_total AS (SELECT
user_id,    
last_purchase_date
sum(product_price) as order_sum
GROUP BY user_id         
)
SELECT order_total.order_sum, 
users.user_name
FROM dataset.users users
JOIN a USING (user_id)
-- Order by should always be at the end
ORDER BY orders_total.last_purchase_date;

Using EXISTS vs. COUNT

If we only need to know if a record is in the table, using EXISTS
Avoid using COUNT to solve this use case

SELECT EXISTS (

  -- Write the main query as a subquery within the exists call
  SELECT
    user_id
  FROM
    dataset.table
  WHERE
    product_category = 'home_goods'
    AND status = 'Closed Account'
);

Other optimization methods

Use approximate aggregate functions such as APPROX_TOP_SUM or APPROX_COUNT_DISTINCT.
Many BigQuery tables are partitioned by date - include dates in the WHERE clause.

Let's practice!

Introduction to BigQuery