Query optimization strategies

Introduction to BigQuery

Matt Forrest

Field CTO

Three rules of thumb

There are three main optimization rules:

  1. Reduce the amount of data that needs to be processed
  2. Optimize the query operations
  3. Reduce the output size of your query
1 https://cloud.google.com/bigquery/docs/best-practices-performance-compute#use-bi-engine
Introduction to BigQuery

Reducing the amount of data

  • Avoid using SELECT *, and only select the columns of data we need
  • Filter the amount of data in our CTEs early and often
  • Filter our data using the WHERE clause early and often
Introduction to BigQuery

Optimizing joins

  • Make sure we reduce the data we need using CTEs.
  • Join using an INT64 data type.
WITH filter_my_data AS (SELECT 
-- Filter data with 
-- WHERE in the CTE first
)
SELECT 
-- This query will run 
-- faster with less data
JOIN a USING (user_id)
Introduction to BigQuery

Optimizing the WHERE clause

  • In BigQuery, use

    • BOOL
    • INT
    • FLOAT
    • DATE
  • Data types with WHERE, STRING, or BYTE are not optimal.

Not optimal

SELECT user_id, date_ordered
FROM dataset.table
WHERE product = 'shoes'

Optimal

SELECT user_id, date_ordered
FROM dataset.table
WHERE product_id = 1234
Introduction to BigQuery

ORDER BY optimizations

  • ORDER BY should always be at the outermost (end) of our query
  • The only exception to this is using ORDER BY within a window clause
Introduction to BigQuery

ORDER BY without optimization

Not optimal

WITH order_total AS (SELECT
user_id,           
sum(product_price) as order_sum
FROM orders
GROUP BY user_id   
-- Order by is not at the end of the query
ORDER BY last_purchase_date
)
SELECT order_total.order_sum, 
users.user_name
FROM dataset.users users
JOIN order_total USING (user_id);
Introduction to BigQuery

ORDER BY with optimization

Optimal

WITH order_total AS (SELECT
user_id,    
last_purchase_date
sum(product_price) as order_sum
GROUP BY user_id         
)
SELECT order_total.order_sum, 
users.user_name
FROM dataset.users users
JOIN a USING (user_id)
-- Order by should always be at the end
ORDER BY orders_total.last_purchase_date;
Introduction to BigQuery

Using EXISTS vs. COUNT

  • If we only need to know if a record is in the table, using EXISTS
  • Avoid using COUNT to solve this use case
SELECT EXISTS (

  -- Write the main query as a subquery within the exists call
  SELECT
    user_id
  FROM
    dataset.table
  WHERE
    product_category = 'home_goods'
    AND status = 'Closed Account'
);
Introduction to BigQuery

Other optimization methods

  • Use approximate aggregate functions such as APPROX_TOP_SUM or APPROX_COUNT_DISTINCT.
  • Many BigQuery tables are partitioned by date - include dates in the WHERE clause.
Introduction to BigQuery

Let's practice!

Introduction to BigQuery

Preparing Video For Download...