Common data engineering patterns

Introduction to Databricks SQL

Kevin Barlow

Data Manager

Motivation

Lakehouse Diagram - Transformation

Introduction to Databricks SQL

Handling incoming data

Incremental append

  • Add all new data to end of existing table
    INSERT INTO students 
    TABLE visiting_students;
    

Incremental Append Diagram

Change Data Capture (CDC)

  • Integrates data into existing table
    MERGE INTO target USING source
    ON target.key = source.key
    WHEN MATCHED THEN UPDATE SET *;
    

CDC Diagram

Introduction to Databricks SQL

Data optimizations

OPTIMIZE

  • Compact subset of data together
  • Reduces the "small file problem"

Z-ORDER

  • Similar to indexing for database systems
  • Co-locates related data to same files
  • Can reduce time to read data
> OPTIMIZE table_name;

> OPTIMIZE table_name 
    WHERE date >= '2024-01-01';

> OPTIMIZE table_name
    WHERE date >= current_timestamp() 
        - INTERVAL 1 day
    ZORDER BY (eventType);
Introduction to Databricks SQL

Let's practice!

Introduction to Databricks SQL

Preparing Video For Download...