Transforming data

Introduction to Databricks SQL

Kevin Barlow

Data Manager

Motivation

Lakehouse Flow - Ingestion

Introduction to Databricks SQL

Transformation in the lakehouse

Lakeflow Diagram - Transformation

Introduction to Databricks SQL

Cleaning and transforming data

  • Cleaning data in the Bronze (raw) layer into the Silver (analytics-ready) layer
  • Important step for downstream data tables
  • Common activities
    • Removing NULL values
    • Standardize values
    • Adjusting data types
CREATE TABLE silver_layer AS (
SELECT DISTINCT c.id, 
    c.last_name || ', ' || c.first_name 
        AS name, 
    format(s.date,'YYYY-mm-dd') 
        AS sale_date, 
    round(s.price, 2) 
        AS sale_price
    s.item_name
FROM sales_data s
LEFT JOIN contacts c on c.id = s.id)
Introduction to Databricks SQL

Aggregating data

  • Combining and simplifying data from Silver layer into Gold (BI-ready) layer
  • Meant for a specific business intelligence need
    • Great candidate for views
  • Common activities
    • Removing extraneous columns
    • Aggregating across dimensions
    • Calculating metrics / KPIs
CREATE VIEW q3_revenue AS (
SELECT sum(revenue) AS total_rev,
  count(*) AS total_count,
  total_rev / total_count AS avg_sale,
  category,
  item
FROM silver_layer
WHERE date BETWEEN '2024-07-01' 
  AND '2024-09-30'
GROUP BY category, item)
Introduction to Databricks SQL

Automating tasks

Workflows GIF

Introduction to Databricks SQL

Let's practice!

Introduction to Databricks SQL

Preparing Video For Download...