Production pipelines with workflows

Data Transformation with Spark SQL in Databricks

Disha Mukherjee

Lead Data Engineer

Why Delta Lake?

A secure digital vault with organized data tables and a protective shield

  • ACID transactions → roll back failed writes
  • Schema enforcement → block mismatched types
  • Versioning → query any previous state
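
As a quick sketch of the versioning point, Delta's time travel syntax lets us query an earlier snapshot of a table. This assumes a Delta table named transactions_clean, which is created on the next slide.

# List the table's commit history: one row per write, each with a version number
spark.sql("DESCRIBE HISTORY transactions_clean").show()

# Query the table as it looked at version 0 (its first write)
df_v0 = spark.sql("SELECT * FROM transactions_clean VERSION AS OF 0")
print(df_v0.count())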

Writing to Delta

df_valid.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("transactions_clean")

print(f"Rows written: {df_valid.count():,}")
Rows written: 33,223

  • A new Delta table appears in Unity Catalog
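
As a minimal check (not part of the original pipeline), we can read the new table back by name to confirm it landed in Unity Catalog:

# Read the freshly written Delta table back and inspect it
df_check = spark.table("transactions_clean")
df_check.printSchema()
print(f"Rows read back: {df_check.count():,}")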

Notebook tasks

task1_ingest: load and clean

  • Loads CSV → applies cleaning → writes a Delta table
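
A minimal sketch of the task1_ingest notebook, assuming FILE_PATH and schema are defined as earlier in the course and that "cleaning" means dropping rows with missing values (the exact filters come from earlier lessons):

# task1_ingest (sketch): load raw CSV, clean, write a Delta table
df_raw = spark.read.format("csv").schema(schema).load(FILE_PATH)

df_valid = df_raw.na.drop()  # assumed cleaning step

(df_valid.write.format("delta")
         .mode("overwrite")
         .saveAsTable("transactions_clean"))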

Notebook tasks

task2_metrics: revenue by category

  • Reads cleaned table → computes metrics → writes to a new table
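
A sketch of task2_metrics. The Category column appears later in this chapter; the Amount column and the output table name category_revenue are assumptions for illustration:

from pyspark.sql import functions as F

# task2_metrics (sketch): read the cleaned table and aggregate revenue by category
df_clean = spark.table("transactions_clean")

category_revenue = (df_clean.groupBy("Category")
                    .agg(F.sum("Amount").alias("total_revenue")))  # Amount is an assumed column

(category_revenue.write.format("delta")
                 .mode("overwrite")
                 .saveAsTable("category_revenue"))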

Notebook tasks

task3_customers: rank by spend

  • Reads cleaned table → ranks customers → saves to a new table
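
A sketch of task3_customers using a window function to rank customers by total spend. Customer_ID matches the column named later in this chapter; Amount and the output table customer_ranking are assumptions:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# task3_customers (sketch): total spend per customer, ranked highest first
df_clean = spark.table("transactions_clean")

customer_spend = (df_clean.groupBy("Customer_ID")
                  .agg(F.sum("Amount").alias("total_spend")))  # Amount is an assumed column

ranked = customer_spend.withColumn(
    "spend_rank", F.rank().over(Window.orderBy(F.desc("total_spend"))))

(ranked.write.format("delta")
       .mode("overwrite")
       .saveAsTable("customer_ranking"))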

Creating the job

Jobs and Pipelines UI showing completed three-task job DAG with dependency arrows from task1_ingest to task2_metrics to task3_customers

  • Jobs can be run manually, on a schedule, or via triggers
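
The course creates this job in the Jobs and Pipelines UI. As a sketch of an alternative route, the same three-task chain can be defined with the Databricks Python SDK; the notebook paths and cluster id below are placeholders, not values from the course.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Three notebook tasks chained with depends_on, mirroring the DAG shown in the UI
job = w.jobs.create(
    name="transactions_pipeline",
    tasks=[
        jobs.Task(
            task_key="task1_ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/task1_ingest"),
            existing_cluster_id="<cluster-id>",  # placeholder
        ),
        jobs.Task(
            task_key="task2_metrics",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/task2_metrics"),
            depends_on=[jobs.TaskDependency(task_key="task1_ingest")],
            existing_cluster_id="<cluster-id>",
        ),
        jobs.Task(
            task_key="task3_customers",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/task3_customers"),
            depends_on=[jobs.TaskDependency(task_key="task2_metrics")],
            existing_cluster_id="<cluster-id>",
        ),
    ],
)
print(f"Created job {job.job_id}")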

Running the job

Job run DAG with task1_ingest and task2_metrics succeeded green and task3_customers failed red


Running the job

Checking an error in task three

  • Error: the notebook references customer-id, but the table column is named Customer_ID
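
The failure comes down to a single column reference. A hypothetical before/after of the offending line (Amount is an assumed column):

from pyspark.sql import functions as F

df_clean = spark.table("transactions_clean")

# Fails: the table schema has no column named "customer-id"
# df_clean.groupBy("customer-id").agg(F.sum("Amount"))

# Works: the schema spells it Customer_ID
df_clean.groupBy("Customer_ID").agg(F.sum("Amount"))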

Running the job

Graph view with all three tasks succeeding

A timeline view of the same run


What is Lakeflow?

Comparison: Imperative (Jobs) vs. Declarative (Lakeflow)

  • Jobs → we manage each step ourselves
  • Lakeflow → we declare what each table should contain
  • Databricks handles ordering, retries, and compute

The @dlt.table pattern

@dlt.table(name="transactions_bronze")
def transactions_bronze():
    return spark.read.format("csv").schema(schema).load(FILE_PATH)

@dlt.table(name="transactions_silver")
def transactions_silver():
    return dlt.read("transactions_bronze").na.drop(...).filter(...)

@dlt.table(name="category_revenue_gold")
def category_revenue_gold():
    return dlt.read("transactions_silver").groupBy("Category").agg(...)

Pipeline run

Lakeflow pipeline DAG showing transactions_bronze 100K rows then transactions_silver 33K rows then category_revenue_gold 6 rows all green materialized views

  • Bronze (100K rows) → Silver (33K rows) → Gold (6 rows)

Notebooks, Jobs, or Lakeflow?

A layered view: Notebooks → Databricks Jobs → Lakeflow Pipelines

  • Notebooks → exploring, prototyping
  • Databricks Jobs → multi-step, scheduled pipelines
  • Lakeflow → fully managed, declarative

Let's practice!
