Production pipelines with workflows

Data Transformation with Spark SQL in Databricks

Disha Mukherjee

Lead Data Engineer

Why Delta Lake?

A secure digital vault with organized data tables and a protective shield

  • ACID transactions → roll back failed writes
  • Schema enforcement → block mismatched types
  • Versioning → query any previous state
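
As a quick sketch of the versioning point, Delta's time travel syntax lets us query an earlier snapshot of a table. This assumes a Delta table named transactions_clean, which is created on the next slide.

# List the table's commit history: one row per write, each with a version number
spark.sql("DESCRIBE HISTORY transactions_clean").show()

# Query the table as it looked at version 0 (its first write)
df_v0 = spark.sql("SELECT * FROM transactions_clean VERSION AS OF 0")
print(df_v0.count())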

Writing to Delta

df_valid.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("transactions_clean")

print(f"Rows written: {df_valid.count():,}")
Rows written: 33,223

  • A new Delta table appears in Unity Catalog
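
As a minimal check (not part of the original pipeline), we can read the new table back by name to confirm it landed in Unity Catalog:

# Read the freshly written Delta table back and inspect it
df_check = spark.table("transactions_clean")
df_check.printSchema()
print(f"Rows read back: {df_check.count():,}")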

Notebook tasks

task1_ingest: load and clean

  • Loads CSV → applies cleaning → writes a Delta table
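
A minimal sketch of the task1_ingest notebook, assuming FILE_PATH and schema are defined as earlier in the course and that "cleaning" means dropping rows with missing values (the exact filters come from earlier lessons):

# task1_ingest (sketch): load raw CSV, clean, write a Delta table
df_raw = spark.read.format("csv").schema(schema).load(FILE_PATH)

df_valid = df_raw.na.drop()  # assumed cleaning step

(df_valid.write.format("delta")
         .mode("overwrite")
         .saveAsTable("transactions_clean"))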

Notebook tasks

task2_metrics: revenue by category

  • Reads cleaned table → computes metrics → writes to a new table
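
A sketch of task2_metrics. The Category column appears later in this chapter; the Amount column and the output table name category_revenue are assumptions for illustration:

from pyspark.sql import functions as F

# task2_metrics (sketch): read the cleaned table and aggregate revenue by category
df_clean = spark.table("transactions_clean")

category_revenue = (df_clean.groupBy("Category")
                    .agg(F.sum("Amount").alias("total_revenue")))  # Amount is an assumed column

(category_revenue.write.format("delta")
                 .mode("overwrite")
                 .saveAsTable("category_revenue"))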

Notebook tasks

task3_customers: rank by spend

  • Reads cleaned table → ranks customers → saves to a new table
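
A sketch of task3_customers using a window function to rank customers by total spend. Customer_ID matches the column named later in this chapter; Amount and the output table customer_ranking are assumptions:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# task3_customers (sketch): total spend per customer, ranked highest first
df_clean = spark.table("transactions_clean")

customer_spend = (df_clean.groupBy("Customer_ID")
                  .agg(F.sum("Amount").alias("total_spend")))  # Amount is an assumed column

ranked = customer_spend.withColumn(
    "spend_rank", F.rank().over(Window.orderBy(F.desc("total_spend"))))

(ranked.write.format("delta")
       .mode("overwrite")
       .saveAsTable("customer_ranking"))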

Creating the job

Jobs and Pipelines UI showing completed three-task job DAG with dependency arrows from task1_ingest to task2_metrics to task3_customers

  • Jobs can be run manually, on a schedule, or via triggers
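
The course creates this job in the Jobs and Pipelines UI. As a sketch of an alternative route, the same three-task chain can be defined with the Databricks Python SDK; the notebook paths and cluster id below are placeholders, not values from the course.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Three notebook tasks chained with depends_on, mirroring the DAG shown in the UI
job = w.jobs.create(
    name="transactions_pipeline",
    tasks=[
        jobs.Task(
            task_key="task1_ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/task1_ingest"),
            existing_cluster_id="<cluster-id>",  # placeholder
        ),
        jobs.Task(
            task_key="task2_metrics",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/task2_metrics"),
            depends_on=[jobs.TaskDependency(task_key="task1_ingest")],
            existing_cluster_id="<cluster-id>",
        ),
        jobs.Task(
            task_key="task3_customers",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/pipelines/task3_customers"),
            depends_on=[jobs.TaskDependency(task_key="task2_metrics")],
            existing_cluster_id="<cluster-id>",
        ),
    ],
)
print(f"Created job {job.job_id}")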

Running the job

Job run DAG with task1_ingest and task2_metrics succeeded green and task3_customers failed red


Running the job

Checking an error in task three

  • Error: the notebook references customer-id, but the table column is named Customer_ID
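
The failure comes down to a single column reference. A hypothetical before/after of the offending line (Amount is an assumed column):

from pyspark.sql import functions as F

df_clean = spark.table("transactions_clean")

# Fails: the table schema has no column named "customer-id"
# df_clean.groupBy("customer-id").agg(F.sum("Amount"))

# Works: the schema spells it Customer_ID
df_clean.groupBy("Customer_ID").agg(F.sum("Amount"))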

Running the job

Graph view with all three tasks succeeding

A timeline view of the same run


What is Lakeflow?

Comparison: Imperative (Jobs) vs. Declarative (Lakeflow)

  • Jobs → we manage each step ourselves
  • Lakeflow → we declare what each table should contain
  • Databricks handles ordering, retries, and compute

The @dlt.table pattern

@dlt.table(name="transactions_bronze")
def transactions_bronze():
    return spark.read.format("csv").schema(schema).load(FILE_PATH)

@dlt.table(name="transactions_silver")
def transactions_silver():
    return dlt.read("transactions_bronze").na.drop(...).filter(...)

@dlt.table(name="category_revenue_gold")
def category_revenue_gold():
    return dlt.read("transactions_silver").groupBy("Category").agg(...)

Pipeline run

Lakeflow pipeline DAG showing transactions_bronze 100K rows then transactions_silver 33K rows then category_revenue_gold 6 rows all green materialized views

  • Bronze (100K rows) → Silver (33K rows) → Gold (6 rows)

Notebooks, Jobs, or Lakeflow?

A layered view: Notebooks → Databricks Jobs → Lakeflow Pipelines

  • Notebooks → exploring, prototyping
  • Databricks Jobs → multi-step, scheduled pipelines
  • Lakeflow → fully managed, declarative

Let's practice!
