Putting it all together

Introduction to Databricks Lakehouse

Gang Wang

Senior Data Scientist

The scenario

$$

  • You're a data engineer starting a new project at a retail company
  • Goal: set up compute, verify the data pipeline, check governance, and prepare for deployment

$$

recraft: half: A data engineer at a modern desk with a laptop showing a Databricks-style interface, with pipeline diagrams on a whiteboard behind them, representing onboarding to a new data project


Step 1: Choose the right compute

$$

  • The pipeline runs nightly on a schedule
  • No interactive development needed
  • A jobs cluster with LTS runtime is the right call
  • Set autoscaling (2–6 workers) for variable data volume

$$

             All-Purpose          Jobs Cluster
Mode         Interactive          Automated
Management   Manual               Auto-terminates
Cost         Higher (idle time)   Cost-optimized
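The choice above can be expressed directly in an Asset Bundle job definition. A minimal sketch, assuming a job named nightly_sales_etl; the spark_version and node_type_id values are illustrative and vary by workspace and cloud:

```yaml
resources:
  jobs:
    nightly_sales_etl:
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12   # an LTS runtime (illustrative version)
            node_type_id: i3.xlarge           # example node type; differs per cloud
            autoscale:
              min_workers: 2                  # scales between 2 and 6 workers
              max_workers: 6
```

Because this is a jobs cluster, it is created when the scheduled run starts and terminates when it finishes, so you pay nothing for idle time.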

Step 2: Verify the medallion pipeline

$$

SELECT COUNT(*) FROM bronze_sales;
-- 1,247,832 rows

SELECT COUNT(*) FROM silver_sales;
-- 1,189,456 rows (nulls removed)

SELECT COUNT(*) FROM gold_daily_revenue;
-- 365 rows (daily aggregates)

$$

  • Bronze: raw data, highest row count
  • Silver: cleaned, nulls and duplicates removed
  • Gold: aggregated for business reporting
  • Row counts decrease as quality increases
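The same sanity check can be scripted. A minimal sketch using an in-memory SQLite database as a stand-in for the lakehouse tables (the table names match the slides; the sample rows are invented):

```python
import sqlite3

# Tiny in-memory tables mimicking the medallion layers
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE bronze_sales (sale_date TEXT, amount REAL)")
cur.executemany("INSERT INTO bronze_sales VALUES (?, ?)", [
    ("2024-01-01", 10.0),
    ("2024-01-01", None),   # null amount, dropped at the silver layer
    ("2024-01-02", 5.0),
    ("2024-01-02", 5.0),
])

# Silver: cleaned copy with nulls removed
cur.execute("""CREATE TABLE silver_sales AS
               SELECT * FROM bronze_sales WHERE amount IS NOT NULL""")

# Gold: one aggregate row per day
cur.execute("""CREATE TABLE gold_daily_revenue AS
               SELECT sale_date, SUM(amount) AS revenue
               FROM silver_sales GROUP BY sale_date""")

counts = {t: cur.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in ("bronze_sales", "silver_sales", "gold_daily_revenue")}
print(counts)  # {'bronze_sales': 4, 'silver_sales': 3, 'gold_daily_revenue': 2}

# Row counts must decrease (or stay equal) as data quality increases
assert counts["bronze_sales"] >= counts["silver_sales"] >= counts["gold_daily_revenue"]
```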

Step 3: Check governance

$$

  • Navigate to Unity Catalog and inspect the gold_daily_revenue table
  • Trace lineage upstream to confirm it reads from silver_sales
  • Verify access controls: only the analytics team has SELECT on gold
  • Confirm Delta Sharing is configured for the external partner

$$
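Access controls in Unity Catalog are managed with standard SQL GRANT statements. A sketch, assuming a catalog and schema named main.sales and a group named analytics-team (both illustrative):

```sql
-- Grant read access on the gold table to the analytics group only
GRANT SELECT ON TABLE main.sales.gold_daily_revenue TO `analytics-team`;

-- Confirm what has been granted
SHOW GRANTS ON TABLE main.sales.gold_daily_revenue;
```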

recraft: half: A magnifying glass hovering over a data lineage graph showing connected table nodes with checkmarks, representing governance verification


Step 4: Prepare for deployment

$$

  • Review the databricks.yml Asset Bundle
  • Confirm the nightly job resource is defined
  • Check that production target points to the right workspace path
  • Run databricks bundle validate before deploying

$$

targets:
  production:
    workspace:
      root_path: /Shared/production
resources:
  jobs:
    nightly_sales_etl:
      schedule:
        quartz_cron: "0 0 3 * * ?"
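
The bundle is checked and deployed with the Databricks CLI; the -t flag selects the target defined above:

```shell
databricks bundle validate -t production   # check the bundle configuration
databricks bundle deploy -t production     # deploy the nightly job to the workspace
```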

Summary

$$

  • Compute: match cluster type to workload (jobs cluster for automation)
  • Data: verify medallion layers are flowing correctly
  • Governance: trace lineage, check access, confirm sharing
  • Deployment: review the Asset Bundle, validate, deploy

$$

Let's practice!

