Putting it all together

Introduction to Databricks Lakehouse

Gang Wang

Senior Data Scientist

The scenario

$$

  • You're a data engineer starting a new project at a retail company
  • Goal: set up compute, verify the data pipeline, check governance, and prepare for deployment

$$

recraft: half: A data engineer at a modern desk with a laptop showing a Databricks-style interface, with pipeline diagrams on a whiteboard behind them, representing onboarding to a new data project


Step 1: Choose the right compute

$$

  • The pipeline runs nightly on a schedule
  • No interactive development needed
  • A jobs cluster with LTS runtime is the right call
  • Set autoscaling (2–6 workers) for variable data volume

$$

             All-Purpose          Jobs Cluster
Mode         Interactive          Automated
Management   Manual               Auto-terminates
Cost         Higher (idle time)   Cost-optimized
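The choice above can be expressed directly in an Asset Bundle job definition. A minimal sketch, assuming a job named nightly_sales_etl; the spark_version and node_type_id values are illustrative and vary by workspace and cloud:

```yaml
resources:
  jobs:
    nightly_sales_etl:
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12   # an LTS runtime (illustrative version)
            node_type_id: i3.xlarge           # example node type; differs per cloud
            autoscale:
              min_workers: 2                  # scales between 2 and 6 workers
              max_workers: 6
```

Because this is a jobs cluster, it is created when the scheduled run starts and terminates when it finishes, so you pay nothing for idle time.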

Step 2: Verify the medallion pipeline

$$

SELECT COUNT(*) FROM bronze_sales;
-- 1,247,832 rows

SELECT COUNT(*) FROM silver_sales;
-- 1,189,456 rows (nulls removed)

SELECT COUNT(*) FROM gold_daily_revenue;
-- 365 rows (daily aggregates)

$$

  • Bronze: raw data, highest row count
  • Silver: cleaned, nulls and duplicates removed
  • Gold: aggregated for business reporting
  • Row counts decrease as quality increases
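The same sanity check can be scripted. A minimal sketch using an in-memory SQLite database as a stand-in for the lakehouse tables (the table names match the slides; the sample rows are invented):

```python
import sqlite3

# Tiny in-memory tables mimicking the medallion layers
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE bronze_sales (sale_date TEXT, amount REAL)")
cur.executemany("INSERT INTO bronze_sales VALUES (?, ?)", [
    ("2024-01-01", 10.0),
    ("2024-01-01", None),   # null amount, dropped at the silver layer
    ("2024-01-02", 5.0),
    ("2024-01-02", 5.0),
])

# Silver: cleaned copy with nulls removed
cur.execute("""CREATE TABLE silver_sales AS
               SELECT * FROM bronze_sales WHERE amount IS NOT NULL""")

# Gold: one aggregate row per day
cur.execute("""CREATE TABLE gold_daily_revenue AS
               SELECT sale_date, SUM(amount) AS revenue
               FROM silver_sales GROUP BY sale_date""")

counts = {t: cur.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in ("bronze_sales", "silver_sales", "gold_daily_revenue")}
print(counts)  # {'bronze_sales': 4, 'silver_sales': 3, 'gold_daily_revenue': 2}

# Row counts must decrease (or stay equal) as data quality increases
assert counts["bronze_sales"] >= counts["silver_sales"] >= counts["gold_daily_revenue"]
```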

Step 3: Check governance

$$

  • Navigate to Unity Catalog and inspect the gold_daily_revenue table
  • Trace lineage upstream to confirm it reads from silver_sales
  • Verify access controls: only the analytics team has SELECT on gold
  • Confirm Delta Sharing is configured for the external partner

$$
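Access controls in Unity Catalog are managed with standard SQL GRANT statements. A sketch, assuming a catalog and schema named main.sales and a group named analytics-team (both illustrative):

```sql
-- Grant read access on the gold table to the analytics group only
GRANT SELECT ON TABLE main.sales.gold_daily_revenue TO `analytics-team`;

-- Confirm what has been granted
SHOW GRANTS ON TABLE main.sales.gold_daily_revenue;
```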

recraft: half: A magnifying glass hovering over a data lineage graph showing connected table nodes with checkmarks, representing governance verification


Step 4: Prepare for deployment

$$

  • Review the databricks.yml Asset Bundle
  • Confirm the nightly job resource is defined
  • Check that production target points to the right workspace path
  • Run databricks bundle validate before deploying

$$

targets:
  production:
    workspace:
      root_path: /Shared/production
resources:
  jobs:
    nightly_sales_etl:
      schedule:
        quartz_cron: "0 0 3 * * ?"
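
The bundle is checked and deployed with the Databricks CLI; the -t flag selects the target defined above:

```shell
databricks bundle validate -t production   # check the bundle configuration
databricks bundle deploy -t production     # deploy the nightly job to the workspace
```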

Summary

$$

  • Compute: match cluster type to workload (jobs cluster for automation)
  • Data: verify medallion layers are flowing correctly
  • Governance: trace lineage, check access, confirm sharing
  • Deployment: review the Asset Bundle, validate, deploy

$$

Let's practice!

