Unity Catalog and Lineage

Introduction to Databricks Lakehouse

Gang Wang

Senior Data Scientist

Why governance matters

  • Hundreds of tables across multiple teams
  • "Can the marketing team see customer financial data?"
  • "This report looks wrong - where does the underlying data actually come from?"
  • Unity Catalog answers both in one place: who can access what, and where data comes from

The Unity Catalog hierarchy

[Figure: Unity Catalog hierarchy tree view]

  • Metastore - top-level container, typically one per region in an account
  • Catalogs - like "production" or "development"
  • Schemas - group related objects (e.g. "sales")
  • Each schema contains tables, views, and functions
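
In SQL, the hierarchy appears as a three-level name, catalog.schema.table. The following is a minimal sketch of walking down the levels. It assumes a catalog named production with a sales schema and a customers table; these names are illustrative and may not exist in your workspace.

```sql
-- List the catalogs visible to you in the metastore
SHOW CATALOGS;

-- List the schemas inside one catalog (catalog name is an assumption)
SHOW SCHEMAS IN production;

-- List the tables and views inside one schema
SHOW TABLES IN production.sales;

-- Reference a table by its full three-level name: catalog.schema.table
SELECT * FROM production.sales.customers LIMIT 10;
```

Each statement only returns objects you have permission to see, so results can differ between users.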

Access control

```sql
-- Grant read access to a schema
GRANT SELECT
ON SCHEMA production.sales
TO `analytics_team`;

-- Revoke table-level access
REVOKE SELECT
ON TABLE production.sales.customers
FROM `intern_group`;
```

  • Permissions at every level of the hierarchy
  • Grant privileges such as SELECT, MODIFY, or CREATE TABLE on catalogs, schemas, or tables
  • Inherited - a grant on a catalog applies to all schemas and tables within it
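
Inheritance can be sketched as follows. Besides SELECT, a principal also needs USE CATALOG and USE SCHEMA on the parent objects before it can read a table. The group name analytics_team and the object names are illustrative assumptions; adapt them to your workspace.

```sql
-- Let the group see into the catalog and the schema
-- (needed before any table inside them can be read)
GRANT USE CATALOG ON CATALOG production TO `analytics_team`;
GRANT USE SCHEMA ON SCHEMA production.sales TO `analytics_team`;

-- SELECT granted on the schema is inherited by the tables and views inside it
GRANT SELECT ON SCHEMA production.sales TO `analytics_team`;

-- Inspect the grants on the schema
SHOW GRANTS ON SCHEMA production.sales;

-- Inspect the grants for one principal on one table
SHOW GRANTS `analytics_team` ON TABLE production.sales.customers;
```

Granting at the schema level keeps the number of statements small, but it also covers tables added to the schema later, so table-level grants may be the better fit for sensitive data.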

Data lineage

Where did this data come from?

[Figure: lineage flowchart] bronze_orders (source) → silver_orders (cleansed) → gold_daily_revenue (aggregated) → Revenue Dashboard (consumer)

  • Automatic tracking of table- and column-level relationships as queries run
  • Trace upstream sources and downstream consumers
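
Besides the lineage graph in the UI, captured lineage can be queried with SQL. A hedged sketch of an upstream lookup, assuming system tables are enabled, you can read system.access.table_lineage, and the gold table lives at production.sales.gold_daily_revenue (the catalog and schema are assumptions):

```sql
-- Upstream: which tables fed gold_daily_revenue in the last 30 days?
SELECT DISTINCT source_table_full_name
FROM system.access.table_lineage
WHERE target_table_full_name = 'production.sales.gold_daily_revenue'
  AND source_table_full_name IS NOT NULL            -- skip events with no table source
  AND event_time >= current_timestamp() - INTERVAL 30 DAYS;
```

Repeating the query with each returned source as the new target walks further up the chain, one hop at a time.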

Lineage in practice

  • Upstream - trace where data comes from
  • Downstream - see what depends on this table
  • Impact analysis - if I change this column, what breaks?
  • Debugging - a report shows wrong numbers, trace back to the source
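
Downstream lookups and impact analysis can be approximated with the lineage system tables. This is a sketch under assumptions: system tables are enabled, the table is named production.sales.silver_orders, and amount is a hypothetical column you plan to change.

```sql
-- Downstream: which tables are written from silver_orders?
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_full_name = 'production.sales.silver_orders'
  AND target_table_full_name IS NOT NULL;

-- Impact analysis: which downstream columns derive from silver_orders.amount?
SELECT DISTINCT target_table_full_name, target_column_name
FROM system.access.column_lineage
WHERE source_table_full_name = 'production.sales.silver_orders'
  AND source_column_name = 'amount';               -- hypothetical column name
```

These queries return only direct dependents; consumers further along the pipeline need the same query applied to each result.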

Summary

  • Unity Catalog - centralized governance for all data assets
  • Hierarchy - metastore → catalog → schema → tables, views, and functions
  • Access control - SQL grants at every level, inherited downward
  • Lineage - automatic tracking of data flow, upstream and downstream

Let's practice!
