Lakehouse Federation

Introduction to Databricks Lakehouse

Gang Wang

Senior Data Scientist

The problem: data silos

$$

recraft: half: Multiple disconnected islands with data tables on each island and bridges missing between them, representing data silos across different systems

$$

  • Data spread across multiple systems
  • PostgreSQL, MySQL, Snowflake, SQL Server
  • Moving data is slow and creates copies
  • You need answers that span all your data

What is Lakehouse Federation?

$$

architecture: Lakehouse Federation

Setting up a connection

$$

-- Create a connection
CREATE CONNECTION partner_pg
TYPE postgresql
OPTIONS (
  host 'db.partner.com',
  port '5432',
  user 'reader',
  password secret('scope', 'key')
);

$$

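-- Create a foreign catalog
-- (PostgreSQL connections require the source database name;
--  'partner_db' below is an assumed placeholder)
CREATE FOREIGN CATALOG partner_db
USING CONNECTION partner_pg
OPTIONS (database 'partner_db');

-- Query the external table
SELECT *
FROM partner_db.public.orders
LIMIT 10;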
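
Before querying, it can help to confirm that the connection and foreign catalog were registered as expected. The statements below are a minimal sketch that reuses the `partner_pg` and `partner_db` names from this slide; what they return depends entirely on the source database.

```sql
-- List the connections visible to the current user
SHOW CONNECTIONS;

-- Inspect the connection created above
DESCRIBE CONNECTION partner_pg;

-- Browse what the foreign catalog exposes from the source
SHOW SCHEMAS IN partner_db;
SHOW TABLES IN partner_db.public;
```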

When to federate vs. when to ingest

$$

               Federate            Ingest
Access         Real-time           Batch copy
Query volume   Low                 High
Latency        Source-dependent    Low
Compliance     No copies needed    Data moves
Transforms     Limited             Full medallion

$$

  • Federate for real-time, low-volume, or compliance-restricted access
  • Ingest for high-volume, repeated, performance-critical workloads
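
When a federated table turns out to be queried heavily, one way to switch to ingestion is to copy it into a managed table. This is a sketch under assumptions: `main.bronze.partner_orders` is a hypothetical target name, and a one-off `CREATE TABLE AS SELECT` stands in for whatever scheduled ingestion pipeline a real workload would use.

```sql
-- Federate: each query reads live from the source system
SELECT COUNT(*) AS order_count
FROM partner_db.public.orders;

-- Ingest: copy the data once into a lakehouse table
-- (main.bronze.partner_orders is a hypothetical target)
CREATE TABLE main.bronze.partner_orders AS
SELECT *
FROM partner_db.public.orders;

-- Later queries read the local copy instead of the source
SELECT COUNT(*) AS order_count
FROM main.bronze.partner_orders;
```

The copy is a snapshot taken when the statement runs, so it needs refreshing to stay current; that is the trade-off against real-time federated access.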

Federation in Unity Catalog

$$

  • Federated tables appear in the Unity Catalog hierarchy
  • Same access control and lineage as local tables
  • Join federated tables with local lakehouse tables in a single query
  • Supported sources: PostgreSQL, MySQL, Snowflake, SQL Server, and more
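
The points above can be sketched in SQL. The group name `analysts`, the local table `main.sales.customers`, and the columns `customer_id`, `segment`, and `amount` are all hypothetical and used only for illustration.

```sql
-- Grant access to the federated table with the same
-- Unity Catalog privileges used for local tables
GRANT USE CATALOG ON CATALOG partner_db TO `analysts`;
GRANT USE SCHEMA ON SCHEMA partner_db.public TO `analysts`;
GRANT SELECT ON TABLE partner_db.public.orders TO `analysts`;

-- Join a federated table with a local lakehouse table
-- (table and column names below are assumed)
SELECT c.segment,
       SUM(o.amount) AS total_amount
FROM main.sales.customers AS c
JOIN partner_db.public.orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.segment;
```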

$$

recraft: half: Two modern buildings connected by a glowing bridge of data streams, with a central catalog directory in the middle, representing federated data governance

Summary

$$

  • Lakehouse Federation queries external data without copying it
  • Set up a connection, create a foreign catalog, query with standard SQL
  • Federate for real-time, low-volume, or compliance-restricted access
  • Ingest for high-volume, repeated, performance-critical workloads

Let's practice!
