The medallion architecture

Introduction to Databricks Lakehouse

Gang Wang

Senior Data Scientist

From raw to ready

  • Raw data lands in the lakehouse - now what?
  • Messy, unvalidated records need a path to clean insights
  • The medallion architecture solves this

The data quality problem

  • Raw data from dozens of sources
  • Missing fields, duplicates, mismatched types
  • Analysts need trustworthy data

Three layers, one purpose

Figure: a kitchen workflow in three stages, left to right: raw ingredients in a crate (Bronze), a prep station with chopped vegetables (Silver), a plated dish (Gold)


The bronze layer

  • Captures data as it arrives
  • Keeps inconsistent formats and null fields as-is
  • Append-only - nothing deleted or modified
  • Your safety net - you can always trace back to the raw record
{"pickup": "2024-03-15T08:23",
 "dropoff": "2024-03-15 8:41",
 "fare": null,
 "zone": "236",
 "distance": "4.2mi"}
{"pickup": "03/15/2024 09:10",
 "dropoff": "2024-03-15T09:32",
 "fare": 18.50,
 "zone": "unknown",
 "distance": 6.1}
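
The append-only idea above can be sketched in plain Python. This is a minimal illustration, not how Databricks implements bronze tables: the local JSON Lines file, the `append_to_bronze` and `read_bronze` helpers, and the `_source` / `_ingested_at` metadata fields are all hypothetical names chosen for the example. The two sample records are the ones shown on the slide, stored exactly as they arrived.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_to_bronze(path, raw_records, source):
    """Append raw records unchanged, adding only ingestion metadata."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    # Mode "a" appends to the file; existing rows are never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        for record in raw_records:
            row = {"_source": source, "_ingested_at": ingested_at, "raw": record}
            f.write(json.dumps(row) + "\n")


def read_bronze(path):
    """Read every bronze row back, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Hypothetical local stand-in for a bronze table.
bronze_path = os.path.join(tempfile.mkdtemp(), "bronze_taxi_trips.jsonl")

# First batch: messy values are kept as they are, including the null fare.
append_to_bronze(
    bronze_path,
    [{"pickup": "2024-03-15T08:23", "dropoff": "2024-03-15 8:41",
      "fare": None, "zone": "236", "distance": "4.2mi"}],
    source="feed_a",
)

# Second batch: a different date format and an unknown zone, also kept.
append_to_bronze(
    bronze_path,
    [{"pickup": "03/15/2024 09:10", "dropoff": "2024-03-15T09:32",
      "fare": 18.50, "zone": "unknown", "distance": 6.1}],
    source="feed_b",
)

bronze_rows = read_bronze(bronze_path)
print(len(bronze_rows), "rows in bronze")
```

Because nothing is cleaned or dropped at this stage, any later silver or gold value can be traced back to the raw row it came from.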

The silver layer

SELECT * FROM silver_taxi_trips
LIMIT 3;
pickup_ts   | dropoff_ts | fare  | zone_id | dist_km
2024-03-15  | 2024-03-15 | 14.20 | 236     | 6.8
2024-03-15… | 2024-03-15 | 18.50 | 142     | 9.8
2024-03-15  | 2024-03-15 | 22.00 | 79      | 12.1

  • Timestamps → proper datetime types
  • Nulls removed, duplicates eliminated
  • Schema enforced - consistent types
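
The three cleaning steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the `to_silver` and `parse_ts` helpers, the list of accepted timestamp formats, and the sample records are hypothetical, and rows that fail validation are simply dropped here. A real pipeline would more likely use Spark and might quarantine bad rows instead of discarding them.

```python
from datetime import datetime

# Assumed set of timestamp formats seen in the raw feed.
TS_FORMATS = ("%Y-%m-%dT%H:%M", "%Y-%m-%d %H:%M", "%m/%d/%Y %H:%M")


def parse_ts(value):
    """Return a datetime for any known format, or None if none match."""
    for fmt in TS_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            continue
    return None


def to_silver(raw_records):
    """Cast each record to a fixed schema, dropping nulls and duplicates."""
    seen = set()
    silver = []
    for rec in raw_records:
        try:
            row = (
                parse_ts(rec.get("pickup")),
                parse_ts(rec.get("dropoff")),
                float(rec["fare"]),   # fails on null
                int(rec["zone"]),     # fails on values like "unknown"
            )
        except (KeyError, TypeError, ValueError):
            continue  # missing, null, or uncastable field
        if None in row or row in seen:
            continue  # unparseable timestamp or exact duplicate
        seen.add(row)
        silver.append(dict(zip(("pickup_ts", "dropoff_ts", "fare", "zone_id"), row)))
    return silver


# Hypothetical raw records: one null fare, one duplicate, one bad zone.
raw = [
    {"pickup": "2024-03-15T08:23", "dropoff": "2024-03-15 8:41", "fare": None, "zone": "236"},
    {"pickup": "03/15/2024 09:10", "dropoff": "2024-03-15T09:32", "fare": 18.50, "zone": "142"},
    {"pickup": "2024-03-15 09:10", "dropoff": "2024-03-15T09:32", "fare": "18.5", "zone": 142},
    {"pickup": "2024-03-15T10:05", "dropoff": "2024-03-15T10:40", "fare": 22.0, "zone": "unknown"},
]

silver_rows = to_silver(raw)
for row in silver_rows:
    print(row)
```

Note that the second and third records only become recognizable as duplicates after their timestamps and numbers are cast to proper types, which is why typing comes before deduplication in this sketch.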

The gold layer

  • Business-level aggregates
  • Powers dashboards and executive reports
  • Optimized for fast consumption

SELECT * FROM gold_taxi_daily
LIMIT 3;
date       | zone_name      | avg_fare | trips
2024-03-15 | Upper East     | 16.80    | 1,247
2024-03-15 | Midtown        | 22.45    | 3,891
2024-03-15 | Financial Dist | 19.10    | 2,156
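
A gold table like the one above is a grouped aggregate over clean rows. The following plain-Python sketch shows the shape of that step; the `to_gold` helper and the sample input rows are hypothetical, and the input is assumed to already carry a zone name (in practice that would likely come from a join against a zone lookup table).

```python
from collections import defaultdict
from datetime import date

# Hypothetical cleaned rows, already validated and typed.
clean_rows = [
    {"date": date(2024, 3, 15), "zone_name": "Midtown", "fare": 20.0},
    {"date": date(2024, 3, 15), "zone_name": "Midtown", "fare": 25.0},
    {"date": date(2024, 3, 15), "zone_name": "Upper East", "fare": 16.0},
]


def to_gold(rows):
    """Aggregate trips to one row per (date, zone) with average fare and count."""
    fares_by_group = defaultdict(list)
    for row in rows:
        fares_by_group[(row["date"], row["zone_name"])].append(row["fare"])
    return [
        {
            "date": day,
            "zone_name": zone,
            "avg_fare": round(sum(fares) / len(fares), 2),
            "trips": len(fares),
        }
        for (day, zone), fares in sorted(fares_by_group.items())
    ]


gold_rows = to_gold(clean_rows)
for row in gold_rows:
    print(row)
```

Precomputing the aggregate once means a dashboard reads a few summary rows instead of rescanning every trip.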

Who typically uses which layer?

Figure: three stacked layers with their typical users: Gold (top) - analyst, Silver (middle) - data scientist, Bronze (bottom) - data engineer

  • These are common patterns, not strict rules - any role may access any layer

Summary

  • Bronze - captures everything raw, your safety net
  • Silver - cleans and validates, reliable foundation
  • Gold - aggregates for business, served insights
  • Each layer → different consumers, different standards


Let's practice!
