Why the Lakehouse?

Introduction to Databricks Lakehouse

Gang Wang

Senior Data Scientist

What you'll learn

Explore Databricks Lakehouse

  1. How data is organized in a lakehouse
  2. How to manage compute and notebooks
  3. How to govern and share data securely
  4. How to deploy your work to production

Meet the Instructor

       

Gang Wang

  • Senior Data Scientist

  • Origin Energy, Australia (2021-Present)

  • 9+ years of post-PhD experience

       


The data lake promise

Image: a large open warehouse filled with colorful barrels and crates of various sizes, representing diverse raw data storage

  • Store any data at low cost
  • Structured, semi-structured, unstructured
  • Flexible schema-on-read
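
Schema-on-read means the data is stored as-is and a structure is applied only when someone reads it. A minimal sketch in plain Python, assuming raw events arrive as JSON lines; the field names and `READ_SCHEMA` are hypothetical and chosen only for illustration:

```python
import json

# Raw events as they might land in a data lake: JSON lines, no enforced schema.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.90", "country": "AU"}',
    '{"user_id": "7", "amount": "5", "note": "extra field"}',
    '{"user_id": "13"}',
]

# Hypothetical schema chosen by the reader, not by the writer.
READ_SCHEMA = {"user_id": int, "amount": float}


def read_with_schema(lines, schema):
    """Parse each raw line and cast only the fields named in the schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None; nothing was rejected at write time.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
print(rows)
```

Note that the third record has no `amount` and was still stored. That flexibility is the appeal, and also the reason quality problems surface only at read time.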

The lake's dark side

  • No guarantees on data quality
  • No built-in governance or access control
  • "Data swamp" - hard to find trustworthy data
  • Separate tools needed for analytics and AI

Image: a dark, murky swamp with tangled vines and scattered files sinking into muddy water, representing a data swamp of lost and unreliable data


The warehouse trade-off

  • Reliable and performant
  • Strong governance
  • But expensive and rigid
  • Largely limited to structured data

Image: a tidy, organized library with neatly stacked books on shelves, representing structured, reliable data storage


Enter the Lakehouse

  • Data Lake: flexible, cheap, unstructured
  • Data Warehouse: reliable, governed, rigid


What makes it work in Databricks?

  • Open table format (Delta Lake, built on Parquet files)
  • ACID transactions on the lake
  • Unified governance (Unity Catalog)
  • One platform for analytics, AI, and apps
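
How can plain files on a lake behave transactionally? A rough sketch of the idea, assuming a simplified log format: each commit is a numbered log file, and a commit succeeds only if its writer is the first to create that number. The `commit` and `visible_files` helpers and the `op`/`file` keys are hypothetical; this is not Delta Lake's actual implementation or API.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to publish `actions` as log entry `version`; return True on success."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        json.dump(actions, handle)
    return True


def visible_files(log_dir):
    """Replay the log in order to find which data files readers should see."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as handle:
            for action in json.load(handle):
                if action["op"] == "add":
                    files.add(action["file"])
                elif action["op"] == "remove":
                    files.discard(action["file"])
    return sorted(files)


log_dir = tempfile.mkdtemp()
first = commit(log_dir, 0, [{"op": "add", "file": "part-a.parquet"}])
# Two writers race for version 1; only one of them can win.
writer_one = commit(log_dir, 1, [{"op": "add", "file": "part-b.parquet"}])
writer_two = commit(log_dir, 1, [{"op": "add", "file": "part-c.parquet"}])
print(first, writer_one, writer_two, visible_files(log_dir))
```

The losing writer's data never becomes visible to readers, which is the all-or-nothing behavior the "A" in ACID refers to. A real system would also retry the losing commit and guard against readers seeing a half-written log entry.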

Image: four stacked building blocks forming a tower, labeled from bottom to top: Open Formats (Delta Lake), ACID Transactions, Unified Governance (Unity Catalog), One Platform for Analytics and AI


Lakehouse improves data quality

Raw Data → Schema Enforcement → ACID Transactions → Governed and Trusted

  • Schema enforcement at write time
  • Transaction logs track every change
  • Time travel - query historical versions
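
The three ideas above can be sketched together with a toy in-memory table. `ToyTable` and the `meter_id`/`kwh` columns are hypothetical, and the class only mimics the behavior conceptually; it is not how Delta Lake is implemented.

```python
class ToyTable:
    """In-memory table that rejects bad rows and keeps every version."""

    def __init__(self, schema):
        self.schema = schema  # column name -> required Python type
        self.log = []         # one entry per committed write

    def append(self, rows):
        # Schema enforcement: validate the whole batch before committing any of it.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"{column!r} must be {expected.__name__}")
        self.log.append(list(rows))
        return len(self.log) - 1  # version number of this commit

    def read(self, version=None):
        """Return rows as of `version` (latest if omitted) by replaying the log."""
        last = len(self.log) - 1 if version is None else version
        return [row for batch in self.log[: last + 1] for row in batch]


table = ToyTable({"meter_id": int, "kwh": float})
v0 = table.append([{"meter_id": 1, "kwh": 3.5}])
v1 = table.append([{"meter_id": 2, "kwh": 4.0}])

try:
    table.append([{"meter_id": 3, "kwh": "high"}])  # wrong type, rejected
    rejected = False
except ValueError:
    rejected = True

print(rejected, len(table.read()), len(table.read(version=v0)))
```

On an actual Delta table, the time-travel read is typically written in SQL as `SELECT * FROM my_table VERSION AS OF 0`, where `my_table` is a placeholder name.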

Summary

  • Data lakes - flexible but unreliable
  • Data warehouses - reliable but rigid and costly
  • Lakehouse - combines both, one platform
  • Powered by Delta Lake, ACID transactions, and Unity Catalog

Let's practice!
