Cluster types and runtimes

Introduction to Databricks Lakehouse

Gang Wang

Senior Data Scientist

What is a cluster?

  • A set of virtual machines that run your code
  • One driver node coordinates the work
  • One or more worker nodes do the processing
  • Clusters read data from cloud storage
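
The driver/worker layout above maps onto a handful of fields in a cluster definition. A minimal sketch, assuming the field names used by the Databricks Clusters API (`spark_version`, `node_type_id`, `num_workers`); the cluster name and node type are placeholder values, and `total_nodes` is a hypothetical helper, not part of any Databricks library.

```python
# Sketch of a cluster definition as a plain dictionary.
# The name and node type below are illustrative placeholders.
cluster_spec = {
    "cluster_name": "demo-cluster",       # hypothetical name
    "spark_version": "15.4.x-scala2.12",  # Databricks Runtime version key
    "node_type_id": "i3.xlarge",          # VM size; depends on your cloud
    "num_workers": 3,                     # worker nodes that do the processing
}


def total_nodes(spec: dict) -> int:
    """Return the number of VMs in the cluster: the workers plus one driver."""
    return spec["num_workers"] + 1


print(total_nodes(cluster_spec))  # 3 workers + 1 driver
```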

hierarchy: Driver Node, Worker 1, Worker 2, Worker 3, Cloud Storage

All-purpose clusters

  • Interactive development and ad-hoc queries
  • Shared across multiple users on a team
  • Stay running until you stop them manually or an auto-termination timeout expires
  • Think of it as your personal car - always available, but you pay even when it's parked
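
The parked-car analogy can be made concrete with a little arithmetic. The sketch below uses a made-up hourly rate (not a real Databricks price) and a hypothetical helper, `idle_cost`, to show how much of a day's bill comes from hours when nobody is running code.

```python
HOURLY_RATE = 2.0  # assumed cost per cluster-hour, for illustration only


def idle_cost(hours_running: float, hours_in_use: float,
              rate: float = HOURLY_RATE) -> float:
    """Cost of the hours the cluster was up but not doing any work."""
    if hours_in_use > hours_running:
        raise ValueError("hours_in_use cannot exceed hours_running")
    return (hours_running - hours_in_use) * rate


# A cluster left on for 24 hours but used for only 6 of them.
print(idle_cost(hours_running=24, hours_in_use=6))  # 36.0
```

Setting an auto-termination timeout on the cluster (the `autotermination_minutes` field in the Clusters API) is one way to cap this idle time.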

image: A personal car parked in a driveway, always available but costing money even when idle

Jobs clusters

  • Created automatically when a job starts
  • Run a single job, then terminate automatically
  • No manual management required
  • Cost-optimized for production workloads
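
A jobs cluster is declared as part of the job itself rather than created by hand. A minimal sketch of such a definition, assuming the Jobs API 2.1 field names (`tasks`, `notebook_task`, `new_cluster`); the job name, notebook path, and node type are hypothetical placeholders, and `uses_jobs_cluster` is an illustrative helper.

```python
job_spec = {
    "name": "nightly-etl",  # hypothetical job name
    "tasks": [
        {
            "task_key": "load_sales",
            # Placeholder notebook path.
            "notebook_task": {"notebook_path": "/Workspace/etl/load_sales"},
            # A fresh cluster is created for the run and terminated afterwards.
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",  # placeholder VM size
                "num_workers": 2,
            },
        }
    ],
}


def uses_jobs_cluster(task: dict) -> bool:
    """True if the task brings its own cluster instead of an existing one."""
    return "new_cluster" in task and "existing_cluster_id" not in task


print(all(uses_jobs_cluster(t) for t in job_spec["tasks"]))  # True
```

A task that pointed at `existing_cluster_id` instead would be running on an all-purpose cluster.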

image: A taxi cab with a glowing meter on top, representing on-demand compute that starts and stops automatically

When to use which?

comparison:

  All-Purpose       | Jobs
  ------------------|------------------
  Interactive dev   | Nightly ETL
  Notebook work     | Scheduled reports
  Ad-hoc SQL        | Auto testing
  Team sharing      | Production

  • Common mistake: running production pipelines on all-purpose clusters
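
The comparison above boils down to one question: is a person interacting with the cluster, or is the work running unattended? A small sketch of that rule; the workload labels and the `recommend_cluster` function are hypothetical, chosen to mirror the examples on this slide.

```python
# Workload labels copied from the comparison; lowercase for easy matching.
INTERACTIVE = {"interactive dev", "notebook work", "ad-hoc sql", "team sharing"}
AUTOMATED = {"nightly etl", "scheduled reports", "auto testing", "production"}


def recommend_cluster(workload: str) -> str:
    """Map a workload label to the cluster type suggested on the slide."""
    key = workload.strip().lower()
    if key in INTERACTIVE:
        return "all-purpose"
    if key in AUTOMATED:
        return "jobs"
    raise ValueError(f"unknown workload: {workload!r}")


print(recommend_cluster("Nightly ETL"))  # jobs
print(recommend_cluster("Ad-hoc SQL"))   # all-purpose
```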

Databricks Runtime

  • Versioned software stack that runs on every cluster
  • Includes Apache Spark, Delta Lake, Python, R, Scala
  • LTS versions - long-term support, recommended for production
  • ML Runtime adds pre-installed ML libraries

layers: Runtime 15.4 LTS, Apache Spark 3.5.x, Delta Lake 3.2.x, Python 3.11 / Scala 2.12 / R 4.3, System Libraries
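
When a cluster is defined in code, the runtime is selected with a version key. The sketch below assumes keys shaped like `15.4.x-scala2.12` (standard runtime) and `15.4.x-cpu-ml-scala2.12` (ML Runtime); check the runtime list in your own workspace for the exact strings. `parse_runtime_key` and `RuntimeInfo` are hypothetical helpers, not part of any Databricks library.

```python
import re
from typing import NamedTuple


class RuntimeInfo(NamedTuple):
    major: int
    minor: int
    is_ml: bool
    scala: str


# Assumed key pattern: <major>.<minor>.x-[cpu-ml-|gpu-ml-]scala<version>
_KEY = re.compile(r"^(\d+)\.(\d+)\.x-(?:(?:cpu|gpu)-ml-)?scala(\d+\.\d+)$")


def parse_runtime_key(key: str) -> RuntimeInfo:
    """Split a runtime version key into its parts."""
    match = _KEY.match(key)
    if match is None:
        raise ValueError(f"unrecognized runtime key: {key!r}")
    major, minor, scala = match.groups()
    return RuntimeInfo(int(major), int(minor), "-ml-" in key, scala)


print(parse_runtime_key("15.4.x-scala2.12"))
print(parse_runtime_key("15.4.x-cpu-ml-scala2.12").is_ml)  # True
```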

Summary

  • All-purpose clusters - interactive, shared, manually managed
  • Jobs clusters - automated, single-job, cost-optimized
  • Databricks Runtime - versioned software stack (use LTS for production)
  • Match the cluster type to the workload to control costs

Let's practice!
