Cluster types and runtimes
Introduction to Databricks Lakehouse
Gang Wang
Senior Data Scientist
What is a cluster?
A set of virtual machines that run your code
One driver node coordinates the work
One or more worker nodes do the processing
Clusters read data from cloud storage
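
To make the driver/worker split concrete, here is a minimal sketch of a cluster described as a Python dictionary. The field names are modeled on the Databricks Clusters API; the cluster name, runtime key, and VM type are placeholder assumptions, and `total_nodes` is a hypothetical helper, not part of any library.

```python
# Illustrative cluster description. Field names are modeled on the
# Databricks Clusters API; every value is a placeholder assumption.
cluster_spec = {
    "cluster_name": "demo-cluster",       # hypothetical name
    "spark_version": "13.3.x-scala2.12",  # example runtime key
    "node_type_id": "i3.xlarge",          # example VM type, cloud-specific
    "num_workers": 4,                     # counts worker nodes only
}


def total_nodes(spec: dict) -> int:
    """Return the number of VMs: one driver plus the configured workers."""
    return 1 + spec["num_workers"]


print(total_nodes(cluster_spec))  # 5
```

The driver is not counted in `num_workers`, so a cluster with four workers runs on five virtual machines.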
All-purpose clusters
Interactive development and ad-hoc queries
Shared across multiple users on a team
Stay running until you manually stop them (or until an idle auto-termination timeout, if one is configured, shuts them down)
Think of it as your personal car - always available, but you pay even when it's parked
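
The "parked car" cost can be put into numbers. The sketch below assumes a cluster spec with an `autotermination_minutes` setting (a field in the Databricks Clusters API) and uses a hypothetical helper, `idle_share`, to measure how much of the billed time nobody was using the cluster. All names and values are illustrative.

```python
# Illustrative all-purpose cluster description; values are placeholders.
all_purpose_spec = {
    "cluster_name": "team-sandbox",       # hypothetical name
    "spark_version": "13.3.x-scala2.12",  # example runtime key
    "node_type_id": "i3.xlarge",          # example VM type, cloud-specific
    "num_workers": 2,
    "autotermination_minutes": 60,        # stop after 60 idle minutes
}


def idle_share(hours_running: float, hours_in_use: float) -> float:
    """Fraction of billed hours during which the cluster sat unused."""
    if hours_running <= 0:
        raise ValueError("hours_running must be positive")
    if not 0 <= hours_in_use <= hours_running:
        raise ValueError("hours_in_use must be between 0 and hours_running")
    return (hours_running - hours_in_use) / hours_running


# A cluster left on for 10 hours but used for only 6 of them.
print(idle_share(10, 6))  # 0.4
```

An idle timeout shortens the "parked" stretch, but any time between the last query and the shutdown is still billed.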
Jobs clusters
Created automatically when a job starts
Run a single task, then terminate
No manual management required
Cost-optimized for production workloads
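
As a rough sketch of what "created automatically when a job starts" looks like in practice, the function below builds a job definition whose task carries its own `new_cluster` block instead of pointing at a running cluster. The layout is modeled on the Databricks Jobs API 2.1 request body; the job name, notebook path, runtime key, and VM type are assumptions, and `build_job_payload` is a hypothetical helper. Nothing is sent to a workspace.

```python
import json


def build_job_payload(name: str, notebook_path: str, num_workers: int) -> dict:
    """Build a job definition whose single task brings its own cluster."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # The cluster is described inline, so it is created for
                # this run and terminated when the run finishes.
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # example key
                    "node_type_id": "i3.xlarge",          # example VM type
                    "num_workers": num_workers,
                },
            }
        ],
    }


# Hypothetical job name and notebook path.
payload = build_job_payload("nightly-etl", "/Repos/team/etl/nightly", 2)
print(json.dumps(payload, indent=2))
```

Because the task has no reference to an existing cluster, there is nothing to start or stop by hand.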
When to use which?
Common mistake: running production pipelines on all-purpose clusters
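
The rule of thumb from the previous slides can be written as a small decision helper, followed by a back-of-the-envelope cost comparison. This is only a sketch: `Workload`, `recommend_cluster`, and `daily_cost` are hypothetical names, and the hourly rates are made-up numbers chosen to illustrate the mistake, not real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    interactive: bool  # someone explores or debugs by hand
    scheduled: bool    # runs unattended on a schedule or trigger


def recommend_cluster(workload: Workload) -> str:
    """Unattended scheduled work goes to a jobs cluster, the rest does not."""
    if workload.scheduled and not workload.interactive:
        return "jobs"
    return "all-purpose"


def daily_cost(rate_per_hour: float, billed_hours: float) -> float:
    """Cost for one day at a flat hourly rate."""
    return rate_per_hour * billed_hours


pipeline = Workload(interactive=False, scheduled=True)
print(recommend_cluster(pipeline))  # jobs

# Made-up rates: a 2-hour pipeline on an all-purpose cluster that is
# left running all day, versus a jobs cluster billed only for the run.
print(daily_cost(rate_per_hour=4.0, billed_hours=24))  # 96.0
print(daily_cost(rate_per_hour=2.0, billed_hours=2))   # 4.0
```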
Databricks Runtime
Versioned software stack that runs on every cluster
Includes Apache Spark, Delta Lake, Python, R, Scala
LTS versions - long-term support, recommended for production
ML Runtime adds pre-installed ML libraries
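
One way to pick a runtime programmatically is to filter a list of available versions. The sketch below works on a hand-written sample list. The `key`/`name` shape and the naming patterns ("LTS" in the display name, `-ml-` in the key) are assumptions about how Databricks labels runtimes, and the entries are illustrative, not fetched from a workspace.

```python
# Hand-written sample entries, assumed to resemble a runtime listing.
sample_versions = [
    {"key": "14.2.x-scala2.12",
     "name": "14.2 (includes Apache Spark 3.5.0, Scala 2.12)"},
    {"key": "13.3.x-scala2.12",
     "name": "13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)"},
    {"key": "13.3.x-cpu-ml-scala2.12",
     "name": "13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)"},
]


def is_lts(version: dict) -> bool:
    """Assumption: long-term-support runtimes carry 'LTS' in their name."""
    return " LTS" in version["name"]


def is_ml(version: dict) -> bool:
    """Assumption: ML runtimes carry '-ml-' in their key."""
    return "-ml-" in version["key"]


production_candidates = [v["key"] for v in sample_versions if is_lts(v)]
ml_keys = [v["key"] for v in sample_versions if is_ml(v)]

print(production_candidates)
print(ml_keys)
```

Filtering on LTS first and ML second keeps production jobs on a supported runtime while still letting ML workloads opt in to the extra libraries.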
Summary
All-purpose clusters - interactive, shared, manually managed
Jobs clusters - automated, single-task, cost-optimized
Databricks Runtime - versioned software stack (use LTS for production)
Match the cluster type to the workload to control costs
Let's practice!