Cluster creation and management

Databricks with the Python SDK

Avi Steinberg

Senior Software Engineer

Serverless vs. managed infrastructure

Serverless

Run on infrastructure fully managed by Databricks
Focus on code instead of infrastructure
Pay for compute on-demand

Managed

More control over configuration
More cost-effective for long-running jobs or predictable workloads
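
The trade-off above can be made concrete with a small back-of-the-envelope calculation. The sketch below assumes a deliberately simplified billing model: serverless is charged only for the hours the workload is actually busy, while a managed cluster is charged for every hour it is provisioned. The function name `estimate_costs` and both hourly rates are hypothetical and are not Databricks prices.

```python
def estimate_costs(busy_hours: float, provisioned_hours: float,
                   serverless_rate: float, managed_rate: float) -> dict:
    """Compare costs under a simplified, hypothetical billing model.

    Serverless is billed only for busy hours; a managed cluster is billed
    for every hour it is provisioned, whether busy or idle.
    """
    serverless = busy_hours * serverless_rate
    managed = provisioned_hours * managed_rate
    cheaper = "serverless" if serverless < managed else "managed"
    return {"serverless": serverless, "managed": managed, "cheaper": cheaper}


# Hypothetical rates: serverless 1.0 per hour, managed 0.6 per hour.
# A steady 24-hour job keeps the managed cluster busy the whole time.
print(estimate_costs(24, 24, 1.0, 0.6)["cheaper"])  # managed
# A bursty workload that is busy for 2 of 24 provisioned hours.
print(estimate_costs(2, 24, 1.0, 0.6)["cheaper"])   # serverless
```

Under these assumed numbers, the managed cluster wins only when it stays busy for most of the time it is provisioned, which is why it suits long-running or predictable workloads.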

Create a Databricks Spark cluster

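from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Pick the most recent Databricks Runtime version available in the workspace
latest = w.clusters.select_spark_version(latest=True)

# Create the cluster and wait until it is running
clstr = w.clusters.create(
    cluster_name="datacamp-cluster-name",
    spark_version=latest,
    autotermination_minutes=20,
    num_workers=3,
).result()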

¹ https://databricks-sdk-py.readthedocs.io/en/latest/workspace/compute/clusters.html
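
In many workspaces the create call also needs a node type (or an instance pool), which the slide omits. The sketch below wraps cluster creation in a reusable function and looks up both the runtime version and a node type through the SDK's `select_spark_version` and `select_node_type` helpers. The function name `create_cluster` and its defaults are assumptions made for illustration, and `w` is expected to be a `WorkspaceClient`.

```python
def create_cluster(w, name: str, num_workers: int = 3,
                   autotermination_minutes: int = 20):
    """Create a cluster and block until it is running.

    `w` is assumed to be a databricks.sdk.WorkspaceClient.
    """
    # Ask the workspace for the newest long-term-support runtime
    spark_version = w.clusters.select_spark_version(
        latest=True, long_term_support=True
    )
    # Ask the workspace for a node type that has a local disk
    node_type_id = w.clusters.select_node_type(local_disk=True)

    # create() returns a waiter; result() blocks until the cluster is up
    return w.clusters.create(
        cluster_name=name,
        spark_version=spark_version,
        node_type_id=node_type_id,
        autotermination_minutes=autotermination_minutes,
        num_workers=num_workers,
    ).result()
```

With a configured workspace this could be called as `create_cluster(WorkspaceClient(), "datacamp-cluster-name")`. Whether `node_type_id` is required depends on the workspace and any cluster policy in force.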

List clusters

from databricks.sdk import WorkspaceClient
# Instantiate WorkspaceClient
w = WorkspaceClient()

# Print id of each cluster in workspace
clusters = w.clusters.list()
for cluster in clusters:
    print(f"ClusterId={cluster.cluster_id}")

Output:

ClusterId=0113-13328-woj98c32
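
Listing becomes more useful once it is filtered. As a minimal sketch, the helper below returns only the clusters in a given state. The name `clusters_in_state` is hypothetical, `w` is assumed to be a `WorkspaceClient`, and the comparison is done on the string value of the state so that it works with the SDK's `State` enum.

```python
def clusters_in_state(w, state_name: str = "RUNNING") -> list:
    """Return (cluster_id, cluster_name) pairs for clusters in one state."""
    matches = []
    for cluster in w.clusters.list():
        # cluster.state is an enum in the SDK; compare on its string value
        current = getattr(cluster.state, "value", cluster.state)
        if current == state_name:
            matches.append((cluster.cluster_id, cluster.cluster_name))
    return matches
```

For example, `clusters_in_state(w, "TERMINATED")` would list clusters that are currently stopped.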

Start a cluster

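from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import DatabricksError
import os

w = WorkspaceClient()
cluster_id = os.environ["DATABRICKS_CLUSTER_ID"]

# Start the cluster with the id stored in cluster_id and wait until it is running
try:
    w.clusters.start(cluster_id=cluster_id).result()
except DatabricksError as err:
    # The call can fail for several reasons, for example if the cluster is already running
    print(f"Cannot start cluster_id={cluster_id}: {err}")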
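
An alternative to catching the error is to check the state first and start the cluster only when it is stopped. The sketch below shows that idea. The function name `start_if_stopped` and its return strings are hypothetical, and `w` is assumed to be a `WorkspaceClient`.

```python
def start_if_stopped(w, cluster_id: str) -> str:
    """Start the cluster only when it is currently terminated."""
    info = w.clusters.get(cluster_id=cluster_id)
    # info.state is an enum in the SDK; compare on its string value
    state = getattr(info.state, "value", info.state)
    if state == "TERMINATED":
        # start() returns a waiter; result() blocks until the cluster runs
        w.clusters.start(cluster_id=cluster_id).result()
        return "started"
    return f"left as is ({state})"
```

This avoids an unnecessary API error for a running cluster, at the cost of one extra `get` call.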

Check the state of a cluster

from databricks.sdk import WorkspaceClient
import os

w = WorkspaceClient()

# Print state of cluster
cluster_info = w.clusters.get(cluster_id=os.environ["DATABRICKS_CLUSTER_ID"])
print(f"cluster state={cluster_info.state}")

Output:

cluster state=State.RUNNING
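
A single state check is a snapshot. When a script has to wait for a particular state, one option is a simple polling loop around `w.clusters.get`, sketched below. The function name `wait_for_state`, its default timings, and the injectable `sleep` parameter are assumptions for illustration. The `.result()` calls used elsewhere in these slides already wait on your behalf, so a manual loop like this is mainly useful for custom logging or timeouts.

```python
import time


def wait_for_state(w, cluster_id: str, target: str = "RUNNING",
                   timeout_s: float = 600, poll_s: float = 15,
                   sleep=time.sleep) -> str:
    """Poll the cluster until it reaches `target` or the timeout expires."""
    waited = 0.0
    while True:
        info = w.clusters.get(cluster_id=cluster_id)
        # info.state is an enum in the SDK; compare on its string value
        state = getattr(info.state, "value", info.state)
        if state == target:
            return state
        if waited >= timeout_s:
            raise TimeoutError(
                f"cluster {cluster_id} is {state}, not {target}, "
                f"after {waited:.0f}s"
            )
        sleep(poll_s)
        waited += poll_s
```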

Delete a cluster

from databricks.sdk import WorkspaceClient
import os

w = WorkspaceClient()
cluster_id = os.environ["DATABRICKS_CLUSTER_ID"]

# Delete (terminate) the Databricks cluster whose id is stored in an environment variable,
# and wait until termination completes
w.clusters.delete(cluster_id=cluster_id).result()

# Confirm the new state
cluster_info = w.clusters.get(cluster_id=cluster_id)
print(f"cluster state={cluster_info.state}")

Output:

cluster state=State.TERMINATED
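
The `TERMINATED` state in the output reflects that `delete` stops the cluster but keeps its configuration, so it can be started again. Removing it entirely is a separate call, `permanent_delete`. The sketch below combines both. The function name `remove_cluster` and its return strings are hypothetical, and `w` is assumed to be a `WorkspaceClient`.

```python
def remove_cluster(w, cluster_id: str, permanent: bool = False) -> str:
    """Terminate a cluster and optionally remove it for good."""
    # delete() terminates the cluster; result() waits until it has stopped
    w.clusters.delete(cluster_id=cluster_id).result()
    if permanent:
        # permanent_delete() removes the cluster and its configuration
        w.clusters.permanent_delete(cluster_id=cluster_id)
        return "permanently deleted"
    return "terminated"
```

Keeping `permanent=False` as the default is a cautious choice here, since a permanently deleted cluster cannot be restarted.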

Let's practice!
