Cluster creation and management

Databricks with the Python SDK

Avi Steinberg

Senior Software Engineer

Serverless vs. managed infrastructure

Serverless

  • Run on infra fully managed by Databricks
  • Focus on code instead of infrastructure
  • Pay for compute on-demand

Managed

  • More control over configuration
  • More cost-effective for long-running jobs or predictable workloads
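
The trade-off in the two lists above can be made concrete with a little arithmetic. The sketch below is illustrative only: both hourly rates are hypothetical placeholders, not Databricks prices, and the billing model (serverless billed for busy hours, managed billed for every hour the cluster is up) is a simplifying assumption.

```python
def monthly_cost(rate_per_hour: float, billed_hours: float) -> float:
    """Cost of compute billed at a flat hourly rate."""
    return rate_per_hour * billed_hours


# Hypothetical rates, chosen only to illustrate the trade-off.
SERVERLESS_RATE = 1.00  # per hour, assumed billed only while code runs
MANAGED_RATE = 0.60     # per hour, assumed billed while the cluster is up


def cheaper_option(busy_hours: float, cluster_up_hours: float) -> str:
    """Compare the two models for one month of usage."""
    serverless = monthly_cost(SERVERLESS_RATE, busy_hours)
    managed = monthly_cost(MANAGED_RATE, cluster_up_hours)
    return "serverless" if serverless < managed else "managed"
```

Under these assumed rates, a workload that is busy 10 hours a month favors serverless, while one that is busy nearly around the clock favors a managed cluster.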

Create a Databricks Spark cluster

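from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Pick the most recent Databricks Runtime version available in the workspace
latest = w.clusters.select_spark_version(latest=True)

# Create the cluster and block until it is running
clstr = w.clusters.create(
    cluster_name="datacamp-cluster-name",
    spark_version=latest,
    autotermination_minutes=20,
    num_workers=3,
).result()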
1 https://databricks-sdk-py.readthedocs.io/en/latest/workspace/compute/clusters.html
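
Cluster creation can be wrapped in a small helper so the runtime version and node type are looked up rather than hard-coded. This is a minimal sketch, not the course's method: the function name `create_small_cluster` and its defaults are hypothetical, and it assumes the workspace accepts a `node_type_id` chosen by `select_node_type`. The client is passed in, so the helper itself needs no imports.

```python
def create_small_cluster(w, name: str, workers: int = 1, idle_minutes: int = 20):
    """Create a cluster and block until it is running.

    `w` is expected to be a databricks.sdk.WorkspaceClient.
    """
    # Ask the workspace for a long-term-support runtime and a node type
    # with local disk instead of hard-coding either value.
    spark_version = w.clusters.select_spark_version(long_term_support=True)
    node_type_id = w.clusters.select_node_type(local_disk=True)

    return w.clusters.create(
        cluster_name=name,
        spark_version=spark_version,
        node_type_id=node_type_id,
        autotermination_minutes=idle_minutes,
        num_workers=workers,
    ).result()
```

With a configured client, a call such as `create_small_cluster(WorkspaceClient(), "datacamp-cluster-name", workers=3)` would return the details of the running cluster.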

List clusters

from databricks.sdk import WorkspaceClient
# Instantiate WorkspaceClient
w = WorkspaceClient()

# Print id of each cluster in workspace
clusters = w.clusters.list()
for cluster in clusters:
    print(f"ClusterId={cluster.cluster_id}")

Output:

ClusterId=0113-13328-woj98c32
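
The same listing call can be filtered, for example to find only the clusters that are currently running. The helper below is a sketch under the assumption that each item's `state` is an enum member such as `State.RUNNING` (or `None`); the name `cluster_ids_in_state` is hypothetical.

```python
def cluster_ids_in_state(w, state_name: str = "RUNNING") -> list:
    """Return the ids of clusters whose state has the given name.

    `w` is expected to be a databricks.sdk.WorkspaceClient.
    """
    ids = []
    for cluster in w.clusters.list():
        # cluster.state is an enum member (for example State.RUNNING);
        # it may be None, so guard before reading its name.
        if cluster.state is not None and cluster.state.name == state_name:
            ids.append(cluster.cluster_id)
    return ids
```

Calling `cluster_ids_in_state(w, "TERMINATED")` would list the clusters that could be started again.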

Start a cluster

from databricks.sdk import WorkspaceClient
import os

w = WorkspaceClient()
cluster_id=os.environ["DATABRICKS_CLUSTER_ID"]

# Start cluster with id stored in cluster_id variable
try:
    w.clusters.start(cluster_id=cluster_id).result()
except Exception as err:
    # The call can fail for several reasons (for example, the cluster is already running)
    print(f"Cannot start cluster_id={cluster_id}: {err}")
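
An alternative to catching the error is to check the state first and start the cluster only when it is terminated. The helper below is a sketch, with the hypothetical name `start_if_terminated`; it assumes `state` is an enum member whose name is `TERMINATED` for a stopped cluster.

```python
def start_if_terminated(w, cluster_id: str) -> bool:
    """Start the cluster only when it is terminated.

    `w` is expected to be a databricks.sdk.WorkspaceClient.
    Returns True if a start was requested, False otherwise.
    """
    state = w.clusters.get(cluster_id=cluster_id).state
    if state is None or state.name != "TERMINATED":
        # Already running, starting up, or in some other state: do nothing.
        return False
    # Request the start and block until the cluster is running.
    w.clusters.start(cluster_id=cluster_id).result()
    return True
```

The return value lets the caller tell whether a start was actually issued, which keeps the check-then-act logic explicit.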

Check the state of a cluster

from databricks.sdk import WorkspaceClient
import os

w = WorkspaceClient()

# Print state of cluster
cluster_info = w.clusters.get(cluster_id=os.environ["DATABRICKS_CLUSTER_ID"])
print(f"cluster state={cluster_info.state}")

Output:

cluster state=State.RUNNING
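
Checking the state once is often not enough, because a cluster passes through intermediate states while starting or stopping. The sketch below polls `w.clusters.get` until a target state is reached or a timeout expires. The function name `wait_for_state`, its default timings, and the injectable `sleep` parameter are all hypothetical choices for illustration.

```python
import time


def wait_for_state(w, cluster_id, target="RUNNING",
                   timeout_s=600.0, poll_s=10.0, sleep=time.sleep):
    """Poll the cluster state until it matches `target` or time runs out.

    `w` is expected to be a databricks.sdk.WorkspaceClient.
    Returns True when the target state is reached, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = w.clusters.get(cluster_id=cluster_id).state
        if state is not None and state.name == target:
            return True
        if time.monotonic() >= deadline:
            return False
        # Wait before asking again to avoid hammering the API.
        sleep(poll_s)
```

Passing `sleep` as a parameter makes the loop easy to exercise in tests without real delays.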

Delete a cluster

from databricks.sdk import WorkspaceClient
import os

w = WorkspaceClient()
# Terminate the Databricks cluster whose id is stored in an environment variable,
# and wait for termination to finish
w.clusters.delete(cluster_id=os.environ["DATABRICKS_CLUSTER_ID"]).result()

# Confirm the new state
cluster_info = w.clusters.get(cluster_id=os.environ["DATABRICKS_CLUSTER_ID"])
print(f"cluster state={cluster_info.state}")

Output:

cluster state=State.TERMINATED
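
In the SDK, `clusters.delete` terminates a cluster but keeps its configuration, which is why its state can still be queried afterwards. Removing the cluster entirely is a separate call, `clusters.permanent_delete`. The helper below sketches both paths behind one hypothetical function, `remove_cluster`.

```python
def remove_cluster(w, cluster_id: str, permanent: bool = False) -> None:
    """Terminate a cluster, or remove it for good when `permanent` is True.

    `w` is expected to be a databricks.sdk.WorkspaceClient.
    """
    if permanent:
        # Removes the cluster and its configuration; this cannot be undone.
        w.clusters.permanent_delete(cluster_id=cluster_id)
    else:
        # Terminates the cluster and waits until it reaches TERMINATED.
        w.clusters.delete(cluster_id=cluster_id).result()
```

Defaulting `permanent` to `False` keeps the reversible operation as the easy path.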

Let's practice!
