Using processes and threads

Parallel Programming with Dask in Python

James Fulton

Climate Informatics Researcher

Dask default scheduler

Threads

  • Dask arrays
  • Dask DataFrames
  • Delayed pipelines created with dask.delayed()

Processes

  • Dask bags

Choosing the scheduler

# Use default
result = x.compute()
(result,) = dask.compute(x)  # dask.compute() returns a tuple

# Use threads
result = x.compute(scheduler='threads')
(result,) = dask.compute(x, scheduler='threads')

# Use processes
result = x.compute(scheduler='processes')
(result,) = dask.compute(x, scheduler='processes')
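
The scheduler options above can be tried on a small delayed pipeline. This is a minimal sketch: `square` and `total` are hypothetical names introduced for illustration, and only the threaded scheduler is exercised. Swapping in 'processes' should work the same way, though on platforms that spawn new processes the call usually needs to sit under an `if __name__ == "__main__":` guard.

```python
import dask


@dask.delayed
def square(n):
    # Hypothetical toy task; any pure function would do here.
    return n * n


# Build a lazy pipeline: nothing runs until compute is called.
total = dask.delayed(sum)([square(n) for n in range(5)])

# Method form: returns the computed value directly.
by_method = total.compute(scheduler="threads")

# Function form: accepts several objects and returns a tuple of results.
(by_function,) = dask.compute(total, scheduler="threads")

# The scheduler can also be set for a whole block of code.
with dask.config.set(scheduler="threads"):
    by_config = total.compute()

print(by_method, by_function, by_config)  # 30 30 30
```

The `dask.config.set()` form is convenient when several `compute()` calls should share one scheduler without repeating the keyword.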

Recap - threads vs. processes

Threads

  • Are very fast to initiate
  • No need to transfer data to them
  • Are limited by the GIL (global interpreter lock), which allows only one thread to execute Python code at a time

Processes

  • Take time to set up
  • Are slow to transfer data to
  • Each has its own GIL, so they don't need to take turns executing Python code
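
The data-transfer difference in the recap can be illustrated with the standard library alone. This is a rough sketch rather than a benchmark: `data` and `same_object` are hypothetical names, and `pickle` stands in for the serialization step that Python's multiprocessing performs when it sends arguments to another process.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

data = list(range(100_000))


def same_object(obj):
    # A thread receives a reference to the very same object; nothing is copied.
    return obj is data


with ThreadPoolExecutor(max_workers=2) as pool:
    shared = pool.submit(same_object, data).result()

# Sending data to another process means serializing it and rebuilding it there.
payload = pickle.dumps(data)
copy = pickle.loads(payload)

print(shared)        # True: the thread saw the original object
print(copy is data)  # False: the round trip produced a new object
print(copy == data)  # True: the contents are the same
print(len(payload))  # number of bytes that would have to be transferred
```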

Creating a local cluster

from dask.distributed import LocalCluster

cluster = LocalCluster(
    processes=True, 
    n_workers=2,
    threads_per_worker=2
)

print(cluster)
LocalCluster(..., workers=2, threads=4, memory=31.38 GiB)

Creating a local cluster

from dask.distributed import LocalCluster

cluster = LocalCluster(
    processes=False, 
    n_workers=2,
    threads_per_worker=2
)

print(cluster)
LocalCluster(..., workers=2, threads=4, memory=31.38 GiB)

Simple local cluster

cluster = LocalCluster(processes=True)

print(cluster)
LocalCluster(..., workers=4, threads=8, memory=31.38 GiB)
cluster = LocalCluster(processes=False)

print(cluster)
LocalCluster(..., workers=1, threads=8, memory=31.38 GiB)

Creating a client

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
    processes=True, 
    n_workers=4,
    threads_per_worker=2
)

client = Client(cluster)
print(client)
<Client: 'tcp://127.0.0.1:61391' processes=4 threads=8, memory=31.38 GiB>

Creating a client easily

Create a cluster, then pass it to the client

cluster = LocalCluster(
    processes=True, 
    n_workers=4,
    threads_per_worker=2
)

client = Client(cluster)

print(client)
<Client: ... processes=4 threads=8, ...>

Create a client, which will create its own cluster

client = Client(
    processes=True, 
    n_workers=4,
    threads_per_worker=2
)



print(client)
<Client: ... processes=4 threads=8, ...>

Using the cluster

client = Client(processes=True)

# Default uses the client
result = x.compute()

# Can still change to other schedulers
result = x.compute(scheduler='threads')

# Can explicitly use client; client.compute() returns a future
result = client.compute(x).result()
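
The ways of using a client shown above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions: `square` and `total` are hypothetical names, and `processes=False` is chosen only so the example runs in a single process without an `if __name__ == "__main__":` guard. Using the client as a context manager is one way to make sure the cluster it created is shut down afterwards.

```python
import dask
from dask.distributed import Client


@dask.delayed
def square(n):
    # Hypothetical toy task used only for illustration.
    return n * n


total = dask.delayed(sum)([square(n) for n in range(5)])

# The client creates its own local cluster and closes it when the block exits.
with Client(processes=False, n_workers=1, threads_per_worker=2) as client:
    # While a client is active, compute() uses it by default.
    via_default = total.compute()

    # client.compute() returns a future; .result() waits for the value.
    via_future = client.compute(total).result()

    # Another scheduler can still be requested explicitly.
    via_threads = total.compute(scheduler="threads")

print(via_default, via_future, via_threads)  # 30 30 30
```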

Other kinds of cluster

  • LocalCluster() - A cluster on your computer.
  • Other cluster types split computation across different computers

Let's practice!
