What is parallel computing

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Idea behind parallel computing

Basis of modern data processing tools

Memory
Processing power

Idea

Split task into subtasks
Distribute subtasks over several computers
Work together to finish task

Diagram of task being split into subtasks
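
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: threads stand in for separate computers, and the helper names `split`, `subtask`, and `run_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_parts):
    """Split a list into n_parts subtasks by dealing items out round-robin."""
    return [items[i::n_parts] for i in range(n_parts)]


def subtask(chunk):
    """Work one worker does on its own: sum the squares in its chunk."""
    return sum(x * x for x in chunk)


def run_in_parallel(items, n_workers=4):
    # 1. Split the task into subtasks.
    chunks = split(items, n_workers)
    # 2. Distribute the subtasks (threads stand in for computers here).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(subtask, chunks))
    # 3. Combine the partial results to finish the task.
    return sum(partial_results)


total = run_in_parallel(list(range(1, 101)))
print(total)
```

The combine step works here because a sum can be assembled from partial sums; not every task splits this cleanly.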

The tailor shop

Diagram representing the tailor shop

Running a tailor shop

Goal: 100 shirts

Best tailor: 1 shirt per 20 minutes (3 shirts per hour)
Other tailors: 1 shirt per hour

Four or more regular tailors working together > best tailor
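
The arithmetic behind the tailor shop can be checked directly. The sketch below assumes perfect parallelism with no coordination overhead; the function name `hours_needed` is made up for illustration.

```python
GOAL_SHIRTS = 100
BEST_TAILOR_RATE = 3   # shirts per hour (one shirt every 20 minutes)
OTHER_TAILOR_RATE = 1  # shirts per hour


def hours_needed(n_tailors, rate_per_tailor, goal=GOAL_SHIRTS):
    """Hours to reach the goal, assuming tailors never wait on each other."""
    return goal / (n_tailors * rate_per_tailor)


best_alone = hours_needed(1, BEST_TAILOR_RATE)
print(f"Best tailor alone: {best_alone:.1f} h")

for n in (2, 3, 4, 5):
    team = hours_needed(n, OTHER_TAILOR_RATE)
    verdict = "faster" if team < best_alone else "not faster"
    print(f"{n} regular tailors: {team:.1f} h ({verdict})")
```

Three regular tailors only match the best tailor; the team pulls ahead from the fourth tailor on.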

Benefits of parallel computing

Processing power
Memory: partition the dataset

RAM memory chip: Image of RAM memory chip
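
The memory benefit can be made concrete: once the dataset is partitioned, each worker only needs to hold its own slice. The sketch below uses a hypothetical 10-row dataset and a made-up helper `iter_partitions`.

```python
def iter_partitions(rows, n_partitions):
    """Yield contiguous slices of rows, sized as evenly as possible."""
    size, extra = divmod(len(rows), n_partitions)
    start = 0
    for i in range(n_partitions):
        # The first `extra` partitions take one additional row.
        end = start + size + (1 if i < extra else 0)
        yield rows[start:end]
        start = end


# Hypothetical dataset: 10 rows of (year, age).
rows = [(1896 + 4 * i, 20 + i) for i in range(10)]

partition_sizes = [len(part) for part in iter_partitions(rows, 4)]
peak_rows_per_worker = max(partition_sizes)

print(partition_sizes)
print(f"Rows held by one worker at most: {peak_rows_per_worker} of {len(rows)}")
```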

Risks of parallel computing

Overhead due to communication between processing units

Task needs to be large enough to outweigh the overhead
Need several processing units

Parallel slowdown: Plot showing parallel slowdown
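
One way to see why adding units can make things slower is a toy cost model. The formula below is an assumption chosen for illustration, not taken from the slides: the work divides evenly across units, while communication cost grows linearly with the number of units.

```python
def estimated_runtime(n_units, work=100.0, comm_cost=1.0):
    """Toy model (an assumption): runtime = work / n + comm_cost * (n - 1)."""
    return work / n_units + comm_cost * (n_units - 1)


runtimes = {n: estimated_runtime(n) for n in (1, 2, 4, 8, 16, 32)}
best_n = min(runtimes, key=runtimes.get)

for n, t in runtimes.items():
    print(f"{n:>2} units: {t:.2f}")
print(f"Fastest configuration in this model: {best_n} units")
```

In this model the runtime falls at first, then rises again once communication dominates. With a much smaller `work` value the single-unit run is fastest, which is why the task needs to be large.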

An example

Diagram illustrating Olympic events example

multiprocessing.Pool

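import pandas as pd
from multiprocessing import Pool


# athlete_events is assumed to be an already loaded pandas DataFrame
# with "Year" and "Age" columns.
def take_mean_age(year_and_group):
    # groupby yields (year, sub-DataFrame) tuples
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])


# Distribute the per-year groups over 4 worker processes.
# In a script, put this part under `if __name__ == "__main__":`
# (required on platforms that spawn worker processes).
with Pool(4) as p:
    results = p.map(take_mean_age, athlete_events.groupby("Year"))


# Combine the per-year results into a single DataFrame
result_df = pd.concat(results)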
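
To see what the pool computes, it can help to write the same group, map, and combine steps sequentially in plain Python. The records below are hypothetical stand-ins for `athlete_events` rows, and `group_by_year` is a made-up helper; `Pool.map` plays the role of the built-in `map` here, spread over several processes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records standing in for athlete_events rows: (year, age).
records = [(1896, 24), (1896, 30), (1900, 22), (1900, 26), (1900, 27)]


def group_by_year(rows):
    """Return a sorted list of (year, [ages]) pairs, like iterating a groupby."""
    groups = defaultdict(list)
    for year, age in rows:
        groups[year].append(age)
    return sorted(groups.items())


def take_mean_age(year_and_ages):
    """Subtask applied to one group: the mean age for that year."""
    year, ages = year_and_ages
    return year, mean(ages)


# Sequential map over the groups, then combine into one result.
results = dict(map(take_mean_age, group_by_year(records)))
print(results)
```

A sequential version like this is also a handy baseline for checking that a parallel run returns the same values.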

dask

import dask.dataframe as dd


# Partition the DataFrame into 4 partitions
athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)


# Run parallel computations on each partition
result_df = athlete_events_dask.groupby('Year').Age.mean().compute()
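
A grouped mean over partitions cannot be obtained by averaging each partition's mean, because partitions may hold different numbers of rows per group. A common approach is to compute a sum and a count per partition and combine those. The sketch below illustrates that idea in plain Python with hypothetical data; it is a conceptual illustration, not a description of Dask's internals.

```python
from collections import defaultdict

# Hypothetical (year, age) rows, already split into two partitions.
partition_a = [(1896, 20), (1896, 30), (1900, 25)]
partition_b = [(1896, 40), (1900, 35)]


def partial_aggregate(rows):
    """Per-partition step: [sum of ages, row count] for each year."""
    acc = defaultdict(lambda: [0, 0])
    for year, age in rows:
        acc[year][0] += age
        acc[year][1] += 1
    return acc


def combine(partials):
    """Final step: merge the partial sums and counts, then divide."""
    totals = defaultdict(lambda: [0, 0])
    for part in partials:
        for year, (age_sum, count) in part.items():
            totals[year][0] += age_sum
            totals[year][1] += count
    return {year: age_sum / count for year, (age_sum, count) in totals.items()}


means = combine([partial_aggregate(partition_a), partial_aggregate(partition_b)])
print(means)
```

For 1896, averaging the two partition means (25 and 40) would give 32.5, while the combined sum and count give the correct 30.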

Let's practice!
