What is parallel computing?

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Idea behind parallel computing

Basis of modern data processing tools, motivated by two constraints:

  • Memory
  • Processing power

Idea

  • Split task into subtasks
  • Distribute subtasks over several computers
  • Work together to finish task

Diagram of task being split into subtasks
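The three steps above can be sketched on a single machine, with worker processes standing in for separate computers. This is a minimal sketch under stated assumptions: the helper names (`split`, `subtask`, `run_in_parallel`) and the sum-of-squares workload are hypothetical and only illustrate the split, distribute, combine pattern.

```python
from concurrent.futures import ProcessPoolExecutor


def split(items, n_parts):
    """Split a list into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def subtask(chunk):
    """Work done independently on one chunk (hypothetical: sum of squares)."""
    return sum(x * x for x in chunk)


def run_in_parallel(items, n_workers=4):
    """Send one chunk to each worker process, then combine the partial results."""
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        partial_results = list(executor.map(subtask, split(items, n_workers)))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_in_parallel(list(range(1_000))))
```

The subtasks here never need to talk to each other, which is what makes the split straightforward; the only coordination is the final `sum` over the partial results.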

The tailor shop

Diagram representing the tailor shop

Running a tailor shop

Goal: 100 shirts

  • Best tailor finishes a shirt in 20 minutes
  • Other tailors take 1 hour per shirt

 

Multiple tailors working in parallel > best tailor working alone
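The tailor-shop numbers can be checked with a little arithmetic. The sketch below assumes the batch is split evenly and that the slowest share sets the finishing time; the function name `hours_to_finish` and the team size of 4 are assumptions for illustration, since the slide does not say how many tailors are available.

```python
import math


def hours_to_finish(shirts, minutes_per_shirt, tailors=1):
    """Hours needed when the batch is split evenly across identical tailors."""
    # The tailor with the largest share determines when the batch is done.
    shirts_per_tailor = math.ceil(shirts / tailors)
    return shirts_per_tailor * minutes_per_shirt / 60


best_alone = hours_to_finish(100, 20)               # best tailor, 20 min per shirt
four_regular = hours_to_finish(100, 60, tailors=4)  # assumed team of 4, 1 hour per shirt

print(f"Best tailor alone: {best_alone:.1f} hours")
print(f"Four regular tailors: {four_regular:.1f} hours")
```

Under these assumptions the best tailor needs about 33.3 hours, while four regular tailors need 25 hours. Three regular tailors would need 34 hours, so the team only wins once it has at least four members.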

Benefits of parallel computing

  • Processing power
  • Memory: partition the dataset

 

RAM memory chip: Image of RAM memory chip
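The memory benefit comes from each worker holding only its own partition. A minimal sketch, assuming a mean is the quantity of interest: each partition reports a partial (sum, count) pair, and the pairs are combined at the end. The function names and the sample ages are hypothetical.

```python
def partition(rows, n_partitions):
    """Assign rows round-robin to n_partitions lists."""
    parts = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        parts[i % n_partitions].append(row)
    return parts


def partial_sum_count(part):
    """Summary computed from one partition only."""
    return sum(part), len(part)


def combined_mean(parts):
    """Combine the per-partition summaries into the overall mean."""
    summaries = [partial_sum_count(part) for part in parts]
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count


ages = [20, 25, 30, 35, 40]  # hypothetical sample data
print(combined_mean(partition(ages, 2)))
```

Averaging the per-partition means directly would be wrong when partitions differ in size, which is why the sketch carries the counts along with the sums.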

Risks of parallel computing

Overhead due to communication

 

  • Task needs to be large enough to outweigh the overhead
  • Need several processing units to see a benefit

 

Parallel slowdown: Plot showing parallel slowdown
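Parallel slowdown can be illustrated with a toy cost model. This is an assumption made for illustration, not a measured result: the work divides perfectly across workers, and every extra worker adds a fixed communication cost. The function names and the numbers are hypothetical.

```python
def model_time(work, overhead_per_worker, n_workers):
    """Toy model: perfectly divided work plus a fixed cost per extra worker."""
    return work / n_workers + overhead_per_worker * (n_workers - 1)


def best_worker_count(work, overhead_per_worker, max_workers):
    """Worker count with the lowest modelled time, found by brute force."""
    return min(
        range(1, max_workers + 1),
        key=lambda n: model_time(work, overhead_per_worker, n),
    )


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20, 50):
        print(n, round(model_time(100.0, 1.0, n), 2))
```

With 100 units of work and an overhead of 1 unit per extra worker, the modelled time drops from 100 to 19 at 10 workers and then climbs back to 51 at 50 workers. With only 1 unit of work, two workers (1.5) are already slower than one (1.0), which is the sense in which the task needs to be large.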

An example

 

Diagram illustrating Olympic events example

multiprocessing.Pool

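from multiprocessing import Pool

import pandas as pd

# athlete_events is assumed to be a pandas DataFrame with "Year" and "Age" columns.

def take_mean_age(year_and_group):
    # Each item yielded by groupby is a (year, sub-DataFrame) tuple.
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

# Distribute the per-year groups over 4 worker processes.
with Pool(4) as p:
    results = p.map(take_mean_age, athlete_events.groupby("Year"))

# Combine the per-year results into a single DataFrame.
result_df = pd.concat(results)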
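For trying the same `Pool.map` pattern without the Olympic dataset, here is a standard-library-only sketch. The sample rows in `SAMPLE_EVENTS` are hypothetical and stand in for `athlete_events`; plain (year, ages) tuples replace the pandas groups. The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the script.

```python
from collections import defaultdict
from multiprocessing import Pool

# Hypothetical (year, age) rows standing in for athlete_events.
SAMPLE_EVENTS = [(1996, 24), (1996, 30), (2000, 22), (2000, 26), (2004, 28)]


def group_by_year(events):
    """Collect ages per year, returning a sorted list of (year, ages) pairs."""
    groups = defaultdict(list)
    for year, age in events:
        groups[year].append(age)
    return sorted(groups.items())


def take_mean_age(year_and_ages):
    """Subtask run in a worker: the mean age for one year."""
    year, ages = year_and_ages
    return year, sum(ages) / len(ages)


def mean_age_by_year(events, processes=4):
    """Map the per-year subtasks over a pool of worker processes."""
    with Pool(processes) as p:
        return dict(p.map(take_mean_age, group_by_year(events)))


if __name__ == "__main__":
    print(mean_age_by_year(SAMPLE_EVENTS))
```

With so few rows the process start-up cost dominates, so this version is only useful for seeing the mechanics, not for measuring a speedup.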

dask

 

import dask.dataframe as dd

# Partition dataframe into 4
athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)

# Run parallel computations on each partition
result_df = athlete_events_dask.groupby('Year').Age.mean().compute()

Let's practice!
