Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
Basis of modern data processing tools
Idea
Running a tailor shop
Goal: 100 shirts
Multiple tailors working together > best tailor
[Image: RAM memory chip]
Overhead due to communication
[Figure: parallel slowdown]
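To see why communication overhead can lead to parallel slowdown, here is a toy cost model for the tailor shop. It is a sketch under made-up assumptions, not a measurement: the function `total_minutes` and the numbers `minutes_per_shirt=30` and `coordination_minutes=20` are hypothetical and do not come from the slides. The sewing time shrinks as tailors are added, while the coordination cost grows with every extra tailor.

```python
def total_minutes(tailors, shirts=100, minutes_per_shirt=30, coordination_minutes=20):
    """Hypothetical cost model for finishing all shirts with a given number of tailors."""
    # The sewing work is split evenly across the tailors.
    sewing = shirts * minutes_per_shirt / tailors
    # Assumption: every additional tailor adds a fixed coordination cost.
    coordination = coordination_minutes * (tailors - 1)
    return sewing + coordination


# Total time for 1 to 30 tailors.
timings = {n: total_minutes(n) for n in range(1, 31)}

# Team size with the lowest total time under this model.
best = min(timings, key=timings.get)

print(f"1 tailor: {timings[1]:.0f} min")
print(f"{best} tailors: {timings[best]:.0f} min")
print(f"30 tailors: {timings[30]:.0f} min")
```

Under these assumed numbers the total time falls at first and then rises again once coordination outweighs the saved sewing time, which is the shape that parallel slowdown describes.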
multiprocessing.Pool
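from multiprocessing import Pool
import pandas as pd

def take_mean_age(year_and_group):
    # Each item produced by groupby is a (year, group) tuple
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

# Spread the per-year groups over 4 worker processes
with Pool(4) as p:
    results = p.map(take_mean_age, athlete_events.groupby("Year"))

# Combine the per-year results into one dataframe
result_df = pd.concat(results)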
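It may help to see what each worker receives. The sketch below runs the same function with Python's built-in `map` instead of `Pool.map`, so it is sequential and only illustrates the data flow. The small `athlete_events` dataframe here is a hypothetical stand-in, since the slides do not show the real data.

```python
import pandas as pd

# Hypothetical stand-in for athlete_events (the real dataset is not shown in the slides).
athlete_events = pd.DataFrame({
    "Year": [1996, 1996, 2000, 2000, 2004],
    "Age": [20.0, 30.0, 22.0, 26.0, 31.0],
})


def take_mean_age(year_and_group):
    # Iterating over a groupby object yields (key, group) tuples.
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])


# Sequential version: the built-in map applies the function to each (year, group) tuple.
results = list(map(take_mean_age, athlete_events.groupby("Year")))
result_df = pd.concat(results)

# The same numbers computed directly by pandas, for comparison.
expected = athlete_events.groupby("Year")["Age"].mean()

print(result_df)
```

Because the function takes one tuple and returns one small dataframe, the groups can be processed independently, which is what makes the work easy to hand to a pool of processes.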
dask
import dask.dataframe as dd

# Partition dataframe into 4
athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)

# Run parallel computations on each partition
result_df = athlete_events_dask.groupby('Year').Age.mean().compute()