Pengantar Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
Dasar alat pemrosesan data modern
Gagasan


Menjalankan toko penjahit
Target: 100 kemeja
Banyak penjahit bersama > penjahit terbaik
Cip memori RAM:

Overhead karena komunikasi
Perlambatan paralel:


multiprocessing.Pool
from multiprocessing import Pooldef take_mean_age(year_and_group): year, group = year_and_group return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])with Pool(4) as p: results = p.map(take_mean_age, athlete_events.groupby("Year"))result_df = pd.concat(results)
dask
import dask.dataframe as dd# Membagi dataframe menjadi 4 partisi athlete_events_dask = dd.from_pandas(athlete_events, npartitions = 4)# Jalankan komputasi paralel pada tiap partisi result_df = athlete_events_dask.groupby('Year').Age.mean().compute()
Pengantar Data Engineering