Parallel computation frameworks

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Logo of Apache Hadoop

HDFS

Diagram of HDFS as distributed file system

Logo of Hadoop MapReduce

Diagram illustrating Olympic events example

Logo of Apache Hive

SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year

Hive to MapReduce diagram
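
To make the translation from the Hive query to a MapReduce job more concrete, here is a minimal pure-Python sketch of how an average-age-per-year aggregation can be split into a map step, a shuffle step, and a reduce step. It is an illustration only, not the job Hive actually generates: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample rows are hypothetical and are not taken from the real athlete_events data.

```python
from collections import defaultdict

# Hypothetical sample rows as (year, age); made up for illustration.
records = [
    (1996, 24), (1996, 30),
    (2000, 22), (2000, 28), (2000, 25),
    (2004, 31),
]

def map_phase(rows):
    """Emit one (key, value) pair per row: key = year, value = (age, 1)."""
    for year, age in rows:
        yield year, (age, 1)

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum ages and counts per year, then divide to get the average."""
    result = {}
    for year, values in grouped.items():
        total_age = sum(age for age, _ in values)
        count = sum(n for _, n in values)
        result[year] = total_age / count
    return result

avg_age_by_year = reduce_phase(shuffle(map_phase(records)))
print(avg_age_by_year)
```

Emitting `(age, 1)` rather than a bare age means partial sums and counts from different partitions could be combined before the final division, which is why averages are usually carried through a distributed job as a sum and a count.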

Logo of Apache Spark

# Load the dataset into athlete_events_spark first

(athlete_events_spark
  .groupBy('Year')
  .mean('Age')
  .show())

SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year
