Parallel computation frameworks

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

 

Logo of Apache Hadoop


HDFS

 

Diagram of HDFS as distributed file system


MapReduce

 

Logo of Hadoop MapReduce

 

Diagram illustrating Olympic events example
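The map, shuffle, and reduce phases can be sketched in plain Python. This is only an illustration on one machine, not Hadoop code: the `records` data and the helper names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the choice of averaging age per year is an assumption made to match the Hive query used later in these slides.

```python
from collections import defaultdict

# Hypothetical toy records: (year, age) pairs standing in for athlete events.
records = [(1996, 24), (1996, 30), (2000, 22), (2000, 28), (2000, 25)]

def map_phase(chunk):
    """Emit one (key, value) pair per record: the year, and (age, 1) to count it."""
    return [(year, (age, 1)) for year, age in chunk]

def shuffle(mapped_pairs):
    """Group all emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into the average age."""
    total_age = sum(age for age, _ in values)
    count = sum(n for _, n in values)
    return key, total_age / count

# Split the input into chunks, as if each chunk were processed on a different node.
chunks = [records[:2], records[2:]]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle(mapped)
averages = dict(reduce_phase(key, values) for key, values in grouped.items())
print(averages)
```

Emitting `(age, 1)` rather than the bare age keeps the sum and the count together, so partial results from different chunks could be combined before the final division.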


Hive

 

  • Runs on Hadoop
  • Query language: Hive SQL, a SQL dialect
  • Initially ran queries as MapReduce jobs; now integrates with other data processing tools

Logo of Apache Hive


Hive: an example

 

SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year

 

Hive to MapReduce diagram
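To see what the query returns, it can be tried locally on a few rows. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive, which is an assumption for illustration: the two SQL dialects differ in many respects, the table contents are made up, and an ORDER BY clause is added only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for Hive (assumption).
conn = sqlite3.connect(":memory:")

# Attach a second in-memory database named "views" so that the qualified
# table name views.athlete_events from the slide can be used unchanged.
conn.execute("ATTACH DATABASE ':memory:' AS views")
conn.execute("CREATE TABLE views.athlete_events (year INTEGER, age REAL)")

# Hypothetical sample rows: (year, age).
sample_rows = [(1996, 24), (1996, 30), (2000, 22), (2000, 28), (2000, 25)]
conn.executemany("INSERT INTO views.athlete_events VALUES (?, ?)", sample_rows)

query = """
SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year
ORDER BY year
"""
rows = conn.execute(query).fetchall()
for year, avg_age in rows:
    print(year, avg_age)

conn.close()
```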


Spark

Logo of Apache Spark

  • Avoids disk writes by keeping as much processing as possible in memory
  • Maintained by the Apache Software Foundation

Resilient distributed datasets (RDD)

 

  • Spark relies on them
  • Similar to a list of tuples
  • Transformations, such as .map() or .filter(), return a new RDD
  • Actions, such as .count() or .first(), return a single result
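The difference between transformations and actions can be sketched with a small single-machine class. `ToyRDD` is a hypothetical teaching aid, not Spark's implementation or API: it only mimics the idea that .map() and .filter() record work to be done, while .count() and .first() actually run it. The sample rows are made up.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's implementation)."""

    def __init__(self, source, steps=()):
        self._source = source  # the underlying records, e.g. a list of tuples
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: record the step and return a new ToyRDD; no data is processed.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        """Apply the recorded steps in order, one record at a time."""
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    # Actions: trigger the computation and return a single result.
    def count(self):
        return sum(1 for _ in self._iterate())

    def first(self):
        return next(self._iterate())


# Hypothetical records: (name, age) tuples.
events = ToyRDD([("Alice", 24), ("Bob", 31), ("Chen", 28)])
older = events.filter(lambda row: row[1] > 25).map(lambda row: row[0].upper())

n_older = older.count()      # runs the filter and map over all records
first_older = older.first()  # stops as soon as one record passes the filter
print(n_older, first_older)
```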

PySpark

 

  • Python interface to Spark
  • DataFrame abstraction
  • Looks similar to pandas

PySpark: an example

# Load the dataset into athlete_events_spark first

(athlete_events_spark
  .groupBy('Year')
  .mean('Age')
  .show())

The equivalent Hive SQL query:

SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year

Let's practice!

