Parallel computation frameworks

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

 

Logo of Apache Hadoop

HDFS

 

Diagram of HDFS as distributed file system

MapReduce

 

Logo of Hadoop MapReduce

 

Diagram illustrating Olympic events example
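
The diagram itself is not reproduced here, so the following is only a minimal pure-Python sketch of the MapReduce idea, not Hadoop code. The records, the two "partitions", and the function names map_phase, shuffle, and reduce_phase are all made up for illustration. It computes the average athlete age per year, the same aggregation used in the later Hive and PySpark examples.

```python
from collections import defaultdict

# Hypothetical sample records: (year, age) pairs, split across two partitions
# to mimic data stored on two different nodes.
partitions = [
    [(2012, 24), (2016, 30), (2012, 26)],
    [(2016, 22), (2012, 25)],
]


def map_phase(partition):
    """Emit one (key, value) pair per record: year -> (age, 1)."""
    return [(year, (age, 1)) for year, age in partition]


def shuffle(mapped_partitions):
    """Group all emitted values by key across partitions."""
    grouped = defaultdict(list)
    for mapped in mapped_partitions:
        for key, value in mapped:
            grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Combine the (age, count) pairs for one key into a mean age."""
    total_age = sum(age for age, _ in values)
    count = sum(n for _, n in values)
    return total_age / count


# Each partition is mapped independently; this is the step that could run in parallel.
mapped = [map_phase(p) for p in partitions]
average_age = {year: reduce_phase(vals) for year, vals in shuffle(mapped).items()}
print(average_age)  # {2012: 25.0, 2016: 26.0}
```

Emitting (age, 1) rather than a bare age keeps the per-key values easy to combine: sums and counts can be added up in any order, whereas averages of averages cannot.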

Hive

 

  • Runs on Hadoop
  • Query language: Hive SQL, a variant of SQL
  • Queries initially ran as MapReduce jobs; Hive now also integrates with other data processing tools

Logo of Apache Hive

Hive: an example

 

SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year

 

Hive to MapReduce diagram

Image of Spark Logo

  • Avoids disk writes between processing steps by keeping as much data as possible in memory
  • Maintained by the Apache Software Foundation

Resilient distributed datasets (RDD)

 

  • Spark relies on them
  • Similar to a list of tuples
  • Transformations, such as .map() or .filter(), return a new RDD
  • Actions, such as .count() or .first(), return a single result
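
The difference between transformations and actions can be sketched with a toy class. MiniRDD below is a hypothetical stand-in written for this illustration; it is not the Spark API and does nothing distributed. It only mimics the pattern in which transformations record a step and return a new object, while actions trigger the actual computation. The sample rows are made up.

```python
class MiniRDD:
    """Toy, single-machine stand-in for an RDD (hypothetical, not Spark)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    # Transformations: record the step and return a new MiniRDD. No data is touched yet.
    def map(self, fn):
        return MiniRDD(self._rows, self._steps + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self._rows, self._steps + (("filter", predicate),))

    def _compute(self):
        """Apply the recorded steps to each row, yielding the rows that survive."""
        for row in self._rows:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                yield row

    # Actions: run the recorded steps and return a single result.
    def count(self):
        return sum(1 for _ in self._compute())

    def first(self):
        return next(self._compute())


# Made-up rows shaped like a list of tuples: (name, age).
rows = [("A", 24), ("B", 30), ("C", 17)]

adults = (MiniRDD(rows)
          .filter(lambda r: r[1] >= 18)          # transformation
          .map(lambda r: (r[0], r[1] + 1)))      # transformation

print(adults.count())  # action: 2
print(adults.first())  # action: ('A', 25)
```

Nothing is computed when adults is built; the work happens only when count() or first() is called.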

PySpark

 

  • Python interface to Spark
  • DataFrame abstraction
  • Looks similar to Pandas

PySpark: an example

# Load the dataset into athlete_events_spark first

(athlete_events_spark
  .groupBy('Year')
  .mean('Age')
  .show())

-- Equivalent Hive SQL query
SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year
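
The PySpark example above assumes that athlete_events_spark has already been loaded. One possible way to do that is sketched below, assuming the data is available as a CSV file with a header row. The file name athlete_events.csv, the application name, and the helper load_athlete_events are assumptions made for this illustration; the slides do not say where the data comes from.

```python
def load_athlete_events(path="athlete_events.csv"):
    """Read a CSV file into a Spark DataFrame (the path is a hypothetical example)."""
    # Imported inside the function so that defining it does not require Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("athlete-events").getOrCreate()
    # header=True uses the first row as column names;
    # inferSchema=True asks Spark to guess column types such as integer for Year.
    return spark.read.csv(path, header=True, inferSchema=True)


# Usage, assuming PySpark is installed and the file exists:
# athlete_events_spark = load_athlete_events()
# athlete_events_spark.groupBy('Year').mean('Age').show()
```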

Let's practice!
