Parallel computing frameworks

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

 

Apache Hadoop logo

HDFS

 

Diagram of HDFS as a distributed file system

MapReduce

 

Hadoop MapReduce logo

 

Diagram of the Olympic events example
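
The MapReduce flow in the diagram can be sketched in plain Python. This is a minimal single-machine illustration, not Hadoop code: it assumes the task is the average-age-per-year aggregation used in the Hive example, and the sample records, the two "partitions", and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records as (year, age) pairs, split across two "nodes".
partitions = [
    [(1896, 24), (1896, 28), (1900, 30)],
    [(1900, 20), (1896, 26)],
]


def map_phase(records):
    """Emit one (key, value) pair per record: year -> age."""
    for year, age in records:
        yield year, age


def shuffle(mapped_partitions):
    """Group all values by key across every partition."""
    groups = defaultdict(list)
    for part in mapped_partitions:
        for key, value in part:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values collected for one key into their average."""
    return key, sum(values) / len(values)


mapped = [map_phase(p) for p in partitions]   # runs independently per partition
grouped = shuffle(mapped)                      # brings equal keys together
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)  # {1896: 26.0, 1900: 25.0}
```

In a real cluster the map and reduce steps would run on different machines and the shuffle would move data over the network; here everything happens in one process purely to show the three stages.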

Hive

 

  • Runs on Hadoop
  • Structured Query Language: Hive SQL
  • Initially ran on MapReduce; now integrates with other processing tools as well

Apache Hive logo

Hive: an example

 

SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year

 

Diagram of Hive translating the query into MapReduce jobs
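
To see what the query above returns, it can be run locally against a few rows. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, which is an assumption made only so the example runs without a cluster; the sample rows are hypothetical, and an `ORDER BY` clause has been added to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "views" so the
# schema-qualified table name from the slide works unchanged.
conn.execute("ATTACH DATABASE ':memory:' AS views")
conn.execute("CREATE TABLE views.athlete_events (year INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO views.athlete_events VALUES (?, ?)",
    [(1896, 24), (1896, 28), (1900, 30), (1900, 20)],  # hypothetical rows
)

rows = conn.execute(
    """
    SELECT year, AVG(age)
    FROM views.athlete_events
    GROUP BY year
    ORDER BY year
    """
).fetchall()
print(rows)  # [(1896, 26.0), (1900, 25.0)]
```

SQLite executes the query in a single process, whereas Hive would compile it into distributed jobs; only the SQL and its result are comparable.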

Spark logo

  • Avoids disk writes by keeping as much processing as possible in memory
  • Maintained by the Apache Software Foundation

Resilient Distributed Datasets (RDD)

 

  • Spark relies on them
  • Similar to a list of tuples
  • Transformations: .map() or .filter()
  • Actions: .count() or .first()
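
The difference between transformations and actions can be illustrated with a toy class. `TinyRDD` below is a hypothetical, single-machine stand-in written for this sketch; it is not Spark's API and only borrows the four method names from the list above. The sample tuples are also made up.

```python
class TinyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source):
        # source is a zero-argument callable that returns a fresh iterator
        self._source = source

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new TinyRDD and compute nothing yet.
    def map(self, fn):
        return TinyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return TinyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole pipeline and return a plain Python value.
    def count(self):
        return sum(1 for _ in self._source())

    def first(self):
        return next(self._source())


# A dataset shaped like a list of tuples: (name, year, age).
events = TinyRDD.from_list(
    [("Anna", 1896, 24), ("Ben", 1900, 30), ("Cara", 1900, 20)]
)

# Chaining transformations only builds a recipe; no data is processed here.
ages_1900 = events.filter(lambda e: e[1] == 1900).map(lambda e: e[2])

# Actions trigger the computation.
print(ages_1900.count())  # 2
print(ages_1900.first())  # 30
```

With a real RDD the same chain would be distributed across a cluster, but the lazy-until-action behaviour follows the same idea.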

PySpark

 

  • Python interface to Spark
  • DataFrame abstraction
  • Similar to Pandas

PySpark: an example

# Load the dataset into athlete_events_spark first

(athlete_events_spark
  .groupBy('Year')
  .mean('Age')
  .show())
-- The same aggregation in Hive SQL, for comparison
SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year
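
The PySpark example assumes `athlete_events_spark` has already been loaded. One possible way to do that is sketched below, assuming the data lives in a CSV file with a header row; the file name `athlete_events.csv`, the app name, and the helper function names are hypothetical. PySpark is imported inside `main()` so the helper definitions can be read and exercised without a Spark installation.

```python
def load_athlete_events(spark, path):
    """Read a CSV file into a Spark DataFrame, inferring column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


def mean_age_per_year(athlete_events_spark):
    """Same aggregation as the slide, returned instead of printed."""
    return athlete_events_spark.groupBy("Year").mean("Age")


def main(path="athlete_events.csv"):  # hypothetical file name
    # Requires PySpark to be installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("athlete-events").getOrCreate()
    athlete_events_spark = load_athlete_events(spark, path)
    mean_age_per_year(athlete_events_spark).show()
    spark.stop()
```

Calling `main()` on a machine with PySpark installed and a suitable CSV file would print one row per year with the mean age.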

Let's practice!
