Tools of the data engineer

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Databases

  • Hold large amounts of data
  • Support applications
  • Other databases are used for analyses

Image of a database

Simple product entity
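
As a minimal sketch of what a simple product entity could look like in a database that supports an application, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name `product`, its columns, and the sample rows are assumptions made for illustration; they are not taken from the slide image.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical product entity: the column names are illustrative assumptions.
conn.execute(
    """
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price REAL NOT NULL
    )
    """
)

# Made-up sample rows.
conn.executemany(
    "INSERT INTO product (id, name, price) VALUES (?, ?, ?)",
    [(1, "keyboard", 49.99), (2, "mouse", 19.99)],
)
conn.commit()

# An application would typically look up individual rows like this.
row = conn.execute(
    "SELECT name, price FROM product WHERE id = ?", (2,)
).fetchone()
print(row)
```

A production application would more likely talk to a server database such as MySQL or PostgreSQL; SQLite is used here only because it ships with Python.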

Processing

  • Clean data
  • Aggregate data
  • Join data

Image representing parallel processing
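
The three processing steps listed above can be sketched on a single machine with plain Python. This is only an illustration of what clean, aggregate, and join mean; a framework such as Spark would distribute the same steps across many machines. The `users` and `orders` records and all field names are made-up assumptions.

```python
from collections import defaultdict

# Made-up sample data; all names and values are illustrative assumptions.
users = [
    {"user_id": 1, "country": "BE"},
    {"user_id": 2, "country": "US"},
]
orders = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 1, "amount": None},  # missing amount, to be cleaned out
    {"user_id": 2, "amount": 5.0},
    {"user_id": 2, "amount": 7.5},
]

# Clean: drop orders without an amount.
clean_orders = [o for o in orders if o["amount"] is not None]

# Aggregate: total amount per user.
totals = defaultdict(float)
for o in clean_orders:
    totals[o["user_id"]] += o["amount"]

# Join: attach each user's country to that user's total.
joined = [
    {"user_id": u["user_id"], "country": u["country"], "total": totals[u["user_id"]]}
    for u in users
]
print(joined)
```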

Processing: an example

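from pyspark.sql import SparkSession

# Reuse the active Spark session or start a new one
spark = SparkSession.builder.getOrCreate()

# Load the users table from a Parquet file
df = spark.read.parquet("users.parquet")
# Keep only the rows with an implausibly high age
outliers = df.filter(df["age"] > 100)
print(outliers.count())

The data engineer understands the abstractions.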

Scheduling

  • Plan jobs to run at specific intervals
  • Resolve dependency requirements between jobs

Diagram of cleaning and joining job

JoinProductOrder needs to run after CleanProduct and CleanOrder
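
One way to picture how such dependencies get resolved is a topological sort of the job graph. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) with the three job names from the diagram. It illustrates the idea only and is not a description of how any particular scheduling tool works internally.

```python
from graphlib import TopologicalSorter

# Map each job to the set of jobs it depends on.
dependencies = {
    "CleanProduct": set(),
    "CleanOrder": set(),
    "JoinProductOrder": {"CleanProduct", "CleanOrder"},
}

# static_order() yields every job only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two cleaning jobs do not depend on each other, so their relative order is not fixed; only the position of JoinProductOrder after both of them is guaranteed.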

Existing tools

Databases

MySQL logo

PostgreSQL logo

Processing

Spark logo

Hive logo

Scheduling

Airflow logo

Oozie logo

Tux, the Linux penguin

A data pipeline

An image of an example data pipeline

Let's practice!
