Tools of the data engineer

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

Databases

Image of a database

Simple product entity
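
As a rough illustration of what a simple product entity could look like in a relational database, here is a minimal sketch using Python's built-in sqlite3 module. The table name and the columns (id, name, price) are assumptions made for this example; the slide does not specify them.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind
conn = sqlite3.connect(":memory:")

# Hypothetical product entity; the column names are assumptions
conn.execute(
    "CREATE TABLE product ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO product (name, price) VALUES (?, ?)",
    [("keyboard", 49.0), ("mouse", 19.5)],
)
conn.commit()

# Query the entity: names of products cheaper than 30
cheap = conn.execute(
    "SELECT name FROM product WHERE price < ? ORDER BY name", (30,)
).fetchall()
print(cheap)
```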

Image representing parallel processing

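from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the users dataset, then count the rows with an implausible age
df = spark.read.parquet("users.parquet")

outliers = df.filter(df["age"] > 100)

print(outliers.count())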

The data engineer understands the abstractions.

Diagram of cleaning and joining job

JoinProductOrder needs to run after CleanProduct and CleanOrder
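
The ordering constraint above can be sketched as a small dependency graph. This is not how any particular scheduler implements it; it is a minimal illustration using Python's standard-library graphlib (Python 3.9+), with the job names taken from the slide and the dictionary layout chosen for this example.

```python
from graphlib import TopologicalSorter

# Map each job to the set of jobs that must finish before it starts
dependencies = {
    "CleanProduct": set(),
    "CleanOrder": set(),
    "JoinProductOrder": {"CleanProduct", "CleanOrder"},
}

# static_order() yields the jobs so that every job comes after its dependencies
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The two cleaning jobs do not depend on each other, so a scheduler is free to run them in either order or in parallel; only JoinProductOrder is forced to come last.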

Databases

MySQL logo

PostgreSQL logo

Processing

Spark logo

Hive logo

Scheduling

Airflow logo

Oozie logo

Tux, the Linux penguin

An image of an example data pipeline
