Workflow scheduling frameworks

Introduction to Data Engineering

Vincent Vankrunkelsven

Data Engineer @ DataCamp

An example pipeline

 

Example simple pipeline that extracts from csv using Spark

How to schedule?

  • Manually
  • cron scheduling tool
  • What about dependencies?
Introduction to Data Engineering

DAGs

Directed Acyclic Graph

  • Set of nodes
  • Directed edges
  • No cycles

Example DAG

Introduction to Data Engineering

The tools for the job

 

  • Linux's cron
  • Spotify's Luigi
  • Apache Airflow
Introduction to Data Engineering

Logo of Apache Airflow

  • Created at Airbnb
  • DAGs
  • Python
Introduction to Data Engineering

Airflow: an example DAG

 

Example Airflow DAG

Introduction to Data Engineering

Airflow: an example in code

# Create the DAG object
dag = DAG(dag_id="example_dag", ..., schedule_interval="0 * * * *")

# Define operations start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag) ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag) ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag) enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag = dag)
# Set up dependency flow start_cluster.set_downstream(ingest_customer_data) ingest_customer_data.set_downstream(enrich_customer_data) ingest_product_data.set_downstream(enrich_customer_data)
Introduction to Data Engineering

Let's practice!

Introduction to Data Engineering

Preparing Video For Download...