Airflow scheduling

Introduction to Apache Airflow in Python

Mike Metzger

Data Engineer

DAG Runs

  • A specific instance of a workflow at a point in time
  • Can be run manually or via schedule_interval
  • Maintain state for each workflow and the tasks within
    • running
    • failed
    • success
1 https://airflow.apache.org/docs/stable/scheduler.html

DAG Runs view

[Image: DAG Runs]


DAG Runs state

[Image: DAG Runs State]


Schedule details

When scheduling a DAG, there are several attributes of note:

  • start_date - The date / time to initially schedule the DAG run
  • end_date - Optional attribute for when to stop running new DAG instances
  • max_tries - Optional attribute for how many attempts to make
  • schedule_interval - How often to run
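
As a rough sketch of how these attributes fit together, the snippet below collects them as keyword arguments using only the standard library. The name example_dag and the specific dates are made up for illustration. The attempt count is shown under the task-level retries key in default_args, which is an assumption about how retries are commonly configured rather than something stated on the slide.

```python
from datetime import datetime

# Task-level defaults; 'retries' is the operator argument commonly used
# to cap retry attempts (an assumption here, not taken from the slides).
default_args = {
    "owner": "airflow",
    "retries": 2,
}

# DAG-level scheduling attributes from the list above.
dag_kwargs = {
    "dag_id": "example_dag",               # hypothetical name
    "default_args": default_args,
    "start_date": datetime(2020, 2, 25),   # first date eligible for scheduling
    "end_date": datetime(2020, 3, 31),     # optional: no new runs after this
    "schedule_interval": "0 12 * * *",     # cron string: daily at noon
}

# With Airflow installed, this would typically become:
#   from airflow import DAG
#   dag = DAG(**dag_kwargs)
```

Keeping the arguments in a plain dictionary makes the sketch runnable without Airflow; in a real DAG file they would normally be written directly in the DAG(...) call.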

Schedule interval

schedule_interval represents:

  • How often to schedule the DAG
  • Between the start_date and end_date
  • Can be defined via cron style syntax or via built-in presets.

cron syntax

[Image: Cron syntax]

  • Is pulled from the Unix cron format
  • Consists of 5 fields separated by a space
  • An asterisk * represents running for every interval (i.e., every minute, every day, etc.)
  • A field can hold comma-separated values to specify a list of values

cron examples

0 12 * * *              # Run daily at noon
* * 25 2 *              # Run once per minute on February 25
0,15,30,45 * * * *      # Run every 15 minutes
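
To make the examples above concrete, here is a minimal sketch of a matcher that checks whether a given datetime is selected by a cron expression. It is not how Airflow or cron evaluates schedules: parse_cron and matches are hypothetical helpers, and the sketch only understands *, single numbers, and comma-separated lists (no ranges, no step values, and none of cron's special handling when both day fields are restricted).

```python
from datetime import datetime

FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a 5-field cron string into {field name: allowed values}.

    A value of None stands for '*', meaning any value is allowed.
    """
    parts = expression.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    fields = {}
    for name, part in zip(FIELD_NAMES, parts):
        fields[name] = None if part == "*" else {int(v) for v in part.split(",")}
    return fields


def matches(expression, moment):
    """Return True if the datetime falls on a minute the expression selects."""
    fields = parse_cron(expression)
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # cron counts Sunday as 0; isoweekday() counts Sunday as 7
        "day_of_week": moment.isoweekday() % 7,
    }
    return all(allowed is None or actual[name] in allowed
               for name, allowed in fields.items())


noon = datetime(2020, 2, 25, 12, 0)
print(matches("0 12 * * *", noon))                               # True
print(matches("* * 25 2 *", datetime(2020, 2, 25, 3, 17)))       # True
print(matches("0,15,30,45 * * * *", datetime(2020, 2, 25, 3, 20)))  # False
```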

Airflow scheduler presets

Preset        cron equivalent
@hourly       0 * * * *
@daily        0 0 * * *
@weekly       0 0 * * 0
@monthly      0 0 1 * *
@yearly       0 0 1 1 *
1 https://airflow.apache.org/docs/stable/scheduler.html
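
The preset-to-cron pairs listed above can be captured in a small lookup table. This is only an illustration of the mapping: PRESET_TO_CRON and to_cron are hypothetical names, not part of Airflow's API.

```python
# Preset names and their cron equivalents, as listed on the slide.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def to_cron(schedule):
    """Return the cron string for a preset; pass other values through unchanged."""
    return PRESET_TO_CRON.get(schedule, schedule)


print(to_cron("@weekly"))     # 0 0 * * 0
print(to_cron("0 12 * * *"))  # 0 12 * * *
```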

Special presets

Airflow has two special schedule_interval presets:

  • None - Never schedule; used for manually triggered DAGs
  • @once - Schedule only once

schedule_interval issues

When scheduling a DAG, Airflow will:

  • Use the start_date as the earliest possible value
  • Schedule the first DAG run at start_date + schedule_interval
'start_date': datetime(2020, 2, 25),
'schedule_interval': '@daily'

This means the earliest the DAG can run is February 26th, 2020.
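
The arithmetic behind that date can be checked with the standard library alone. In this sketch a one-day timedelta stands in for the @daily preset; the variable names are chosen for illustration.

```python
from datetime import datetime, timedelta

start_date = datetime(2020, 2, 25)
schedule_interval = timedelta(days=1)  # stands in for '@daily'

# The first run happens once one full interval has passed after start_date.
first_run = start_date + schedule_interval
print(first_run)  # 2020-02-26 00:00:00
```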


Let's practice!
