Airflow scheduling

Introduction to Apache Airflow in Python

Mike Metzger

Data Engineer

DAG Runs

  • A specific instance of a workflow at a point in time
  • Can be run manually or via schedule_interval
  • Maintains state for each workflow and the tasks within
    • running
    • failed
    • success
1 https://airflow.apache.org/docs/stable/scheduler.html

DAG Runs view

[Figure: DAG Runs]


DAG Runs state

[Figure: DAG Runs State]


Schedule details

When scheduling a DAG, there are several attributes of note:

  • start_date - The date / time to initially schedule the DAG run
  • end_date - Optional attribute for when to stop running new DAG instances
  • max_tries - Optional attribute for how many attempts to make (on tasks, this is set through the retries argument)
  • schedule_interval - How often to run
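As a minimal sketch, the attributes above can be collected as keyword arguments for a DAG. The dag_id and the specific dates are made up for illustration, and the commented-out call assumes an Airflow version whose DAG constructor still accepts schedule_interval.

```python
from datetime import datetime

# Keyword arguments one might pass to airflow.DAG(...).
# The dag_id and the dates are hypothetical values for illustration.
dag_kwargs = {
    "dag_id": "example_schedule",
    "start_date": datetime(2020, 2, 25),  # earliest date considered for scheduling
    "end_date": datetime(2020, 3, 25),    # optional: no new DAG runs after this
    "schedule_interval": "@daily",        # a preset; a cron string also works
}

# With Airflow installed (and a version that accepts schedule_interval):
#   from airflow import DAG
#   dag = DAG(**dag_kwargs)
```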

Schedule interval

schedule_interval represents:

  • How often to schedule the DAG
  • Between the start_date and end_date
  • Can be defined via cron style syntax or via built-in presets.

cron syntax

[Figure: Cron syntax]

  • Is pulled from the Unix cron format
  • Consists of 5 fields separated by a space
  • An asterisk * means "run for every value of this field" (e.g., every minute, every day, etc.)
  • A field can hold comma-separated values to specify a list of values
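The field rules above can be sketched in a few lines of Python. This is an illustration only, not how cron or Airflow parse expressions: the function name expand_cron and the field names are hypothetical, and only * and comma-separated integers are handled.

```python
# (name, lowest value, highest value) for each of the 5 cron fields, in order.
FIELD_RANGES = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),
]


def expand_cron(expression):
    """Expand a 5-field cron string into the values each field matches.

    Only '*' and comma-separated integers are supported in this sketch.
    """
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 space-separated fields")
    expanded = {}
    for (name, low, high), field in zip(FIELD_RANGES, fields):
        if field == "*":
            # An asterisk stands for every value the field can take.
            values = list(range(low, high + 1))
        else:
            # A comma-separated list names the values explicitly.
            values = sorted(int(part) for part in field.split(","))
        expanded[name] = values
    return expanded
```

For example, expand_cron("0,15,30,45 * * * *")["minute"] gives the four listed minutes, while the hour field expands to all 24 hours.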

cron examples

0 12 * * *              # Run daily at noon
* * 25 2 *              # Run once per minute on February 25
0,15,30,45 * * * *      # Run every 15 minutes
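To check the expressions above against concrete timestamps, a simplified matcher can be sketched as follows. The function name matches is hypothetical, only * and comma lists are supported, and all five fields must match (real cron treats day-of-month and day-of-week specially when both are restricted).

```python
from datetime import datetime


def matches(expression, moment):
    """Return True if the datetime satisfies a simple 5-field cron string."""
    fields = expression.split()
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    actual = [moment.minute, moment.hour, moment.day, moment.month, cron_weekday]
    for field, value in zip(fields, actual):
        if field == "*":
            continue  # an asterisk matches every value
        if value not in {int(part) for part in field.split(",")}:
            return False
    return True


print(matches("0 12 * * *", datetime(2020, 2, 25, 12, 0)))   # daily at noon
print(matches("* * 25 2 *", datetime(2020, 2, 25, 3, 17)))   # any minute on Feb 25
print(matches("0,15,30,45 * * * *", datetime(2020, 1, 1, 5, 31)))  # not on a quarter hour
```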

Airflow scheduler presets

Preset:       cron equivalent:

  • @hourly     0 * * * *
  • @daily      0 0 * * *
  • @weekly     0 0 * * 0
  • @monthly    0 0 1 * *
  • @yearly     0 0 1 1 *
1 https://airflow.apache.org/docs/stable/scheduler.html
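The preset-to-cron correspondence can be written down as a plain lookup table. This is only a sketch for reference: the names PRESET_TO_CRON and to_cron are hypothetical and are not part of Airflow.

```python
# Presets and their cron equivalents, as listed on this slide.
PRESET_TO_CRON = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def to_cron(schedule):
    """Translate a preset to its cron string; pass cron strings through unchanged."""
    return PRESET_TO_CRON.get(schedule, schedule)


print(to_cron("@weekly"))
print(to_cron("0 12 * * *"))
```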

Special presets

Airflow has two special schedule_interval presets:

  • None - Never schedule; used for manually triggered DAGs
  • @once - Schedule only once

schedule_interval issues

When scheduling a DAG, Airflow will:

  • Use the start_date as the earliest possible value
  • Schedule the task at start_date + schedule_interval
'start_date': datetime(2020, 2, 25),
'schedule_interval': '@daily'

This means the earliest the DAG can run is February 26th, 2020.
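The arithmetic behind this can be checked with plain datetime objects. This is a sketch of the reasoning only, treating @daily as a one-day interval; it does not call Airflow.

```python
from datetime import datetime, timedelta

start_date = datetime(2020, 2, 25)
schedule_interval = timedelta(days=1)  # what @daily amounts to

# The first run is scheduled once one full interval has passed after start_date.
first_run = start_date + schedule_interval
print(first_run)  # 2020-02-26 00:00:00
```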


Let's practice!
