Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
How to schedule?
cron
scheduling toolDirected Acyclic Graph
cron
# Create the DAG object dag = DAG(dag_id="example_dag", ..., schedule_interval="0 * * * *")
# Define operations start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag) ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag) ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag) enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag = dag)
# Set up dependency flow start_cluster.set_downstream(ingest_customer_data) ingest_customer_data.set_downstream(enrich_customer_data) ingest_product_data.set_downstream(enrich_customer_data)
Introduction to Data Engineering