Slide 1

Slide 1 text

Datasets
What it is, and how it was made
Tzu-ping Chung

Slide 2

Slide 2 text

Call me TP (uraunsjr.com)
Taipei, Taiwan
Astronomer Inc.
Python packaging (pip & others)
PyCon Taiwan (2–3 Sep.)

Slide 3

Slide 3 text

The scenario

Slide 4

Slide 4 text

Existing solutions
SubDagOperator
Giant DAG + task groups
TriggerDagRunOperator

Slide 5

Slide 5 text

# Upstream DAG.
create = S3CreateObjectOperator(
    task_id="upload_data",
    s3_key=uri,
    data=transformed_data,
)
trigger = TriggerDagRunOperator(
    task_id="trigger_downstream",
    trigger_dag_id="use_data",
)
create >> trigger

# Downstream DAG.
dag = DAG(
    dag_id="use_data",
    schedule=None,
)

@dag.task
def use_s3_data():
    hook = S3Hook()
    hook.download_file(uri, ...)

Slide 6

Slide 6 text

“DAG of DAGs”
with DAG(dag_id="dag_of_dags", ...):
    trigger_create_dag = TriggerDagRunOperator(
        task_id="trigger_upstream",
        trigger_dag_id="create_data",
        wait_for_completion=True,
    )
    trigger_use_dag = TriggerDagRunOperator(
        task_id="trigger_downstream",
        trigger_dag_id="use_data",
        wait_for_completion=True,
    )
    trigger_create_dag >> trigger_use_dag

Slide 7

Slide 7 text

Separation of Concerns
When adding downstream DAGs
“Why was this DAG run?”
Mixing DAG and task dependencies

Slide 8

Slide 8 text

Remove the “Lines”
Publish-subscribe pattern (PubSub)
Upstream task emits an event
DAG scheduled when event detected
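The publish-subscribe idea on this slide can be illustrated with a minimal sketch in plain Python. It is not Airflow code: the `EventBus` class, the example URI, and the returned messages are all hypothetical, chosen only to show that the publisher never refers to its subscribers by name.

```python
from collections import defaultdict


class EventBus:
    """Toy publish-subscribe broker (hypothetical, not an Airflow class)."""

    def __init__(self):
        # Maps an event name to the callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, event, callback):
        """Register a callback to run whenever `event` is published."""
        self._subscribers[event].append(callback)

    def publish(self, event):
        """Notify every subscriber of `event` and collect their results."""
        # The publisher does not know who is listening; the bus looks it up.
        return [callback(event) for callback in self._subscribers[event]]


bus = EventBus()
# Two downstream consumers subscribe to the same (made-up) data URI.
bus.subscribe("s3://bucket/data.csv", lambda uri: f"use_data scheduled for {uri}")
bus.subscribe("s3://bucket/data.csv", lambda uri: f"report scheduled for {uri}")

# The upstream side only announces that the data changed.
results = bus.publish("s3://bucket/data.csv")
```

Adding a second downstream consumer required no change on the publishing side, which is the property the "DAG of DAGs" and TriggerDagRunOperator approaches lack.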

Slide 9

Slide 9 text


Slide 10

Slide 10 text

Background
Design
Rich scheduling
Data lineage (task-level)
AIP-48

Slide 11

Slide 11 text

Implementation
Events
Metadatabase stores published messages
Dataset Manager subscribes DAGs on parse
Scheduler loop calls Dataset Manager
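The three pieces named on this slide can be modelled roughly as follows. This is a toy sketch, not Airflow's implementation: `ToyDatasetManager`, its method names, and the in-memory list standing in for the metadatabase are all assumptions made for illustration, and it ignores details such as DAGs that depend on more than one dataset.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatasetManager:
    """Hypothetical stand-in for the Dataset Manager; not Airflow's class."""

    # Dataset URI -> ids of DAGs subscribed when their files were parsed.
    subscriptions: dict = field(default_factory=dict)
    # In-memory stand-in for the metadatabase table of published events.
    event_queue: list = field(default_factory=list)

    def register_dag(self, dag_id, schedule_uris):
        """Called at parse time: subscribe a DAG to each dataset it lists."""
        for uri in schedule_uris:
            self.subscriptions.setdefault(uri, []).append(dag_id)

    def record_event(self, uri):
        """Called when an upstream task finishes: store the published event."""
        self.event_queue.append(uri)

    def dags_to_run(self):
        """Called once per scheduler loop: drain events, return DAGs to run."""
        triggered = []
        while self.event_queue:
            uri = self.event_queue.pop(0)
            for dag_id in self.subscriptions.get(uri, []):
                if dag_id not in triggered:
                    triggered.append(dag_id)
        return triggered


manager = ToyDatasetManager()
manager.register_dag("use_data", ["s3://bucket/data.csv"])  # on parse
manager.record_event("s3://bucket/data.csv")                # task emitted event
first_loop = manager.dags_to_run()   # scheduler loop sees the event
second_loop = manager.dags_to_run()  # nothing new has been published
```

Storing events before acting on them keeps the upstream task decoupled from scheduling: the task only records what happened, and the scheduler decides what to run on its next pass.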

Slide 12

Slide 12 text

# Upstream DAG.
from common import dataset

S3CreateObjectOperator(
    task_id="upload_data",
    s3_key=dataset.uri,
    data=transformed_data,
    outlets=[dataset],  # <-
)

# Downstream DAG.
from common import dataset

dag = DAG(
    dag_id="use_data",
    schedule=[dataset],  # <-
)

@dag.task
def use_s3_data():
    uri = dataset.uri
    h = S3Hook()
    h.download_file(uri, ...)

Slide 13

Slide 13 text

Future work
External dataset inlets
Dynamic lineage expansion
Multi-paradigm DAG scheduling

Slide 14

Slide 14 text

Questions?