Datasets: What it is, and how it was made

This is the slide deck I used to present at Airflow Meetup Tokyo #3.

Connpass page (Japanese)
Meetup page (English)

Tzu-ping Chung

March 16, 2023

Transcript

  1. # Upstream DAG.
     create = S3CreateObjectOperator(
         task_id="upload_data",
         s3_key=uri,
         data=transformed_data,
     )
     trigger = TriggerDagRunOperator(
         task_id="trigger_downstream",
         trigger_dag_id="use_data",
     )
     create >> trigger

     # Downstream DAG.
     dag = DAG(
         dag_id="use_data",
         schedule=None,
     )

     @dag.task
     def use_s3_data():
         hook = S3Hook()
         hook.download_file(uri, ...)
  2. “DAG of DAGs”
     with DAG(dag_id="dag_of_dags", ...):
         trigger_create_dag = TriggerDagRunOperator(
             task_id="trigger_upstream",
             trigger_dag_id="create_data",
             wait_for_completion=True,
         )
         trigger_use_dag = TriggerDagRunOperator(
             task_id="trigger_downstream",
             trigger_dag_id="use_data",
             wait_for_completion=True,
         )
         trigger_create_dag >> trigger_use_dag
  3. Separation of Concerns
     - When adding downstream DAGs
     - “Why was this DAG run?”
     - Mixing DAG and task dependencies
  4. (no text on this slide)

  5. # Upstream DAG.
     from common import dataset

     S3CreateObjectOperator(
         task_id="upload_data",
         s3_key=dataset.uri,
         data=transformed_data,
         outlets=[dataset],  # <-
     )

     # Downstream DAG.
     from common import dataset

     dag = DAG(
         dag_id="use_data",
         schedule=[dataset],  # <-
     )

     @dag.task
     def use_s3_data():
         uri = dataset.uri
         h = S3Hook()
         h.download_file(uri, ...)
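Both DAG files in the final snippet above import `dataset` from a shared `common` module, which the deck itself does not show. Below is a minimal sketch of what such a module might contain, assuming Airflow 2.4 or later (where `airflow.datasets.Dataset` was introduced); the module filename and the S3 URI are illustrative placeholders, not taken from the slides.

    # common.py -- hypothetical shared module imported by both DAG files.
    from airflow.datasets import Dataset

    # Airflow identifies a dataset purely by its URI string; it never reads
    # the data behind that URI. The value below is a placeholder.
    dataset = Dataset("s3://my-bucket/processed/data.json")

Defining the dataset once and importing it from both files keeps the producer and consumer decoupled: the producer declares `outlets=[dataset]`, the consumer declares `schedule=[dataset]`, and the scheduler starts a `use_data` run whenever a task listing the dataset as an outlet finishes successfully, without either DAG referencing the other by ID.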