
Datasets: What it is, and how it was made

This is the slide deck I used to present at Airflow Meetup Tokyo #3.

Connpass page (Japanese)
Meetup page (English)

Tzu-ping Chung

March 16, 2023

Transcript

  1. Datasets
    What it is, and how it was made
    Tzu-ping Chung

  2. Call me TP (uraunsjr.com)
    Taipei, Taiwan
    Astronomer Inc.
    Python packaging (pip & others)
    PyCon Taiwan (2–3 Sep.)

  3. The scenario

  4. Existing solutions
    SubDagOperator
    Giant DAG + task groups
    TriggerDagRunOperator

  5. # Upstream DAG.
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator
    from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

    create = S3CreateObjectOperator(
        task_id="upload_data",
        s3_key=uri,  # `uri` is defined elsewhere, shared with the downstream DAG.
        data=transformed_data,
    )
    trigger = TriggerDagRunOperator(
        task_id="trigger_downstream",
        trigger_dag_id="use_data",
    )
    create >> trigger

    # Downstream DAG.
    from airflow import DAG
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    dag = DAG(
        dag_id="use_data",
        schedule=None,  # Never scheduled; runs only when triggered.
    )

    @dag.task
    def use_s3_data():
        hook = S3Hook()
        hook.download_file(uri, ...)

  6. “DAG of DAGs”
    with DAG(dag_id="dag_of_dags", ...):
        trigger_create_dag = TriggerDagRunOperator(
            task_id="trigger_upstream",
            trigger_dag_id="create_data",
            wait_for_completion=True,
        )
        trigger_use_dag = TriggerDagRunOperator(
            task_id="trigger_downstream",
            trigger_dag_id="use_data",
            wait_for_completion=True,
        )
        trigger_create_dag >> trigger_use_dag

  7. Separation of Concerns
    Upstream must be edited when adding downstream DAGs
    “Why was this DAG run?”
    Mixing DAG and task dependencies

  8. Remove the “Lines”
    Publish-subscribe pattern (PubSub)
    Upstream task emits an event
    DAG scheduled when event detected
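
    (A preview of how Airflow 2.4+ spells this pattern; the Dataset class
    is real, but the URI below is made up.)

    from airflow.datasets import Dataset

    data = Dataset("s3://my-bucket/processed/data.json")

    # Producer: a task listing `data` in `outlets` emits an event on success.
    # Consumer: a DAG with `schedule=[data]` runs when that event is detected.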

  9. (image slide)

  10. Design background
    Rich scheduling
    Data lineage (task-level)
    AIP-48

  11. Implementing Events
    Metadatabase stores published messages
    Dataset Manager subscribes DAGs to datasets at parse time
    Scheduler loop calls the Dataset Manager
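
    (The three bullets above sketch a pub-sub loop. Below is a self-contained,
    in-memory illustration of that flow; the names and data structures are
    illustrative, not Airflow's actual internals.)

    from collections import defaultdict

    # "Metadatabase": pending dataset events, keyed by dataset URI.
    events = defaultdict(list)
    # Subscriptions the Dataset Manager records when DAG files are parsed.
    subscriptions = defaultdict(list)

    def parse_dag(dag_id, schedule):
        # Dataset Manager subscribes the DAG to each dataset it schedules on.
        for uri in schedule:
            subscriptions[uri].append(dag_id)

    def emit_event(uri, source_task):
        # An upstream task with `outlets=[dataset]` publishes an event.
        events[uri].append(source_task)

    def scheduler_loop_once():
        # The scheduler loop asks the Dataset Manager which subscribed
        # DAGs have fresh events, then triggers them.
        for uri, pending in events.items():
            if pending:
                for dag_id in subscriptions[uri]:
                    print(f"trigger {dag_id}: dataset {uri} updated")
                pending.clear()

    parse_dag("use_data", ["s3://bucket/data.json"])
    emit_event("s3://bucket/data.json", "upload_data")
    scheduler_loop_once()  # trigger use_data: dataset s3://bucket/data.json updated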

  12. # Upstream DAG.
    from common import dataset

    S3CreateObjectOperator(
        task_id="upload_data",
        s3_key=dataset.uri,
        data=transformed_data,
        outlets=[dataset],  # Emits a dataset event when the task succeeds.
    )

    # Downstream DAG.
    from common import dataset

    dag = DAG(
        dag_id="use_data",
        schedule=[dataset],  # Runs whenever the dataset receives an event.
    )

    @dag.task
    def use_s3_data():
        uri = dataset.uri
        h = S3Hook()
        h.download_file(uri, ...)
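
    (Both files import `dataset` from a shared `common` module the deck never
    shows; a plausible sketch of its contents, with an assumed URI:)

    # common.py
    from airflow.datasets import Dataset

    dataset = Dataset("s3://my-bucket/processed/data.json")

    A DAG can also subscribe to several datasets, e.g. schedule=[a, b]; it
    then runs once every listed dataset has been updated since its last run.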

  13. Future work
    External dataset inlets
    Dynamic lineage expansion
    Multi-paradigm DAG scheduling

  14. Questions?