A Newcomer's Guide To Airflow's Architecture

A talk I gave at Airflow Summit 2021.

Andrew Godwin

July 12, 2021

Transcript

  1. A NEWCOMER'S GUIDE TO AIRFLOW'S ARCHITECTURE
    ANDREW GODWIN // @andrewgodwin

  2. Hi, I’m
    Andrew Godwin
    • Principal Engineer at
    • Also a Django core developer, ASGI author
    • Using Airflow since March 2021

  3. High-Level Concepts
    What exactly is going on?
    The Good and the Bad
    Or, How I Learned To Stop Worrying And Love The Scheduler
    Problems, Fixes & The Future
    Where we go from here

  4. Differences from things I have worked on?
    (An eclectic variety of web and backend systems)

  5. "Real-time" versus batch
    The availability versus consistency tradeoff is different!
    Simple concepts, hard to master
    In Django, it's the ORM. In Airflow, scheduling.
    It's all still distributed systems
    Which is fortunate, after fifteen years of doing them

  6. Airflow grew organically
    It started off as an internal ETL tool

  7. DAG ➡ DagRun
    One per scheduled run, as the run starts
    Operator ➡ Task
    When you call an operator in a DAG
    Task ➡ TaskInstance
    When a Task needs to run as part of a DagRun
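The three mappings on this slide can be sketched as a toy model. This is not Airflow's own code: the classes and methods below (`Dag`, `Task`, `DagRun`, `TaskInstance`, `add_operator`, `create_run`) are simplified stand-ins invented for illustration, and the real objects carry far more state.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Task:
    """An operator that has been placed in a DAG (hypothetical, simplified)."""
    task_id: str


@dataclass
class TaskInstance:
    """One Task bound to one specific DagRun."""
    task_id: str
    run_id: str
    state: str = "none"


@dataclass
class DagRun:
    """One run of a DAG for one logical date."""
    dag_id: str
    run_id: str
    task_instances: list = field(default_factory=list)


@dataclass
class Dag:
    dag_id: str
    tasks: list = field(default_factory=list)

    def add_operator(self, task_id: str) -> Task:
        # Operator -> Task: calling an operator inside a DAG registers a Task.
        task = Task(task_id)
        self.tasks.append(task)
        return task

    def create_run(self, logical_date: datetime) -> DagRun:
        # DAG -> DagRun: one per scheduled run, created as the run starts.
        run_id = f"scheduled__{logical_date.isoformat()}"
        run = DagRun(self.dag_id, run_id)
        # Task -> TaskInstance: one per Task in this DagRun. For simplicity
        # this sketch creates them all up front.
        run.task_instances = [TaskInstance(t.task_id, run_id) for t in self.tasks]
        return run


dag = Dag("example_etl")
dag.add_operator("extract")
dag.add_operator("load")
run = dag.create_run(datetime(2021, 7, 12))
```

The point of the sketch is the direction of the arrows: the left-hand side of each pair is a definition, and the right-hand side is a concrete occurrence of it.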

  8. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results
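The split of responsibilities can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how Airflow implements either component: `find_runnable` and `execute` are hypothetical names, state lives in an in-memory dict, and dependencies are the only scheduling rule considered.

```python
def find_runnable(states, upstream):
    """Scheduler role: work out which task instances need to run now."""
    return [
        task_id
        for task_id, state in states.items()
        if state == "pending"
        and all(states[dep] == "success" for dep in upstream.get(task_id, []))
    ]


def execute(task_id, callables, states):
    """Executor role: run one task instance and record the result."""
    try:
        callables[task_id]()
        states[task_id] = "success"
    except Exception:
        states[task_id] = "failed"


log = []
callables = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
# Each task lists the tasks that must succeed before it may start.
upstream = {"transform": ["extract"], "load": ["transform"]}
states = {task_id: "pending" for task_id in callables}

# Alternate between the two roles until nothing is left to run.
while True:
    runnable = find_runnable(states, upstream)
    if not runnable:
        break
    for task_id in runnable:
        execute(task_id, callables, states)
```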

  9. Scheduler
    LocalExecutor
    Webserver
    Database
    DAG Files

  10. Scheduler
    CeleryExecutor
    Webserver
    Database
    DAG Files
    Redis/Queue
    Workers

  11. The Executor runs inside the Scheduler
    Its logic does, at least; with local executors, the tasks do too

  12. Everything talks to the database
    It's the single central point of coordination
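One way to picture the database as the coordination point is a sketch where three components never call each other and only read or write shared rows. Everything here is an assumption made for illustration: the table layout is far simpler than Airflow's real schema, SQLite stands in for a production database, and the function names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
)


def scheduler_queue(conn, task_id):
    """Scheduler: record that a task instance should run."""
    conn.execute("INSERT INTO task_instance VALUES (?, 'queued')", (task_id,))
    conn.commit()


def worker_claim_and_run(conn):
    """Worker: claim one queued row, 'run' it, and record the result."""
    row = conn.execute(
        "SELECT task_id FROM task_instance WHERE state = 'queued' LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The state check in the WHERE clause means only one claimant can win.
    claimed = conn.execute(
        "UPDATE task_instance SET state = 'running' "
        "WHERE task_id = ? AND state = 'queued'",
        (row[0],),
    )
    if claimed.rowcount != 1:
        return None
    conn.execute(
        "UPDATE task_instance SET state = 'success' WHERE task_id = ?", (row[0],)
    )
    conn.commit()
    return row[0]


def webserver_view(conn):
    """Webserver: read current state for display, never touching the others."""
    return dict(conn.execute("SELECT task_id, state FROM task_instance"))


scheduler_queue(db, "extract")
before = webserver_view(db)
ran = worker_claim_and_run(db)
after = webserver_view(db)
```

None of the three functions knows the others exist; the rows are the only shared contract, which is what makes the database the central point.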

  13. Scheduler, Workers, Webserver
    All can be run in a high-availability pattern

  14. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results

  15. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results

  16. Timing
    Dependencies
    Retries
    Concurrency
    Callbacks
    ...
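To make the list concrete, here is a hedged sketch of how several of these concerns could combine into one "can this run yet?" decision. The `Candidate` class, its fields, and `blocking_reason` are hypothetical names, the order of the checks is an arbitrary choice, and callbacks are left out; the real scheduler's rules are considerably more involved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    """A task instance the scheduler is considering (hypothetical, simplified)."""
    task_id: str
    upstream_states: list
    tries: int = 0
    max_tries: int = 1
    next_attempt_at: datetime = datetime.min


def blocking_reason(ti, now, running_count, max_running):
    """Return the concern that blocks this task, or None if it may run."""
    if now < ti.next_attempt_at:
        return "timing"          # not due yet, e.g. waiting out a retry delay
    if any(state != "success" for state in ti.upstream_states):
        return "dependencies"    # an upstream task has not succeeded
    if ti.tries >= ti.max_tries:
        return "retries"         # no attempts left
    if running_count >= max_running:
        return "concurrency"     # every slot is already in use
    return None


now = datetime(2021, 7, 12, 12, 0)
ready = Candidate("a", ["success"])
waiting = Candidate("b", ["running"])
exhausted = Candidate("c", ["success"], tries=1)
delayed = Candidate("d", [], next_attempt_at=now + timedelta(minutes=5))
```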

  17. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results

  18. Celery or Kubernetes
    Our two main options, currently

  19. Scheduler
    CeleryExecutor
    Webserver
    Database
    DAG Files
    Redis/Queue
    Workers

  20. Scheduler
    KubernetesExecutor
    Webserver
    Database
    DAG Files
    Kubernetes
    Task Pods

  21. Tasks are the core part of the model
    DAGs are more of a grouping/trigger mechanism

  22. Very flexible runtime environments
    Airflow's strength, and its weakness

  23. Airflow doesn't know what you're running
    This is both an advantage and a disadvantage.

  24. What can we improve?
    Let's talk about The Future

  25. More Async & Eventing
    Anything that involves waiting!
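Why async suits "anything that involves waiting" can be shown with the standard library alone. The sketch below is an assumption-laden illustration rather than Airflow code: `wait_for_event` and `run_waits` are hypothetical names, and `asyncio.sleep` stands in for whatever external event a task would really be waiting on.

```python
import asyncio


async def wait_for_event(name, delay, fired):
    # While this coroutine sleeps it holds no thread and no worker slot.
    await asyncio.sleep(delay)
    fired.append(name)


async def run_waits(waits):
    """Run every wait concurrently inside a single event loop."""
    fired = []
    await asyncio.gather(
        *(wait_for_event(name, delay, fired) for name, delay in waits)
    )
    return fired


# Three waits share one process; they complete in order of readiness,
# not in the order they were submitted.
fired = asyncio.run(
    run_waits([("slow", 0.06), ("fast", 0.02), ("medium", 0.04)])
)
```

The total wall-clock time is roughly that of the longest wait rather than the sum of all three, which is the property that makes a single process able to hold a very large number of idle waits.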

  26. Scheduler
    CeleryExecutor
    Webserver
    Database
    DAG Files
    Redis/Queue
    Workers
    Triggerer

  27. Removing Database Connections
    APIs scale a lot better!

  28. I do like the database, though
    There's a lot of benefit in proven technology

  29. Software Engineering is not just coding
    Any large-scale project needs documentation, architecture, and coordination

  30. Maintenance & compatibility are crucial
    Anyone can write a tool - supporting it takes effort

  31. Airflow is forged by people like you.
    Coding, documentation, triage, QA, support - it all needs doing.

  32. Thanks.
    Andrew Godwin
    @andrewgodwin
    [email protected]
