A Newcomer's Guide To Airflow's Architecture

A talk I gave at Airflow Summit 2021.

Andrew Godwin

July 12, 2021

Transcript

  1. A NEWCOMER'S GUIDE TO AIRFLOW'S ARCHITECTURE
    ANDREW GODWIN // @andrewgodwin

  2. Hi, I’m
    Andrew Godwin
    • Principal Engineer at
    • Also a Django core developer, ASGI author
    • Using Airflow since March 2021

  4. High-Level Concepts
    What exactly is going on?
    The Good and the Bad
    Or, How I Learned To Stop Worrying And Love The Scheduler
    Problems, Fixes & The Future
    Where we go from here

  5. Differences from things I have worked on?
    (An eclectic variety of web and backend systems)

  6. "Real-time" versus batch
    The availability versus consistency tradeoff is different!
    Simple concepts, hard to master
    In Django, it's the ORM. In Airflow, scheduling.
    It's all still distributed systems
    Which is fortunate, after fifteen years of doing them

  7. Airflow grew organically
    It started off as an internal ETL tool

  9. DAG ➡ DagRun
    One per scheduled run, as the run starts
    Operator ➡ Task
    When you call an operator in a DAG
    Task ➡ TaskInstance
    When a Task needs to run as part of a DagRun
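
The three mappings on this slide can be sketched as a toy object model. This is a minimal illustration, not Airflow's real classes: every name below (`add_operator`, `start_run`, the fields) is hypothetical, and for brevity it creates all TaskInstances when the run starts rather than as each Task becomes due.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Task:
    """Toy stand-in: what you get when an operator is called inside a DAG."""
    task_id: str


@dataclass
class DAG:
    """Toy stand-in: a named group of Tasks."""
    dag_id: str
    tasks: list = field(default_factory=list)

    def add_operator(self, task_id):
        # Operator -> Task: calling an operator in a DAG registers a Task.
        task = Task(task_id)
        self.tasks.append(task)
        return task


@dataclass
class TaskInstance:
    """Toy stand-in: one Task as it runs inside one particular DagRun."""
    task: Task
    run_id: str
    state: str = "none"


@dataclass
class DagRun:
    """Toy stand-in: one scheduled run of a DAG."""
    dag: DAG
    logical_date: datetime
    task_instances: list = field(default_factory=list)

    @property
    def run_id(self):
        return f"{self.dag.dag_id}@{self.logical_date.isoformat()}"


def start_run(dag, logical_date):
    # DAG -> DagRun: one per scheduled run, created as the run starts.
    run = DagRun(dag, logical_date)
    # Task -> TaskInstance: one per Task, per DagRun (simplified: all at once).
    run.task_instances = [TaskInstance(t, run.run_id) for t in dag.tasks]
    return run
```

Starting the same DAG for two different dates gives two DagRuns that share their Tasks but have separate TaskInstances, which is why state lives on the TaskInstance.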

  10. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results
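
The split between the two roles can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Airflow's implementation: the function names, the state strings, and the single-process loop are all invented for illustration, and real scheduling also weighs timing, retries, and concurrency limits.

```python
def schedule(states, upstream):
    """Scheduler role: work out which task instances need to run now.

    A task is ready when it has not run yet and every upstream task succeeded.
    """
    return [
        task_id
        for task_id, state in states.items()
        if state == "none"
        and all(states[u] == "success" for u in upstream.get(task_id, []))
    ]


def execute(task_id, callables, states):
    """Executor role: run one task instance and record the result."""
    try:
        callables[task_id]()
        states[task_id] = "success"
    except Exception:
        states[task_id] = "failed"


def run_to_completion(callables, upstream):
    """Alternate between the two roles until nothing more is ready."""
    states = {task_id: "none" for task_id in callables}
    order = []
    while True:
        ready = schedule(states, upstream)
        if not ready:
            break
        for task_id in ready:
            execute(task_id, callables, states)
            order.append(task_id)
    return states, order
```

In this sketch a failed task simply leaves everything downstream of it in the "none" state, because the scheduler never finds those tasks ready.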

  11. Scheduler
    LocalExecutor
    Webserver
    Database
    DAG Files

  12. Scheduler
    CeleryExecutor
    Webserver
    Database
    DAG Files
    Redis/Queue
    Workers

  13. The Executor runs inside the Scheduler
    Its logic, at least, and the tasks too for local ones

  14. Everything talks to the database
    It's the single central point of coordination
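
As a rough illustration of coordinating through a shared database, the sketch below uses Python's built-in `sqlite3` with an in-memory database. The table layout and the function names are assumptions made up for this example; they are not Airflow's actual schema or code.

```python
import sqlite3

# Hypothetical, heavily simplified state table shared by every component.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
)


def scheduler_queue(conn, task_id):
    """Scheduler side: record that a task instance should run."""
    conn.execute(
        "INSERT INTO task_instance (task_id, state) VALUES (?, 'queued')",
        (task_id,),
    )
    conn.commit()


def worker_claim(conn, task_id):
    """Worker side: claim a queued task instance.

    The conditional UPDATE only matches while the row is still 'queued',
    so a second claim attempt changes nothing and returns False.
    """
    cur = conn.execute(
        "UPDATE task_instance SET state = 'running' "
        "WHERE task_id = ? AND state = 'queued'",
        (task_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def worker_finish(conn, task_id, state):
    """Worker side: record the final result."""
    conn.execute(
        "UPDATE task_instance SET state = ? WHERE task_id = ?", (state, task_id)
    )
    conn.commit()


def webserver_view(conn):
    """Webserver side: read the current state of everything for display."""
    rows = conn.execute("SELECT task_id, state FROM task_instance ORDER BY task_id")
    return dict(rows)
```

None of the three roles calls another directly; each one only reads and writes rows, which is the sense in which the database is the coordination point.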

  15. Scheduler, Workers, Webserver
    All can be run in a high-availability pattern

  16. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results

  17. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results

  18. Timing
    Dependencies
    Retries
    Concurrency
    Callbacks
    ...

  19. Scheduler
    Works out what
    TaskInstances need to run
    Executor
    Runs TaskInstances and
    records the results

  20. Celery or Kubernetes
    Our two main options, currently

  21. Scheduler
    CeleryExecutor
    Webserver
    Database
    DAG Files
    Redis/Queue
    Workers

  22. Scheduler
    KubernetesExecutor
    Webserver
    Database
    DAG Files
    Kubernetes
    Task Pods

  24. Tasks are the core part of the model
    DAGs are more of a grouping/trigger mechanism

  25. Very flexible runtime environments
    Airflow's strength, and its weakness

  26. Airflow doesn't know what you're running
    This is both an advantage and a disadvantage.

  27. What can we improve?
    Let's talk about The Future

  28. More Async & Eventing
    Anything that involves waiting!
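
Why async suits waiting can be shown with plain `asyncio`: one process and one event loop can hold many waits at once instead of tying up a worker per wait. This is a generic sketch, not Airflow's API; `wait_for` and `wait_concurrently` are hypothetical names, and `asyncio.sleep` stands in for whatever is really being waited on.

```python
import asyncio


async def wait_for(name, seconds):
    """Stand-in for any wait: a timer, an external job, a file arriving."""
    await asyncio.sleep(seconds)
    return name


async def wait_concurrently(waits):
    """Hold every wait in a single event loop and report them as they fire."""
    pending = [
        asyncio.create_task(wait_for(name, seconds))
        for name, seconds in waits.items()
    ]
    fired = []
    # as_completed yields results in completion order, not submission order.
    for finished in asyncio.as_completed(pending):
        fired.append(await finished)
    return fired
```

Calling `asyncio.run(wait_concurrently({"slow": 0.2, "fast": 0.01}))` takes about as long as the slowest wait, not the sum of both, and returns the names in the order the waits finished.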

  29. Scheduler
    CeleryExecutor
    Webserver
    Database
    DAG Files
    Redis/Queue
    Workers
    Triggerer

  30. Removing Database Connections
    APIs scale a lot better!

  31. I do like the database, though
    There's a lot of benefit in proven technology

  32. Software Engineering is not just coding
    Any large-scale project needs documentation, architecture, and coordination

  33. Maintenance & compatibility is crucial
    Anyone can write a tool - supporting it takes effort

  34. Airflow is forged by people like you.
    Coding, documentation, triage, QA, support - it all needs doing.

  35. Thanks.
    Andrew Godwin
    @andrewgodwin
    [email protected]
