A Newcomer's Guide To Airflow's Architecture

A talk I gave at Airflow Summit 2021.

Andrew Godwin

July 12, 2021

Transcript

  1. A NEWCOMER'S GUIDE TO AIRFLOW'S ARCHITECTURE • ANDREW GODWIN // @andrewgodwin

  2. Hi, I’m Andrew Godwin • Principal Engineer at • Also a Django core developer, ASGI author • Using Airflow since March 2021
  3. None
  4. High-Level Concepts: What exactly is going on? • The Good and the Bad: Or, How I Learned To Stop Worrying And Love The Scheduler • Problems, Fixes & The Future: Where we go from here
  5. Differences from things I have worked on? (An eclectic variety of web and backend systems)
  6. "Real-time" versus batch: The availability versus consistency tradeoff is different! • Simple concepts, hard to master: In Django, it's the ORM. In Airflow, scheduling. • It's all still distributed systems: Which is fortunate, after fifteen years of doing them
  7. Airflow grew organically: It started off as an internal ETL tool
  8. None
  9. DAG ➡ DagRun: One per scheduled run, as the run starts • Operator ➡ Task: When you call an operator in a DAG • Task ➡ TaskInstance: When a Task needs to run as part of a DagRun
  10. Scheduler: Works out what TaskInstances need to run • Executor: Runs TaskInstances and records the results
  11. Diagram components: Scheduler, LocalExecutor, Webserver, Database, DAG Files

  12. Diagram components: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers

  13. The Executor runs inside the Scheduler: Its logic, at least, and the tasks too for local ones
  14. Everything talks to the database: It's the single central point of coordination
  15. Scheduler, Workers, Webserver: All can be run in a high-availability pattern
  16. Scheduler: Works out what TaskInstances need to run • Executor: Runs TaskInstances and records the results
  17. Scheduler: Works out what TaskInstances need to run • Executor: Runs TaskInstances and records the results
  18. Timing • Dependencies • Retries • Concurrency • Callbacks • ...

  19. Scheduler: Works out what TaskInstances need to run • Executor: Runs TaskInstances and records the results
  20. Celery or Kubernetes: Our two main options, currently

  21. Diagram components: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers

  22. Diagram components: Scheduler, KubernetesExecutor, Webserver, Database, DAG Files, Kubernetes, Task Pods

  23. None
  24. Tasks are the core part of the model: DAGs are more of a grouping/trigger mechanism
  25. Very flexible runtime environments: Airflow's strength, and its weakness

  26. Airflow doesn't know what you're running: This is both an advantage and a disadvantage.
  27. What can we improve? Let's talk about The Future

  28. More Async & Eventing: Anything that involves waiting!

  29. Diagram components: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers, Triggerer

  30. Removing Database Connections: APIs scale a lot better!

  31. I do like the database, though: There's a lot of benefit in proven technology
  32. Software Engineering is not just coding: Any large-scale project needs documentation, architecture, and coordination
  33. Maintenance & compatibility is crucial: Anyone can write a tool - supporting it takes effort
  34. Airflow is forged by people like you. Coding, documentation, triage, QA, support - it all needs doing.
  35. Thanks. Andrew Godwin • @andrewgodwin • andrew.godwin@astronomer.io
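
Sketch for slides 9 and 10. The split between "working out what needs to run" and "running it and recording the result" can be illustrated with a small standard-library model. This is a toy under stated assumptions, not Airflow's code: the class fields, the state strings, and the helper names `create_dag_run`, `schedule`, `execute`, and `run_to_completion` are all hypothetical, and only the dependency part of slide 18's list is modelled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """What an operator call inside a DAG produces (hypothetical fields)."""
    task_id: str
    upstream: tuple = ()  # task_ids that must succeed first


@dataclass
class TaskInstance:
    """One Task as part of one DagRun."""
    task_id: str
    run_id: str
    state: str = "none"  # assumed states: none, success, failed


@dataclass
class DagRun:
    """One run of a DAG, holding one TaskInstance per Task."""
    run_id: str
    instances: dict = field(default_factory=dict)


def create_dag_run(tasks, run_id):
    """DAG -> DagRun and Task -> TaskInstance, created as the run starts."""
    run = DagRun(run_id)
    for task in tasks:
        run.instances[task.task_id] = TaskInstance(task.task_id, run_id)
    return run


def schedule(tasks, run):
    """Scheduler role: work out which TaskInstances need to run now."""
    ready = []
    for task in tasks:
        ti = run.instances[task.task_id]
        deps_ok = all(run.instances[u].state == "success" for u in task.upstream)
        if ti.state == "none" and deps_ok:
            ready.append(ti)
    return ready


def execute(ti, callables):
    """Executor role: run a TaskInstance and record the result."""
    try:
        callables[ti.task_id]()
        ti.state = "success"
    except Exception:
        ti.state = "failed"


def run_to_completion(tasks, run, callables):
    """Alternate scheduling and executing until nothing is runnable."""
    order = []
    while True:
        ready = schedule(tasks, run)
        if not ready:
            return order
        for ti in ready:
            execute(ti, callables)
            order.append(ti.task_id)
```

Calling `run_to_completion` on three chained tasks runs them in dependency order. If the first one raises, the downstream instances stay in the `"none"` state because the scheduling step never selects them.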
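
Sketch for slides 14 and 15. One way a shared database can act as the single point of coordination between several schedulers is a conditional update: whoever changes the row first owns the work. The example below uses an in-memory SQLite database purely for illustration. The table layout, the state names, and the `claim` helper are assumptions, and this is not a description of the locking strategy Airflow itself uses.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical task_instance table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE task_instance ("
        "task_id TEXT PRIMARY KEY, state TEXT, owner TEXT)"
    )
    return db


def add_task(db, task_id):
    """Register a task instance that is waiting to be picked up."""
    db.execute(
        "INSERT INTO task_instance VALUES (?, 'scheduled', NULL)", (task_id,)
    )
    db.commit()


def claim(db, task_id, owner):
    """Try to take ownership; succeeds only while the row is still 'scheduled'."""
    cur = db.execute(
        "UPDATE task_instance SET state = 'queued', owner = ? "
        "WHERE task_id = ? AND state = 'scheduled'",
        (owner, task_id),
    )
    db.commit()
    return cur.rowcount == 1


def owner_of(db, task_id):
    """Return the recorded owner of a task instance, or None."""
    row = db.execute(
        "SELECT owner FROM task_instance WHERE task_id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else None
```

With two would-be owners, the first `claim` call returns `True` and the second returns `False`, so the row in the database, rather than any in-process state, decides who runs the task.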
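
Sketch for slides 28 and 29. The appeal of async for "anything that involves waiting" is that one process can hold many waits at once instead of dedicating a worker to each. The example below shows that idea with plain `asyncio`. It is not the Triggerer's implementation, and `wait_for`, `triggerer`, and `run_triggerer` are hypothetical names.

```python
import asyncio
import time


async def wait_for(name, delay, fired):
    """Stand-in for waiting on an external event; sleeps, then records itself."""
    await asyncio.sleep(delay)
    fired.append(name)


async def triggerer(waits):
    """Hold every wait concurrently on a single event loop."""
    fired = []
    await asyncio.gather(*(wait_for(name, delay, fired) for name, delay in waits))
    return fired


def run_triggerer(waits):
    """Run all waits and report which fired and how long it took overall."""
    start = time.monotonic()
    fired = asyncio.run(triggerer(waits))
    return fired, time.monotonic() - start
```

A hundred waits of 0.05 seconds each would take about five seconds if handled one after another. Run through `run_triggerer`, they overlap and finish in roughly the time of a single wait.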