Slide 1

Slide 1 text

A NEWCOMER'S GUIDE TO AIRFLOW'S ARCHITECTURE
ANDREW GODWIN // @andrewgodwin

Slide 2

Slide 2 text

Hi, I’m Andrew Godwin
• Principal Engineer at Astronomer
• Also a Django core developer, ASGI author
• Using Airflow since March 2021

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

• High-Level Concepts: What exactly is going on?
• The Good and the Bad: Or, How I Learned To Stop Worrying And Love The Scheduler
• Problems, Fixes & The Future: Where we go from here

Slide 5

Slide 5 text

Differences from things I have worked on?
(An eclectic variety of web and backend systems)

Slide 6

Slide 6 text

• "Real-time" versus batch: The availability versus consistency tradeoff is different!
• Simple concepts, hard to master: In Django, it's the ORM. In Airflow, scheduling.
• It's all still distributed systems: Which is fortunate, after fifteen years of doing them

Slide 7

Slide 7 text

Airflow grew organically
It started off as an internal ETL tool

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

• DAG ➡ DagRun: One per scheduled run, as the run starts
• Operator ➡ Task: When you call an operator in a DAG
• Task ➡ TaskInstance: When a Task needs to run as part of a DagRun
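The three mappings above can be sketched as a toy data model. This is a minimal illustration in plain Python, not Airflow's real classes: the dataclasses, `add_operator`, `create_dag_run`, and the `run_id` format are all assumptions made up for the example. It also creates every TaskInstance up front for simplicity, whereas the slide says a TaskInstance appears when a Task needs to run.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Task:
    """Toy stand-in: an operator that has been called inside a DAG."""
    task_id: str


@dataclass
class DAG:
    """Toy stand-in for a DAG: a named group of tasks."""
    dag_id: str
    tasks: List[Task] = field(default_factory=list)

    def add_operator(self, task_id: str) -> Task:
        # Operator -> Task: calling an operator in a DAG yields a Task.
        task = Task(task_id)
        self.tasks.append(task)
        return task


@dataclass
class TaskInstance:
    """Toy stand-in: one Task that runs as part of one DagRun."""
    task_id: str
    run_id: str
    state: str = "scheduled"


@dataclass
class DagRun:
    """Toy stand-in: one scheduled run of a DAG."""
    dag_id: str
    run_id: str
    task_instances: List[TaskInstance] = field(default_factory=list)


def create_dag_run(dag: DAG, logical_date: datetime) -> DagRun:
    # DAG -> DagRun: one per scheduled run (hypothetical run_id format).
    run_id = f"scheduled__{logical_date.isoformat()}"
    run = DagRun(dag.dag_id, run_id)
    # Task -> TaskInstance: one per Task, tied to this particular run.
    run.task_instances = [TaskInstance(t.task_id, run_id) for t in dag.tasks]
    return run


dag = DAG("example_etl")
dag.add_operator("extract")
dag.add_operator("load")
run = create_dag_run(dag, datetime(2021, 7, 1))
print([(ti.task_id, ti.state) for ti in run.task_instances])
```

The point of the sketch is the fan-out: one DAG definition produces many DagRuns over time, and each DagRun owns its own set of TaskInstances.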

Slide 10

Slide 10 text

• Scheduler: Works out what TaskInstances need to run
• Executor: Runs TaskInstances and records the results
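The split between the two roles can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Airflow's code: `InlineExecutor`, `schedule`, and the dictionaries describing tasks and upstream dependencies are hypothetical names invented for the example.

```python
from typing import Callable, Dict, List


class InlineExecutor:
    """Hypothetical executor: runs a task callable and records the result."""

    def __init__(self) -> None:
        self.results: Dict[str, str] = {}

    def run(self, task_id: str, fn: Callable[[], None]) -> None:
        try:
            fn()
            self.results[task_id] = "success"
        except Exception:
            self.results[task_id] = "failed"


def schedule(
    tasks: Dict[str, Callable[[], None]],
    upstream: Dict[str, List[str]],
    executor: InlineExecutor,
) -> List[str]:
    """Toy scheduler: decides what is ready, then hands it to the executor."""
    order: List[str] = []
    pending = set(tasks)
    while pending:
        # A task is ready once every upstream task has succeeded.
        ready = sorted(
            t for t in pending
            if all(executor.results.get(u) == "success" for u in upstream.get(t, []))
        )
        if not ready:
            break  # whatever is left is blocked by a failed upstream task
        for task_id in ready:
            executor.run(task_id, tasks[task_id])
            order.append(task_id)
            pending.discard(task_id)
    return order


tasks = {"extract": lambda: None, "transform": lambda: None, "load": lambda: None}
upstream = {"transform": ["extract"], "load": ["transform"]}
executor = InlineExecutor()
order = schedule(tasks, upstream, executor)
print(order, executor.results)
```

The scheduler function never runs anything itself; it only decides. Swapping in a different executor changes where the work happens without touching the scheduling logic.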

Slide 11

Slide 11 text

Diagram components: Scheduler, LocalExecutor, Webserver, Database, DAG Files

Slide 12

Slide 12 text

Diagram components: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers

Slide 13

Slide 13 text

The Executor runs inside the Scheduler
Its logic, at least, and the tasks too for local ones

Slide 14

Slide 14 text

Everything talks to the database
It's the single central point of coordination
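As a rough illustration of coordinating through one shared database, the sketch below uses an in-memory SQLite table. Everything in it is an assumption made for the example: the `task_instance` table has only two columns, and the function names are hypothetical. A real Airflow metadata database has a far richer schema and normally runs on a server database.

```python
import sqlite3

# One shared database that every component reads from and writes to.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
)


def scheduler_queue(task_id: str) -> None:
    """Scheduler side: record that a task instance should run."""
    db.execute("INSERT INTO task_instance VALUES (?, 'queued')", (task_id,))
    db.commit()


def worker_report(task_id: str, state: str) -> None:
    """Worker side: record progress or the final result."""
    db.execute(
        "UPDATE task_instance SET state = ? WHERE task_id = ?", (state, task_id)
    )
    db.commit()


def webserver_view() -> dict:
    """Webserver side: read current state for display."""
    rows = db.execute("SELECT task_id, state FROM task_instance").fetchall()
    return dict(rows)


scheduler_queue("extract")
scheduler_queue("load")
worker_report("extract", "running")
worker_report("extract", "success")
print(webserver_view())
```

None of the three roles call each other directly here; each one only reads or writes rows, which is the sense in which the database is the coordination point.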

Slide 15

Slide 15 text

Scheduler, Workers, Webserver
All can be run in a high-availability pattern

Slide 16

Slide 16 text

• Scheduler: Works out what TaskInstances need to run
• Executor: Runs TaskInstances and records the results

Slide 17

Slide 17 text

• Scheduler: Works out what TaskInstances need to run
• Executor: Runs TaskInstances and records the results

Slide 18

Slide 18 text

Timing, Dependencies, Retries, Concurrency, Callbacks, ...
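Most of the concerns listed above can be pictured as a chain of checks the scheduler applies to each candidate. The sketch below is a simplified assumption, not Airflow's logic: `Candidate`, `can_run`, the state strings, and the order of the checks are invented for illustration, and callbacks are left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Candidate:
    """Hypothetical view of a task instance the scheduler is considering."""
    task_id: str
    upstream: List[str] = field(default_factory=list)
    tries: int = 0
    max_tries: int = 3
    retry_at: Optional[datetime] = None


def can_run(
    ti: Candidate,
    now: datetime,
    not_before: datetime,
    states: Dict[str, str],
    running: int,
    max_running: int,
) -> str:
    """Return 'runnable' or the first reason the candidate must wait."""
    if now < not_before:  # timing: the run's start time has not arrived
        return "waiting: timing"
    if any(states.get(u) != "success" for u in ti.upstream):  # dependencies
        return "waiting: dependencies"
    if ti.tries >= ti.max_tries:  # retries: no attempts left
        return "blocked: retries exhausted"
    if ti.retry_at is not None and now < ti.retry_at:  # retries: back-off delay
        return "waiting: retry delay"
    if running >= max_running:  # concurrency: no free slot
        return "waiting: concurrency"
    return "runnable"


start = datetime(2021, 7, 1)
load = Candidate("load", upstream=["extract"])
print(can_run(load, start, start, {"extract": "running"}, 0, 2))
print(can_run(load, start, start, {"extract": "success"}, 2, 2))
print(can_run(load, start, start, {"extract": "success"}, 1, 2))
```

Each check is cheap on its own; the difficulty the talk hints at comes from evaluating all of them together, for many task instances, against state that other processes keep changing.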

Slide 19

Slide 19 text

• Scheduler: Works out what TaskInstances need to run
• Executor: Runs TaskInstances and records the results

Slide 20

Slide 20 text

Celery or Kubernetes
Our two main options, currently

Slide 21

Slide 21 text

Diagram components: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers

Slide 22

Slide 22 text

Diagram components: Scheduler, KubernetesExecutor, Webserver, Database, DAG Files, Kubernetes, Task Pods
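The difference between the two diagrams can be sketched as two toy executors with the same `submit` method. Both classes are hypothetical simplifications: `QueueExecutor` stands in for the queue-plus-workers shape, `PodExecutor` for the one-pod-per-task shape. Neither talks to Celery or Kubernetes, and the round-robin worker assignment is an assumption made to keep the example deterministic.

```python
from collections import deque
from typing import Deque, Dict, List


class QueueExecutor:
    """Queue-and-workers shape: tasks go onto a queue, a fixed pool pulls them."""

    def __init__(self, workers: List[str]) -> None:
        self.queue: Deque[str] = deque()
        self.workers = list(workers)
        self.ran_on: Dict[str, str] = {}

    def submit(self, task_id: str) -> None:
        self.queue.append(task_id)

    def drain(self) -> None:
        # Long-lived workers take tasks in turn (round-robin for the sketch).
        i = 0
        while self.queue:
            task_id = self.queue.popleft()
            self.ran_on[task_id] = self.workers[i % len(self.workers)]
            i += 1


class PodExecutor:
    """Pod-per-task shape: every task gets its own short-lived environment."""

    def __init__(self) -> None:
        self.ran_on: Dict[str, str] = {}

    def submit(self, task_id: str) -> None:
        # A fresh "pod" is created for each task; nothing is reused.
        self.ran_on[task_id] = f"pod-{task_id}"


celery_like = QueueExecutor(["worker-1", "worker-2"])
k8s_like = PodExecutor()
for task_id in ["a", "b", "c"]:
    celery_like.submit(task_id)
    k8s_like.submit(task_id)
celery_like.drain()
print(celery_like.ran_on)
print(k8s_like.ran_on)
```

In the first shape the workers outlive any single task, so capacity is fixed by the pool; in the second, each task pays a startup cost but gets an isolated environment.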

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Tasks are the core part of the model
DAGs are more of a grouping/trigger mechanism

Slide 25

Slide 25 text

Very flexible runtime environments
Airflow's strength, and its weakness

Slide 26

Slide 26 text

Airflow doesn't know what you're running
This is both an advantage and a disadvantage.

Slide 27

Slide 27 text

What can we improve?
Let's talk about The Future

Slide 28

Slide 28 text

More Async & Eventing
Anything that involves waiting!

Slide 29

Slide 29 text

Diagram components: Scheduler, CeleryExecutor, Webserver, Database, DAG Files, Redis/Queue, Workers, Triggerer
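Why a single extra process helps with "anything that involves waiting" can be sketched with plain `asyncio`. This is an illustration of the general idea only, not the Triggerer's implementation: `wait_for_event`, `triggerer`, and the sensor names and delays are hypothetical, and `asyncio.sleep` stands in for whatever a real wait would poll or listen for.

```python
import asyncio
from typing import Dict, List


async def wait_for_event(name: str, delay: float) -> str:
    """Hypothetical wait: sleeping stands in for polling an external system."""
    await asyncio.sleep(delay)
    return name


async def triggerer(waits: Dict[str, float]) -> List[str]:
    """Run every wait concurrently in one process and collect them as they fire."""
    pending = [
        asyncio.ensure_future(wait_for_event(name, delay))
        for name, delay in waits.items()
    ]
    fired: List[str] = []
    for finished in asyncio.as_completed(pending):
        # Each wait is handed back as soon as it completes.
        fired.append(await finished)
    return fired


fired = asyncio.run(
    triggerer({"sensor_a": 0.05, "sensor_b": 0.01, "sensor_c": 0.03})
)
print(fired)
```

All three waits share one event loop, so the total time is roughly that of the longest wait, and no worker slot is held while nothing is happening.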

Slide 30

Slide 30 text

Removing Database Connections
APIs scale a lot better!

Slide 31

Slide 31 text

I do like the database, though
There's a lot of benefit in proven technology

Slide 32

Slide 32 text

Software Engineering is not just coding
Any large-scale project needs documentation, architecture, and coordination

Slide 33

Slide 33 text

Maintenance & compatibility is crucial
Anyone can write a tool - supporting it takes effort

Slide 34

Slide 34 text

Airflow is forged by people like you.
Coding, documentation, triage, QA, support - it all needs doing.

Slide 35

Slide 35 text

Thanks.
Andrew Godwin
@andrewgodwin
andrew.godwin@astronomer.io