Stripe • Previously at Etsy and Explorys
• Airflow runs all of our automated data processes
• ~6k jobs
• Mixture of MapReduce, Scalding, Redshift, and miscellaneous processes
jobs and data pipelines
• Ensures jobs are ordered correctly based on their dependencies
• Manages allocation of scarce resources
• Provides a mechanism for tracking job state and recovering from failure
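The ordering guarantee above boils down to a topological sort of the dependency graph: a task runs only after everything it depends on has run. A minimal stdlib sketch (the task names here are hypothetical, not from the deck):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstream edges).
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

# static_order() yields tasks with all dependencies satisfied first.
order = list(TopologicalSorter(deps).static_order())
assert order == ["extract", "transform", "load", "report"]
```

A real scheduler layers state tracking and resource limits on top of this ordering, but the core constraint is the same.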
• Task instance - an individual run of a single task
• DAG - a group of tasks connected into a graph by dependency edges, run on some frequency
• DAG run - an individual run of a DAG
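The three terms above map onto a small data model: one DAG definition, a DAG run per scheduled interval, and a task instance per (task, run) pair. A hedged sketch with illustrative field names, not Airflow's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Dag:
    dag_id: str
    schedule: timedelta            # how often a new DAG run is started
    task_ids: list = field(default_factory=list)

@dataclass
class DagRun:
    dag_id: str
    execution_date: datetime       # one run of the whole graph

@dataclass
class TaskInstance:
    dag_id: str
    task_id: str
    execution_date: datetime       # one run of one task within a DAG run
    state: str = "queued"

# One daily DAG produces one DagRun per day and one TaskInstance per task.
dag = Dag("nightly_etl", timedelta(days=1), ["extract", "load"])
run = DagRun(dag.dag_id, datetime(2016, 1, 1))
instances = [TaskInstance(dag.dag_id, t, run.execution_date) for t in dag.task_ids]
```

The (dag_id, task_id, execution_date) triple is what uniquely identifies a task instance, which is why state tracking and retries can be done per instance.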
• Loads task and DAG definitions and loops to process tasks
• Starts DAG runs at the appropriate times
• Retrieves and updates task state in the database
• Schedules tasks to run via Celery
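The loop described above can be sketched in a few lines. Everything here is a stand-in: the `state` dict plays the role of the metadata database and the `queue` plays the role of the Celery broker:

```python
from collections import deque

# Stand-ins for the metadata database and the Celery queue.
state = {"extract": "success", "transform": "none", "load": "none"}
deps = {"extract": [], "transform": ["extract"], "load": ["transform"]}
queue = deque()

def schedule_once():
    """One pass of the loop: queue every task whose dependencies succeeded."""
    for task, upstream in deps.items():
        if state[task] == "none" and all(state[u] == "success" for u in upstream):
            state[task] = "queued"
            queue.append(task)      # in real Airflow this is a Celery dispatch

def worker_drain():
    """Pretend a worker runs everything queued and reports success."""
    while queue:
        state[queue.popleft()] = "success"

schedule_once()   # only "transform" is runnable on the first pass
worker_drain()
schedule_once()   # "load" becomes runnable once "transform" succeeded
worker_drain()
assert all(s == "success" for s in state.values())
```

The real scheduler persists state between passes, which is what makes recovery from failure possible: after a crash it reloads task state from the database and resumes the loop.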
extensible
• Simple primitives support building complex data pipelines
• The UI has many useful visualizations
• Inspecting DAG structure is straightforward
been not great
• Many users maintain forks
• Visibility when things go wrong could be better
• Large DAGs tend to break the UI
• High-frequency DAGs (sub-hourly intervals) can be flaky