Introduction to Airflow
Andrew Johnson

Background
• Software Engineer on the Data Platform team at Stripe
• Previously at Etsy and Explorys
• Airflow runs all of our automated data processes
  • ~6k jobs
  • Mixture of MapReduce, Scalding, Redshift, and miscellaneous processes

Airflow is a Workflow Engine
• Manages scheduling and running of jobs and data pipelines
• Ensures jobs are ordered correctly based on their dependencies
• Manages allocation of scarce resources
• Provides mechanisms for tracking the state of jobs and recovering from failure

Airflow Concepts
• Task - a defined unit of work
• Task instance - an individual run of some task
• DAG - a group of tasks connected into a graph by dependency edges; runs on some frequency
• DAG run - an individual run of some DAG
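
As a hedged sketch of how this vocabulary maps to code (the dag id, task ids, and dates here are illustrative, and the import paths assume an Airflow 1.x-era layout):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # DAG: a group of tasks, run on some frequency
    dag = DAG('concepts_demo', start_date=datetime(2017, 1, 1),
              schedule_interval='@daily')

    # Tasks: defined units of work, connected by dependency edges
    extract = DummyOperator(task_id='extract', dag=dag)
    load = DummyOperator(task_id='load', dag=dag)
    load.set_upstream(extract)

    # Each @daily execution of the DAG is a DAG run; the runs of
    # 'extract' and 'load' inside it are task instances.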

Airflow Architecture
[Architecture diagram: Database, Scheduler, Celery Workers, Web UI]

Scheduler
• Responsible for running tasks at the appropriate time
• Loads task and DAG definitions and loops to process tasks
• Starts DAG runs at the appropriate times
• Retrieves and updates task state in the database
• Enqueues tasks with Celery for execution

Database
• Stores all Airflow state
• All processes read from and write to it
• DB access goes through SQLAlchemy
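
A minimal sketch of pointing Airflow at the metadata database, assuming the standard sql_alchemy_conn setting in airflow.cfg (the connection URI itself is only an example):

    # airflow.cfg
    [core]
    # SQLAlchemy connection URI for the metadata database
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow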

Celery and Workers
• Celery is a distributed task queue
• Backed by Redis or RabbitMQ
• The scheduler enqueues tasks with Celery
• Some worker picks up and executes each task
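
A sketch of the Celery-related configuration, assuming the CeleryExecutor and a Redis broker (the URLs are illustrative, and key names follow the Airflow 1.x-era config layout):

    # airflow.cfg
    [core]
    executor = CeleryExecutor

    [celery]
    broker_url = redis://localhost:6379/0
    celery_result_backend = redis://localhost:6379/0

Workers are then started on each worker machine with the airflow worker command.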

Web UI
• Allows viewing the state of DAGs and tasks

[Screenshot of the Airflow web UI]

Defining Tasks
• Operators are the core abstraction
• Run shell scripts, Python code
• Custom operators for specialized tasks
• Jinja templates (see the templating sketch below)

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)
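
Extending the snippet above, a sketch of the Jinja templating mentioned in the last bullet, using the built-in ds macro (the task id and command are illustrative):

    templated = BashOperator(
        task_id='templated',
        # {{ ds }} is rendered by Jinja to the execution date (YYYY-MM-DD)
        bash_command='echo "run date is {{ ds }}"',
        dag=dag)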

Defining DAGs
• Define upstream/downstream relationships between tasks
  • t2.set_upstream(t1)
• Every task is assigned to a DAG
  • dag = DAG('tutorial', default_args=default_args)
• Magic import
  • from airflow import DAG
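
Pulling these fragments together, a minimal complete DAG file in the style of the official tutorial (the schedule and default_args values are illustrative):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        'owner': 'airflow',
        'start_date': datetime(2017, 1, 1),
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG('tutorial', default_args=default_args,
              schedule_interval=timedelta(days=1))

    t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
    t2 = BashOperator(task_id='sleep', bash_command='sleep 5', dag=dag)

    # t1 must complete before t2 starts
    t2.set_upstream(t1)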

Resource Management
• Create pools in the UI
• Fixed number of slots to limit parallelism
• Assign priorities to tasks (see the sketch below)
• Priorities propagate upstream
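
A hedged sketch of attaching a task to a pool and giving it a priority, again extending the earlier snippets (the pool name 'redshift' is assumed to have been created in the UI first):

    heavy = BashOperator(
        task_id='heavy_query',
        bash_command='echo querying',
        pool='redshift',       # consumes one of the pool's fixed slots
        priority_weight=10,    # higher-priority tasks are scheduled first
        dag=dag)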

What works well?
• Task definitions are very flexible and extensible
• Simple primitives support building complex data pipelines
• UI has many useful visualizations
• Inspecting DAG structure is straightforward

What doesn’t work well?
• Stability in master has traditionally been poor
• Many users maintain forks
• Visibility when things go wrong could be better
• Large DAGs tend to break the UI
• High-frequency DAGs (sub-hourly intervals) can be flaky

Resources
• https://airflow.incubator.apache.org/
• https://github.com/apache/incubator-airflow

Questions?