
Introduction to Apache Airflow

An overview of Apache Airflow

Andrew Johnson

January 30, 2017

Transcript

  1. Background
     • Software Engineer on the Data Platform team at Stripe
     • Previously at Etsy and Explorys
     • Airflow runs all of our automated data processes
     • ~6k jobs
     • Mixture of MapReduce, Scalding, Redshift, and miscellaneous processes

  2. Airflow is a Workflow Engine
     • Manages scheduling and running of jobs and data pipelines
     • Ensures jobs are ordered correctly based on dependencies
     • Manages allocation of scarce resources
     • Provides a mechanism for tracking the state of jobs and recovering from failure

  3. Airflow Concepts
     • Task - a defined unit of work
     • Task instance - an individual run of a task
     • DAG - a group of tasks connected into a graph by dependency edges, run on some frequency
     • DAG run - an individual run of a DAG (sketched below)

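     A minimal sketch of these four concepts in code, assuming the Airflow
     Python API of that era; the DAG id, schedule, and task ids are
     illustrative, not from the talk:

         from datetime import datetime
         from airflow import DAG
         from airflow.operators.dummy_operator import DummyOperator

         # A DAG: tasks connected by dependency edges, run on a schedule.
         dag = DAG('example_concepts',
                   start_date=datetime(2017, 1, 1),
                   schedule_interval='@daily')

         # Two tasks: defined units of work.
         extract = DummyOperator(task_id='extract', dag=dag)
         load = DummyOperator(task_id='load', dag=dag)

         # A dependency edge: 'load' runs after 'extract'.
         load.set_upstream(extract)

         # Each scheduled execution of the DAG is a "DAG run"; each
         # execution of a task within it is a "task instance".
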
  4. Scheduler
     • Responsible for running tasks at the appropriate time
     • Loads task and DAG definitions and loops to process tasks (sketched below)
     • Starts DAG runs at appropriate times
     • Task state is retrieved from and updated in the database
     • Tasks are handed to Celery to run

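     Illustrative pseudocode for the loop described above; every helper
     name here is hypothetical, and the real scheduler is considerably
     more involved:

         import time

         while True:
             dags = load_dag_definitions()             # parse DAG files
             for dag in dags:
                 if dag.next_run_is_due():
                     create_dag_run(dag)               # start a DAG run on schedule
             for ti in task_instances_ready_to_run():  # state read from the DB
                 enqueue_with_celery(ti)               # a worker will pick it up
                 set_state(ti, 'queued')               # state written back to the DB
             time.sleep(5)                             # heartbeat interval
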
  5. Database
     • Stores all Airflow state
     • All processes read/write from here
     • DB access goes through SQLAlchemy

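     A hedged sketch of reading that state directly with SQLAlchemy; the
     connection string is a placeholder, and the task_instance table and
     columns reflect the metadata schema as of Airflow 1.x:

         from sqlalchemy import create_engine

         engine = create_engine('postgresql://airflow:airflow@localhost/airflow')
         with engine.connect() as conn:
             rows = conn.execute(
                 "SELECT dag_id, task_id, execution_date, state "
                 "FROM task_instance WHERE state = 'failed'")
             for dag_id, task_id, execution_date, state in rows:
                 print(dag_id, task_id, execution_date, state)
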
  6. Celery and Workers
     • Distributed task queue
     • Backed by Redis/RabbitMQ
     • Scheduler enqueues tasks with Celery
     • Some worker executes that task

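     The wiring lives in airflow.cfg; a sketch with placeholder values,
     using the setting names as of Airflow 1.x:

         [core]
         executor = CeleryExecutor

         [celery]
         broker_url = redis://localhost:6379/0
         celery_result_backend = db+postgresql://airflow:airflow@localhost/airflow
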
  7. Defining Tasks
     • Core abstraction of Operators
     • Run shell scripts, Python code
     • Custom operators for specialized tasks (sketched below)
     • Jinja templates

         t1 = BashOperator(
             task_id='print_date',
             bash_command='date',
             dag=dag)

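     A hedged sketch of a custom operator, since the slide mentions them;
     the class name, parameter, and log message are illustrative, and the
     imports match Airflow 1.x:

         import logging

         from airflow.models import BaseOperator
         from airflow.utils.decorators import apply_defaults

         class CopyTableOperator(BaseOperator):
             # Fields listed here are rendered as Jinja templates, so
             # '{{ ds }}' below expands to the execution date at runtime.
             template_fields = ('table_suffix',)

             @apply_defaults
             def __init__(self, table_suffix='{{ ds }}', *args, **kwargs):
                 super(CopyTableOperator, self).__init__(*args, **kwargs)
                 self.table_suffix = table_suffix

             def execute(self, context):
                 # The specialized work goes here; 'context' carries
                 # metadata about the current task instance.
                 logging.info('Copying table partition %s', self.table_suffix)
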
  8. Defining DAGs (full example below)
     • Define upstream/downstream relationships between tasks
           t2.set_upstream(t1)
     • Every task is assigned to a DAG
           dag = DAG('tutorial', default_args=default_args)
     • Magic import
           from airflow import DAG

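     Putting the last two slides together, a minimal but complete DAG
     file in the style of the Airflow tutorial (owner, dates, and
     commands are illustrative):

         from datetime import datetime
         from airflow import DAG
         from airflow.operators.bash_operator import BashOperator

         default_args = {
             'owner': 'airflow',
             'start_date': datetime(2017, 1, 1),
             'retries': 1,
         }

         dag = DAG('tutorial', default_args=default_args,
                   schedule_interval='@daily')

         t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
         t2 = BashOperator(task_id='sleep', bash_command='sleep 5', dag=dag)

         # t2 runs only after t1 succeeds.
         t2.set_upstream(t1)
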
  9. Resource Management
     • Create pools in the UI
     • Fixed number of slots to limit parallelism
     • Assign priorities to tasks (sketched below)
     • Priorities propagate upstream

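     A hedged sketch of attaching a task to a pool and giving it a
     priority; the pool name is illustrative (it must first be created
     in the UI), and both parameters are standard operator arguments:

         t3 = BashOperator(
             task_id='load_redshift',
             bash_command='./load.sh',
             pool='redshift',       # consumes one of the pool's fixed slots
             priority_weight=10,    # higher weight wins when slots are scarce
             dag=dag)
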
  10. What works well?
     • Task definitions are very flexible and extensible
     • Simple primitives support building complex data pipelines
     • UI has many useful visualizations
     • Inspecting DAG structure is straightforward

  11. What doesn’t work well?
     • Stability in master has historically been poor
     • Many users maintain forks
     • Visibility when things go wrong could be better
     • Large DAGs tend to break the UI
     • High-frequency DAGs (< hourly intervals) can be flaky