2016 - Matt Davis - A Practical Introduction to Airflow

PyBay
August 20, 2016

Description
Moving data through transformations and from one place to another is a big part of data science/eng. We’ve been using Airflow for several months at Clover Health and have learned a lot about its strengths and weaknesses. We will use this talk to give a practical introduction to Airflow that gives people the information they need to decide whether Airflow is right for them and how to get started.

Abstract
Airflow is a popular pipeline orchestration tool for Python that allows users to configure complex (or simple!) multi-system workflows that are executed in parallel across any number of workers. A single pipeline might contain bash, Python, and SQL operations. With dependencies specified between tasks, Airflow knows which ones it can run in parallel and which ones must run after others. Airflow is written in Python and users can add their own operators with custom functionality, doing anything Python can do.
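The dependency handling described above can be illustrated with a minimal sketch. It is not Airflow's scheduler and uses no Airflow code: it relies on Python's standard `graphlib` module (Python 3.9+), and the task names are hypothetical. It only shows how declared dependencies determine which tasks may run side by side and which must wait.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "parse_a": {"extract"},
    "parse_b": {"extract"},
    "load": {"parse_a", "parse_b"},
}


def parallel_batches(deps):
    """Group tasks into batches; tasks within one batch have no
    unmet dependencies and could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything whose upstream tasks are all marked done.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in parallel_batches(dependencies):
    print(batch)
# ['extract']
# ['parse_a', 'parse_b']
# ['load']
```

Here `parse_a` and `parse_b` land in the same batch because both depend only on `extract`, while `load` has to wait for both of them.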

At Clover Health, we’ve been pushing Airflow’s limits, digging into the source code, and contributing patches upstream. In this talk, we’ll cover the basics of Airflow so you can use what we’ve learned to start your Airflow journey on the right foot. This talk aims to answer questions such as: What is Airflow useful for? How do I get started? What do I need to know that’s not in the docs?

Bio
I have been a scientific Python developer since 2008. I’ve worked in atmospheric science, astronomy, urban planning, web applications, and healthcare. I maintain several open source Python libraries and am currently a data engineer at Clover Health.

https://youtu.be/4iDjegukrkI

Transcript

  1. A Practical Introduction to Airflow - PyBay 2016

    About Clover
    • Health Insurance Company
    • 100% Python
    • Based in SF
    • Hiring!

    About Matt
    • Data Platform Engineering @ Clover
    • Pythonista since 2008
    • @jiffyclub
  2.

    1. Where does Airflow come from?
    2. What does Airflow do?
    3. UI
    4. Deployment
    5. Pipeline Construction
    6. Tips, Tricks, and FYIs
  3. Airflow Origins

    • Originally from Airbnb in 2015
    • Joined Apache incubator in early 2016
  4. Components

    (Diagram: metadata DB, web server, scheduler, Celery, four workers)
  5. Components

    (Diagram: metadata DB, web server, scheduler + workers)
  6. The DAG

    from datetime import datetime

    import airflow.models as af_models

    # Cron expression '0 10 * * *' means every day at 10:00.
    DAG = af_models.DAG(
        dag_id='my_dag',
        start_date=datetime(2016, 8, 13),
        schedule_interval='0 10 * * *')
  7. Useful DAG Arguments

    • default_args
    • max_active_runs
    • concurrency
  8. A Task

    import airflow.operators as af_op

    first_task = af_op.PythonOperator(
        task_id='my_task',
        python_callable=module.function,
        dag=DAG)
  9. Another Task

    second_task = af_op.PythonOperator(
        task_id='my_second_task',
        python_callable=module.another_func,
        dag=DAG)

    second_task.set_upstream(first_task)
  10. Useful Task Arguments

    • retries
    • pool
    • queue (Celery only)
    • execution_timeout
    • trigger_rule
    • Args for Python callables
    • Environment variables
    • Template variables
  11. Building Pipelines

    for f in files:
        task = af_op.PythonOperator(
            task_id='parsing_{}'.format(f),
            python_callable=parse_file,
            op_kwargs={'fname': f},
            dag=DAG)
        task.set_upstream(first_task)
  12. Executor Types

    • CeleryExecutor
    • SequentialExecutor
    • LocalExecutor
    • MesosExecutor
  13. Local Debugging

    • SequentialExecutor
    • import pdb; pdb.set_trace()
    • airflow test
  14. Local Pipeline Testing

    • start_date=some_past_date
    • schedule_interval='@once'
    • Delete DAG run or task-instances to rerun
  15. Airflow Time

    (Timeline: 2016-08-13T03:00:00, 2016-08-13T15:00:00, 2016-08-14T03:00:00)
    The run that starts here… …has this “execution date”.
  16. Separate Logic

    • Don’t mix task logic and Airflow
    • Be able to test & run without involving Airflow
  17. Deploying New Code

    • Airflow scheduler & workers are long-running Python processes
    • Have to restart them to pick up new code changes
    • (except in DAG files)
  18. Deploying New Code

    • airflow scheduler --num_runs
    • CELERYD_MAX_TASKS_PER_CHILD
  19. Why Clover Chose Airflow

    • Written in Python
    • Nice UI
    • Programmatic pipeline construction
    • Can run complex pipelines
  20. Useful Stuff

    Airflow Docs: http://airflow.incubator.apache.org
    Clover Health: https://www.cloverhealth.com
    Matt Davis aka @jiffyclub: https://penandpants.com
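
The "Airflow Time" slide is the easiest one to misread in transcript form, so here is a minimal sketch of the idea using only the standard library. It assumes the 12-hour schedule implied by the three timestamps on that slide, and it assumes the convention of Airflow releases from that period: a scheduled run starts once its schedule period has ended, and its "execution date" is the start of that period. The transcript does not preserve which timestamps the slide's arrows pointed at, and the function name is hypothetical, not part of Airflow.

```python
from datetime import datetime, timedelta


def execution_date_for(run_start, interval):
    """Return the execution date of a scheduled run that starts at
    `run_start`: the beginning of the period that just ended."""
    return run_start - interval


# 12-hour interval inferred from the timestamps on the slide.
interval = timedelta(hours=12)
run_start = datetime(2016, 8, 14, 3, 0)

print(execution_date_for(run_start, interval).isoformat())
# 2016-08-13T15:00:00
```

In other words, the execution date labels the data interval a run covers, not the wall-clock moment the run begins.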