Developing elegant workflows with Apache Airflow

Every time a new batch of data comes in, you start a set of tasks. Some tasks can run in parallel, some must run in a sequence, perhaps on a number of different machines. That’s a workflow.

Did you ever draw a block diagram of your workflow? Imagine you could bring that diagram to life and actually run it as it looks on the whiteboard. With Airflow you can just about do that.

Apache Airflow is an open-source Python tool for orchestrating data processing pipelines. In each workflow, tasks are arranged into a directed acyclic graph (DAG). The shape of this graph determines the overall logic of the workflow. A DAG can have many branches, and you can decide at execution time which of them to follow and which to skip.

This creates a resilient design because each task can be retried multiple times if an error occurs. Airflow can even be stopped entirely, and running workflows will resume by restarting the last unfinished task. Logs for each task are stored separately and are easily accessible through a friendly web UI.

In my talk I will go over basic Airflow concepts and demonstrate through examples how easy it is to define your own workflows in Python code. We’ll also go over ways to extend Airflow by adding custom task operators, sensors and plugins.

Talk by Michal Karzynski

Michał Karzyński

July 13, 2017

Transcript

  1. 2.

    ABOUT ME
    • Michał Karzyński (@postrational)
    • Full stack geek (Python, JavaScript and Linux)
    • I blog at http://michal.karzynski.pl
    • I’m a tech lead at and a consultant at .com
  2. 4.

    WHAT IS A WORKFLOW?
    • sequence of tasks
    • started on a schedule or triggered by an event
    • frequently used to handle big data processing pipelines
  3. 6.

    EXAMPLES EVERYWHERE
    • Extract, Transform, Load (ETL)
    • data warehousing
    • A/B testing
    • anomaly detection
    • training recommender systems
    • orchestrating automated testing
    • processing genomes every time a new genome file is published
  4. 8.

    APACHE AIRFLOW
    • open source, written in Python
    • developed originally by Airbnb
    • 280+ contributors, 4000+ commits, 5000+ stars
    • used by Intel, Airbnb, Yahoo, PayPal, WePay, Stripe, Blue Yonder…
  5. 9.

    APACHE AIRFLOW
    1. Framework to write your workflows
    2. Scalable executor and scheduler
    3. Rich web UI for monitoring and logs
  6. 10.
  7. 11.

    WHAT FLOWS IN A WORKFLOW?
    Tasks make decisions based on:
    • workflow input
    • upstream task output
    Information flows downstream like a river.
    photo by Steve Byrne
  8. 15.
  9. 16.

    AIRFLOW CONCEPTS: DAGS
    • DAG - Directed Acyclic Graph
    • Define workflow logic as shape of the graph
  10. 17.

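    import datetime

    # Airflow 1.x import paths
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import PythonOperator

    def print_hello():
        return 'Hello world!'

    # Run every day at 12:00; do not backfill runs missed since start_date.
    dag = DAG('hello_world',
              description='Simple tutorial DAG',
              schedule_interval='0 12 * * *',
              start_date=datetime.datetime(2017, 7, 13),
              catchup=False)

    with dag:
        dummy_task = DummyOperator(task_id='dummy', retries=3)
        hello_task = PythonOperator(task_id='hello', python_callable=print_hello)
        # hello_task runs only after dummy_task
        dummy_task >> hello_task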
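
To see how the shape of the graph alone fixes the execution order, here is a minimal sketch that does not use Airflow at all. It relies on the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The task names `report` and `cleanup` and the helper `execution_waves` are hypothetical, and this is not meant to describe how Airflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# 'dummy' and 'hello' mirror the slide; 'report' and 'cleanup' are hypothetical.
dependencies = {
    'dummy': set(),
    'hello': {'dummy'},
    'report': {'hello'},
    'cleanup': {'hello'},
}

def execution_waves(deps):
    """Group tasks into waves; tasks within one wave could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose upstream is done
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(dependencies)
print(waves)  # [['dummy'], ['hello'], ['cleanup', 'report']]
```

The last wave contains two tasks because both depend only on `hello`; nothing in the graph orders them relative to each other.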
  11. 18.

    AIRFLOW CONCEPTS: OPERATOR
    • definition of a single task
    • will retry automatically
    • should be idempotent
    • Python class with an execute method
  12. 19.

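    import logging

    # Airflow 1.x import paths
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    log = logging.getLogger(__name__)

    class MyFirstOperator(BaseOperator):

        @apply_defaults
        def __init__(self, my_param, *args, **kwargs):
            self.task_param = my_param
            super(MyFirstOperator, self).__init__(*args, **kwargs)

        def execute(self, context):
            # Called by Airflow when the task instance runs.
            log.info('Hello World!')
            log.info('my_param: %s', self.task_param)

    with dag:
        my_first_task = MyFirstOperator(my_param='This is a test.',
                                        task_id='my_task')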
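
The operator slide says tasks are retried automatically and should be idempotent. The following sketch, in plain Python with no Airflow dependency, illustrates why the two go together. The helper `run_with_retries`, the `output` dictionary and the task `load_daily_total` are all hypothetical stand-ins, not Airflow APIs.

```python
def run_with_retries(task, retries):
    """Call task() up to retries + 1 times; re-raise the last error."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise

# A stand-in for an output table, keyed by execution date.
output = {}
attempts = {'count': 0}

def load_daily_total():
    """Idempotent: overwrites the same key on every attempt, never appends."""
    attempts['count'] += 1
    output['2017-07-13'] = 42  # the write happens before the failure
    if attempts['count'] < 3:
        raise RuntimeError('transient failure after writing')
    return 'done'

result = run_with_retries(load_daily_total, retries=3)
print(result, attempts['count'], output)  # done 3 {'2017-07-13': 42}
```

Because the task overwrites a key derived from the execution date, the two failed attempts leave no duplicate rows behind. A task that appended to a list instead would have stored the value three times.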
  13. 20.

    AIRFLOW CONCEPTS: SENSORS
    • long running task
    • useful for monitoring external processes
    • Python class with a poke method
    • poke will be called repeatedly until it returns True
  14. 21.

    import logging
    from datetime import datetime

    # Airflow 1.x import path
    from airflow.operators.sensors import BaseSensorOperator

    log = logging.getLogger(__name__)

    class MyFirstSensor(BaseSensorOperator):

        def poke(self, context):
            current_minute = datetime.now().minute
            if current_minute % 3 != 0:
                log.info('Current minute (%s) is not divisible by 3, '
                         'sensor will retry.', current_minute)
                return False

            log.info('Current minute (%s) is divisible by 3, '
                     'sensor finishing.', current_minute)
            # Publish the value for downstream tasks.
            task_instance = context['task_instance']
            task_instance.xcom_push('sensors_minute', current_minute)
            return True
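
The poke contract described on the sensor slides (call `poke` repeatedly until it returns True) can be sketched without Airflow as a simple loop. The function `run_sensor` is hypothetical; its `poke_interval` and `timeout` arguments are named after the corresponding sensor parameters in Airflow, but the loop itself is only an illustration, not Airflow's implementation.

```python
import time

def run_sensor(poke, poke_interval=0.0, timeout=5.0):
    """Call poke() until it returns True; give up once timeout seconds pass."""
    started = time.monotonic()
    pokes = 0
    while True:
        pokes += 1
        if poke():
            return pokes
        if time.monotonic() - started > timeout:
            raise TimeoutError('sensor timed out after %d pokes' % pokes)
        time.sleep(poke_interval)

# A fake external condition: a sequence of observed minutes.
minutes = iter([1, 2, 3])

def poke():
    """Succeeds only when the observed minute is divisible by 3."""
    return next(minutes) % 3 == 0

pokes_needed = run_sensor(poke)
print(pokes_needed)  # 3
```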
  15. 22.

    AIRFLOW CONCEPTS: XCOM
    • means of communication between task instances
    • saved in database as a pickled object
    • best suited for small pieces of data (ids, etc.)
  16. 23.

    XCom Push:

    def execute(self, context):
        ...
        task_instance = context['task_instance']
        task_instance.xcom_push('sensors_minute', current_minute)

    XCom Pull:

    def execute(self, context):
        ...
        task_instance = context['task_instance']
        sensors_minute = task_instance.xcom_pull('sensor_task_id',
                                                 key='sensors_minute')
        log.info('Valid minute as determined by sensor: %s', sensors_minute)
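
To make "saved in database as a pickled object" concrete, here is a small stand-alone sketch using only `sqlite3` and `pickle` from the standard library. The table layout and the module-level functions `xcom_push` and `xcom_pull` are simplifying assumptions for illustration; Airflow's own XCom table carries more columns than this, and these functions are not its API.

```python
import pickle
import sqlite3

# In-memory database standing in for Airflow's metadata database.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE xcom ('
    'task_id TEXT, key TEXT, value BLOB, PRIMARY KEY (task_id, key))'
)

def xcom_push(task_id, key, value):
    """Pickle the value and store it under (task_id, key)."""
    conn.execute('INSERT OR REPLACE INTO xcom VALUES (?, ?, ?)',
                 (task_id, key, pickle.dumps(value)))

def xcom_pull(task_id, key):
    """Return the unpickled value, or None if nothing was pushed."""
    row = conn.execute(
        'SELECT value FROM xcom WHERE task_id = ? AND key = ?',
        (task_id, key)).fetchone()
    return pickle.loads(row[0]) if row else None

xcom_push('sensor_task_id', 'sensors_minute', 21)
pulled = xcom_pull('sensor_task_id', 'sensors_minute')
missing = xcom_pull('sensor_task_id', 'no_such_key')
print(pulled, missing)  # 21 None
```

Every value makes a round trip through the database, which is one reason the slide recommends XCom for small pieces of data such as IDs rather than for the datasets themselves.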
  17. 24.

    SCAN FOR INFORMATION UPSTREAM

    def execute(self, context):
        log.info('XCom: Scanning upstream tasks for Database IDs')
        task_instance = context['task_instance']

        upstream_tasks = self.get_flat_relatives(upstream=True)
        upstream_task_ids = [task.task_id for task in upstream_tasks]
        upstream_database_ids = task_instance.xcom_pull(
            task_ids=upstream_task_ids, key='db_id')

        log.info('XCom: Found the following Database IDs: %s',
                 upstream_database_ids)
  18. 25.

    REUSABLE OPERATORS
    • loosely coupled
    • with few necessary XCom parameters
    • most parameters are optional
    • sane defaults
    • will adapt if information appears upstream
  19. 27.

    CONDITIONAL EXECUTION: BRANCH OPERATOR
    • decide which branch of the graph to follow
    • all others will be skipped
  20. 28.

    CONDITIONAL EXECUTION: BRANCH OPERATOR

    def choose():
        return 'first'

    with dag:
        branching = BranchPythonOperator(task_id='branching',
                                         python_callable=choose)
        branching >> DummyOperator(task_id='first')
        branching >> DummyOperator(task_id='second')
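
The effect of the branch operator (follow the task whose id the callable returns, skip the rest) can be illustrated with a few lines of plain Python. The helper `resolve_branch` and the state labels `'run'` and `'skipped'` are hypothetical and only model the outcome, not Airflow's internals.

```python
def resolve_branch(choose, downstream_task_ids):
    """Mark the chosen downstream task to run and all others as skipped."""
    chosen = choose()
    if chosen not in downstream_task_ids:
        raise ValueError('unknown task id: %r' % chosen)
    return {task_id: ('run' if task_id == chosen else 'skipped')
            for task_id in downstream_task_ids}

def choose():
    # Same callable as on the slide: always picks the 'first' branch.
    return 'first'

states = resolve_branch(choose, ['first', 'second'])
print(states)  # {'first': 'run', 'second': 'skipped'}
```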
  21. 29.

    CONDITIONAL EXECUTION: AIRFLOW SKIP EXCEPTION
    • raise AirflowSkipException to skip execution of current task
    • all other exceptions cause retries and ultimately the task to fail
    • puts a dam in the river

    def execute(self, context):
        ...
        if not conditions_met:
            log.info('Conditions not met, skipping.')
            raise AirflowSkipException()
  22. 30.

    CONDITIONAL EXECUTION: TRIGGER RULES
    • decide when a task is triggered
    • defaults to all_success
    • all_done - opens dam from downstream task

    class TriggerRule(object):
        ALL_SUCCESS = 'all_success'
        ALL_FAILED = 'all_failed'
        ALL_DONE = 'all_done'
        ONE_SUCCESS = 'one_success'
        ONE_FAILED = 'one_failed'
        DUMMY = 'dummy'
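
As a rough illustration of what the rule names mean, the sketch below evaluates each rule against the final states of a task's upstream tasks. The function `should_trigger` is hypothetical and deliberately simplified: it assumes every upstream task has already finished as `'success'`, `'failed'` or `'skipped'`, whereas a real scheduler also has to deal with tasks that are still running.

```python
FINISHED = ('success', 'failed', 'skipped')

def should_trigger(rule, upstream_states):
    """Decide whether a task may start, given its upstream tasks' final states."""
    if rule == 'all_success':
        return all(s == 'success' for s in upstream_states)
    if rule == 'all_failed':
        return all(s == 'failed' for s in upstream_states)
    if rule == 'all_done':
        return all(s in FINISHED for s in upstream_states)
    if rule == 'one_success':
        return any(s == 'success' for s in upstream_states)
    if rule == 'one_failed':
        return any(s == 'failed' for s in upstream_states)
    if rule == 'dummy':
        return True  # dependencies are ignored
    raise ValueError('unknown trigger rule: %r' % rule)

# One upstream task succeeded, the other raised a skip exception.
upstream = ['success', 'skipped']
print(should_trigger('all_success', upstream))  # False
print(should_trigger('all_done', upstream))     # True
```

With the default `all_success`, the skipped upstream task blocks the downstream one, which is the "dam in the river" from the previous slide. Setting `all_done` on the downstream task lets it run anyway.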
  23. 31.

    BASH COMMANDS AND TEMPLATES
    • execute Bash command on Worker node
    • use Jinja templates to generate a Bash script
    • define macros - Python functions used in templates
  24. 32.

    BASH COMMANDS AND TEMPLATES

    templated_command = """
    {% for i in range(5) %}
        echo "execution date: {{ ds }}"
        echo "{{ params.my_param }}"
    {% endfor %}
    """

    BashOperator(
        task_id='templated',
        bash_command=templated_command,
        params={'my_param': 'Value I passed in'},
        dag=dag)
  25. 33.

    AIRFLOW PLUGINS
    • Add many types of components used by Airflow
    • Subclass of AirflowPlugin
    • File placed in AIRFLOW_HOME/plugins
  26. 34.

    AIRFLOW PLUGINS

    from airflow.plugins_manager import AirflowPlugin

    class MyPlugin(AirflowPlugin):
        name = "my_plugin"

        # A list of classes derived from BaseOperator
        operators = []

        # A list of menu links (flask_admin.base.MenuLink)
        menu_links = []

        # A list of objects created from a class derived from flask_admin.BaseView
        admin_views = []

        # A list of Blueprint objects created from flask.Blueprint
        flask_blueprints = []

        # A list of classes derived from BaseHook (connection clients)
        hooks = []

        # A list of classes derived from BaseExecutor (e.g. MesosExecutor)
        executors = []