
Developing elegant workflows with Apache Airflow

Every time a new batch of data comes in, you start a set of tasks. Some tasks can run in parallel, some must run in a sequence, perhaps on a number of different machines. That’s a workflow.

Did you ever draw a block diagram of your workflow? Imagine you could bring that diagram to life and actually run it as it looks on the whiteboard. With Airflow you can just about do that.

Apache Airflow is an open-source Python tool for orchestrating data processing pipelines. In each workflow, tasks are arranged into a directed acyclic graph (DAG), and the shape of this graph determines the overall logic of the workflow. A DAG can have many branches, and you can decide which of them to follow and which to skip at execution time.

This creates a resilient design, because each task can be retried multiple times if an error occurs. Airflow can even be stopped entirely; running workflows will resume from their last unfinished tasks once it restarts. Logs for each task are stored separately and are easily accessible through a friendly web UI.

In my talk I will go over basic Airflow concepts and demonstrate through examples how easy it is to define your own workflows in Python code. We'll also go over ways to extend Airflow by adding custom task operators, sensors and plugins.

Talk by Michał Karzyński
July 13, 2017

Transcript

  1. DEVELOPING ELEGANT WORKFLOWS
    with Apache Airflow
    Michał Karzyński • EuroPython 2017

  2. ABOUT ME
    • Michał Karzyński (@postrational)
    • Full stack geek (Python, JavaScript and Linux)
    • I blog at http://michal.karzynski.pl
    • I’m a tech lead at and a consultant at .com

  3. LET’S TALK ABOUT WORKFLOWS

  4. WHAT IS A WORKFLOW?
    • sequence of tasks
    • started on a schedule or triggered by an event
    • frequently used to handle big data processing pipelines

  5. A TYPICAL WORKFLOW

  6. EXAMPLES EVERYWHERE
    • Extract, Transform, Load (ETL)
    • data warehousing
    • A/B testing
    • anomaly detection
    • training recommender systems
    • orchestrating automated testing
    • processing genomes every time a new genome file is published

  7. WORKFLOW MANAGERS
    • Airflow
    • Azkaban
    • Luigi
    • Oozie
    • Taskflow

  8. APACHE AIRFLOW
    • open source, written in Python
    • developed originally by Airbnb
    • 280+ contributors, 4000+ commits, 5000+ stars
    • used by Intel, Airbnb, Yahoo, PayPal, WePay, Stripe, Blue Yonder…

  9. APACHE AIRFLOW
    1. Framework to write your workflows
    2. Scalable executor and scheduler
    3. Rich web UI for monitoring and logs

  10. Demo

  11. WHAT FLOWS IN A WORKFLOW?
    Tasks make decisions based on:
    • workflow input
    • upstream task output
    Information flows downstream like a river.
    photo by Steve Byrne

  12. SOURCE AND TRIBUTARIES

  13. DISTRIBUTARIES AND DELTAS

  14. BRANCHES?
    Directed Acyclic Graph (DAG)

  15. FLOW

  16. AIRFLOW CONCEPTS: DAGS
    • DAG - Directed Acyclic Graph
    • Define workflow logic as the shape of the graph

  17. import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import PythonOperator

    def print_hello():
        return 'Hello world!'

    dag = DAG('hello_world', description='Simple tutorial DAG',
              schedule_interval='0 12 * * *',
              start_date=datetime.datetime(2017, 7, 13), catchup=False)

    with dag:
        dummy_task = DummyOperator(task_id='dummy', retries=3)
        hello_task = PythonOperator(task_id='hello', python_callable=print_hello)

        dummy_task >> hello_task
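
    Assuming the file is saved under AIRFLOW_HOME/dags, a single task from this DAG can be exercised in isolation with the Airflow 1.x-era CLI, for example:

    airflow test hello_world hello 2017-07-13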

  18. AIRFLOW CONCEPTS: OPERATOR
    • definition of a single task
    • will retry automatically
    • should be idempotent
    • Python class with an execute method

  19. import logging

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    log = logging.getLogger(__name__)

    class MyFirstOperator(BaseOperator):

        @apply_defaults
        def __init__(self, my_param, *args, **kwargs):
            self.task_param = my_param
            super(MyFirstOperator, self).__init__(*args, **kwargs)

        def execute(self, context):
            log.info('Hello World!')
            log.info('my_param: %s', self.task_param)

    with dag:
        my_first_task = MyFirstOperator(my_param='This is a test.',
                                        task_id='my_task')

  20. AIRFLOW CONCEPTS: SENSORS
    • long running task
    • useful for monitoring external processes
    • Python class with a poke method
    • poke will be called repeatedly until it returns True

  21. from datetime import datetime

    from airflow.operators.sensors import BaseSensorOperator

    class MyFirstSensor(BaseSensorOperator):

        def poke(self, context):
            current_minute = datetime.now().minute

            if current_minute % 3 != 0:
                log.info('Current minute (%s) is not divisible by 3, '
                         'sensor will retry.', current_minute)
                return False

            log.info('Current minute (%s) is divisible by 3, '
                     'sensor finishing.', current_minute)
            task_instance = context['task_instance']
            task_instance.xcom_push('sensors_minute', current_minute)
            return True
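
    Airflow calls poke every poke_interval seconds (60 by default) until it returns True or the sensor's timeout is reached.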

  22. AIRFLOW CONCEPTS: XCOM
    • means of communication between task instances
    • saved in database as a pickled object
    • best suited for small pieces of data (ids, etc.)

  23. XCom Push:

    def execute(self, context):
        ...
        task_instance = context['task_instance']
        task_instance.xcom_push('sensors_minute', current_minute)

    XCom Pull:

    def execute(self, context):
        ...
        task_instance = context['task_instance']
        sensors_minute = task_instance.xcom_pull('sensor_task_id', key='sensors_minute')
        log.info('Valid minute as determined by sensor: %s', sensors_minute)

  24. SCAN FOR INFORMATION UPSTREAM

    def execute(self, context):
        log.info('XCom: Scanning upstream tasks for Database IDs')
        task_instance = context['task_instance']

        upstream_tasks = self.get_flat_relatives(upstream=True)
        upstream_task_ids = [task.task_id for task in upstream_tasks]
        upstream_database_ids = task_instance.xcom_pull(task_ids=upstream_task_ids, key='db_id')

        log.info('XCom: Found the following Database IDs: %s', upstream_database_ids)
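
    Because the operator scans all of its upstream relatives instead of naming a single task, it keeps working when the shape of the DAG changes; upstream tasks that never pushed a db_id simply contribute None to the pulled values.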

  25. REUSABLE OPERATORS
    • loosely coupled
    • with few necessary XCom parameters
    • most parameters are optional
    • sane defaults
    • will adapt if information appears upstream
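
    A minimal sketch of such an operator (TransferOperator and its db_id parameter are illustrative names, not from the talk): the parameter is optional, and when it is omitted the operator falls back to whatever an upstream task pushed over XCom.

    import logging

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    log = logging.getLogger(__name__)

    class TransferOperator(BaseOperator):

        @apply_defaults
        def __init__(self, db_id=None, *args, **kwargs):
            # db_id is optional with a sane default; the operator
            # adapts if the value appears upstream instead.
            self.db_id = db_id
            super(TransferOperator, self).__init__(*args, **kwargs)

        def execute(self, context):
            db_id = self.db_id
            if db_id is None:
                # Loosely coupled: pull the value from whichever
                # upstream task pushed it, without naming one explicitly.
                task_instance = context['task_instance']
                upstream_ids = [t.task_id for t in self.get_flat_relatives(upstream=True)]
                pulled = task_instance.xcom_pull(task_ids=upstream_ids, key='db_id')
                db_id = next((i for i in pulled if i is not None), None)
            log.info('Using database ID: %s', db_id)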

  26. A TYPICAL WORKFLOW
    [diagram of the typical workflow, annotated with: Sensor, Operators, XCom]

  27. CONDITIONAL EXECUTION:
    BRANCH OPERATOR
    • decide which branch of the graph to follow
    • all others will be skipped

  28. CONDITIONAL EXECUTION:
    BRANCH OPERATOR
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.python_operator import BranchPythonOperator

    def choose():
        return 'first'

    with dag:
        branching = BranchPythonOperator(task_id='branching', python_callable=choose)
        branching >> DummyOperator(task_id='first')
        branching >> DummyOperator(task_id='second')
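
    Here choose returns the task_id of the branch to follow: the 'first' task will run, while 'second' and everything downstream of it will be marked as skipped.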

  29. CONDITIONAL EXECUTION:
    AIRFLOW SKIP EXCEPTION
    • raise AirflowSkipException to skip execution of the current task
    • all other exceptions cause retries and ultimately the task to fail
    • puts a dam in the river

    def execute(self, context):
        ...
        if not conditions_met:
            log.info('Conditions not met, skipping.')
            raise AirflowSkipException()

  30. CONDITIONAL EXECUTION:
    TRIGGER RULES
    • decide when a task is triggered
    • defaults to all_success
    • all_done - opens the dam from a downstream task

    class TriggerRule(object):
        ALL_SUCCESS = 'all_success'
        ALL_FAILED = 'all_failed'
        ALL_DONE = 'all_done'
        ONE_SUCCESS = 'one_success'
        ONE_FAILED = 'one_failed'
        DUMMY = 'dummy'
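
    As a sketch of the third bullet (the cleanup task is a hypothetical example), a task that should run once all upstream tasks finish, whether they succeeded, failed or were skipped, opens the dam like this:

    from airflow.operators.dummy_operator import DummyOperator
    from airflow.utils.trigger_rule import TriggerRule

    with dag:
        # Fires when every upstream task is done, regardless of
        # success, failure or skipped state.
        cleanup_task = DummyOperator(task_id='cleanup',
                                     trigger_rule=TriggerRule.ALL_DONE)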

  31. BASH COMMANDS AND TEMPLATES
    • execute Bash command on Worker node
    • use Jinja templates to generate a Bash script
    • define macros - Python functions used in templates

  32. BASH COMMANDS AND TEMPLATES
    from airflow.operators.bash_operator import BashOperator

    templated_command = """
    {% for i in range(5) %}
        echo "execution date: {{ ds }}"
        echo "{{ params.my_param }}"
    {% endfor %}
    """

    BashOperator(
        task_id='templated',
        bash_command=templated_command,
        params={'my_param': 'Value I passed in'},
        dag=dag)
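
    The macros mentioned above are plain Python functions made available inside templates. A minimal sketch (the seven_days_ago helper is a hypothetical example) registers one on the DAG with user_defined_macros:

    import datetime

    from airflow import DAG

    def seven_days_ago():
        # Plain Python function exposed to Jinja templates as a macro.
        return datetime.date.today() - datetime.timedelta(days=7)

    dag = DAG('macro_example', start_date=datetime.datetime(2017, 7, 13),
              user_defined_macros={'seven_days_ago': seven_days_ago})

    templated_command = 'echo "Processing data since: {{ seven_days_ago() }}"'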

  33. AIRFLOW PLUGINS
    • Add many types of components used by Airflow
    • Subclass of AirflowPlugin
    • File placed in AIRFLOW_HOME/plugins

  34. AIRFLOW PLUGINS
    from airflow.plugins_manager import AirflowPlugin

    class MyPlugin(AirflowPlugin):
        name = "my_plugin"

        # A list of classes derived from BaseOperator
        operators = []
        # A list of menu links (flask_admin.base.MenuLink)
        menu_links = []
        # A list of objects created from a class derived from flask_admin.BaseView
        admin_views = []
        # A list of Blueprint objects created from flask.Blueprint
        flask_blueprints = []
        # A list of classes derived from BaseHook (connection clients)
        hooks = []
        # A list of classes derived from BaseExecutor (e.g. MesosExecutor)
        executors = []

  35. THANK YOU
    Introductory Airflow tutorial available on my blog:
    michal.karzynski.pl
