Apache Airflow

Stefan Seelmann

July 26, 2018

Transcript

  1. Apache Airflow: Agenda (SimScale © 2018)

    • Introduction
    • Core concepts
    • Web UI Demo
    • Architecture and Deployment Options
    • Use Cases and Patterns
    • Usage at SimScale
    • Tips and Pitfalls
  2. Apache Airflow Facts

    • Created in 2014 by Maxime Beauchemin at Airbnb
    • At Apache (incubator) since 2016, Apache License
    • 400+ contributors
    • Used by 150+ companies
    • Current version 1.9; 1.10 under vote
    • Documentation: https://airflow.apache.org/
    • Code: https://github.com/apache/incubator-airflow
  3. What is it?

    Airflow is a platform to programmatically author, schedule and monitor workflows*
    • A workflow is a DAG of tasks, defined as Python code
    • The Scheduler executes tasks on workers
    • A Web UI to visualize, monitor, and troubleshoot runs
    * https://airflow.apache.org/index.html
  4. Concepts

    • DAG
      ◦ Collection of tasks with their relationships and dependencies
    • Task
      ◦ Instance of an operator
    • Operator and Sensor
      ◦ Python class that defines what to do
      ◦ PythonOperator, BashOperator
      ◦ SimpleHttpOperator, PostgresOperator, DockerOperator
      ◦ S3KeySensor, HdfsSensor
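To make the DAG idea concrete without an Airflow installation, here is a minimal sketch that orders tasks by their dependencies using Python's standard `graphlib` module (Python 3.9+). The task names and the `run_order` helper are made up for illustration, and this is not how Airflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter


def run_order(upstream):
    """Return one valid execution order for the given tasks.

    `upstream` maps each task name to the set of tasks that must finish first.
    """
    return list(TopologicalSorter(upstream).static_order())


# Hypothetical three-step pipeline: extract -> transform -> load
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
print(run_order(deps))  # ['extract', 'transform', 'load']
```

A dependency cycle makes `TopologicalSorter` raise `graphlib.CycleError`, which mirrors why a workflow has to be acyclic.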
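  5. Concepts

    # Imports follow the Airflow 1.x module layout
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    # default_args is not shown on the slide; these values are placeholders
    default_args = {'owner': 'airflow', 'start_date': datetime(2018, 7, 1)}

    dag = DAG(dag_id='demo', default_args=default_args, schedule_interval=None)
    with dag:
        task1 = PythonOperator(task_id='demo_python_task', python_callable=lambda: print("foo"))
        task2 = BashOperator(task_id='demo_bash_task', bash_command='echo "bar"')
        task1 >> task2  # task1 runs before task2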
  6. Concepts

    • DAG run and task instance
      ◦ Specific run of a DAG and each of its tasks
      ◦ Identified by dag_id, task_id, and execution_date
      ◦ Persisted in metadata database (tables dag_run and task_instance)
      ◦ Have a mutating state: queued, running, success, failed, skipped, ...
    • Many more:
      ◦ Conf, XCom, Variables, Hooks, SubDAGs, Templating, Trigger rules, SLAs, ...
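The identity and state of a task instance can be pictured with a small stand-in class. This is only a sketch: `TaskInstanceRecord` and `set_state` are hypothetical names, the state set is the subset listed on the slide, and Airflow's real model class has many more fields.

```python
from dataclasses import dataclass
from datetime import datetime

# Subset of the states named on the slide
STATES = {"queued", "running", "success", "failed", "skipped"}


@dataclass
class TaskInstanceRecord:
    """Hypothetical stand-in for one row of the task_instance table."""
    dag_id: str
    task_id: str
    execution_date: datetime
    state: str = "queued"

    @property
    def key(self):
        # The triple that identifies this task instance
        return (self.dag_id, self.task_id, self.execution_date)

    def set_state(self, new_state):
        # The state changes over the lifetime of the run; the key never does
        if new_state not in STATES:
            raise ValueError(f"unknown state: {new_state}")
        self.state = new_state


ti = TaskInstanceRecord("demo", "demo_python_task", datetime(2018, 7, 26))
ti.set_state("running")
ti.set_state("success")
print(ti.key, ti.state)
```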
  7. Web UI Demo

    • DAGs view
    • Tree view
    • Graph view
    • Gantt chart view
    • Logs
  8. Architecture

    • Scheduler: Single instance, multiple threads
    • Webserver
    • Metadata database
    • DAG files
    • Configuration
    • Executor and workers
    https://www.slideshare.net/sumitmaheshwari007/apache-airflow
  9. Architecture - Executors

    • SequentialExecutor
      ◦ For local testing and debugging
    • LocalExecutor
      ◦ Coupled to scheduler process, vertical scaling
    • CeleryExecutor
      ◦ Separate worker nodes, complex setup, requires message queue
    • MesosExecutor (contrib)
    • KubernetesExecutor (since 1.10)
  10. Architecture - Limit resource usage

    • Global: dag_concurrency, max_active_runs_per_dag
    • Per DAG: max_active_runs, concurrency
    • Per Task: task_concurrency, pool, queue
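As a rough illustration of what a pool limit does, the sketch below admits queued tasks only while their pool still has a free slot. It is a simplified model, not Airflow's scheduling code: the `admit` function is hypothetical, every task is assumed to occupy exactly one slot, and a pool missing from `pool_slots` is treated as having no slots.

```python
from collections import Counter


def admit(queued, running, pool_slots):
    """Pick queued tasks that may start without exceeding any pool's slots.

    queued:     list of (task_id, pool) tuples in priority order
    running:    list of pool names, one entry per currently running task
    pool_slots: dict mapping pool name to its total number of slots
    """
    used = Counter(running)
    started = []
    for task_id, pool in queued:
        if used[pool] < pool_slots.get(pool, 0):
            used[pool] += 1  # the admitted task now occupies a slot
            started.append(task_id)
    return started


queued = [("a", "db"), ("b", "db"), ("c", "api")]
print(admit(queued, running=["db"], pool_slots={"db": 2, "api": 1}))  # ['a', 'c']
```

Task "b" stays queued because the "db" pool is full once "a" has been admitted.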
  11. Use Cases and Patterns

    • Use cases:
      ◦ ETL, data pipelines, machine learning
      ◦ Arbitrary workflows
    • Scheduling
      ◦ Periodic: execution_date, catchup, start_date
      ◦ External trigger: schedule_interval=None + DAG run conf
    • Execution patterns
      ◦ Processing
      ◦ Orchestration (trigger work on external system + sensor)
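How execution_date, start_date and catchup interact in periodic scheduling is easy to get wrong, so here is a small sketch of the rule as commonly described: a run stamped with execution_date D covers the period starting at D and is only started once that period has ended. The `due_execution_dates` helper is hypothetical and ignores everything a real scheduler also considers (existing runs, end dates, cron expressions).

```python
from datetime import datetime, timedelta


def due_execution_dates(start_date, interval, now, catchup=True):
    """List execution_dates whose schedule interval has fully elapsed by `now`."""
    dates = []
    current = start_date
    # A run for execution_date D is only due once D + interval has passed
    while current + interval <= now:
        dates.append(current)
        current += interval
    if not catchup:
        # Without catchup, only the most recent completed interval is run
        return dates[-1:]
    return dates


start = datetime(2018, 7, 1)
now = datetime(2018, 7, 4, 12)
print(due_execution_dates(start, timedelta(days=1), now))
print(due_execution_dates(start, timedelta(days=1), now, catchup=False))
```

With a daily interval and "now" at noon on July 4, the first call yields July 1, 2 and 3, and the second only July 3; the run for July 4 is not due until July 5.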
  12. Security

    • “By default, all gates are opened.”
    • Web UI
      ◦ Pluggable authentication (LDAP, Kerberos, OAuth)
      ◦ RBAC (since 1.10)
    • REST API (experimental)
      ◦ Deny or Kerberos
    • CLI
  13. Usage at SimScale - About SimScale

    • Simulation platform for engineers
    • Supported analysis types:
      ◦ CFD (Computational Fluid Dynamics): fluid flow
      ◦ FEA (Finite Element Analysis): stress, deformation
      ◦ Thermal analysis: heat transfer
  14. Example FEA: Stress and Displacement of a Bike Frame

    https://www.simscale.com/projects/jprobst/bike_frame_analysis_1/
  15. Example CFD: Airflow around Singapore

    https://www.simscale.com/projects/Milad_Mafi/airflow_around_singapore/
  16. Usage at SimScale - Workflows

    • A simulation consists of multiple steps
    • Currently a more or less static process:
      ◦ Preparation, run simulation (hours to days), persist results, generate visualization and artifacts for postprocessing
    • In the future, more flexibility is required:
      ◦ Started to look into Airflow in March 2018
      ◦ First workflow to generate postprocessing artifacts in production
      ◦ Workflow for a new simulation type in progress
  17. Usage at SimScale - Patterns

    • No periodically scheduled workflows; all are event-triggered
      ◦ Via AWS SQS, message contains the dag_id to trigger and all input parameters (pointers to immutable data like CAD files and simulation spec)
      ◦ One periodic job that polls SQS and triggers DAG runs
    • No actual processing, only orchestration
      ◦ Trigger computation at external system, use sensor to monitor progress
    • Some custom operators (JSON, context headers)
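The event-triggered pattern above can be sketched as a small dispatcher that turns each queue message into a DAG trigger. The slides do not show the message layout, so the JSON shape (`dag_id` plus `conf`), the function names, and the example bucket path are all assumptions; the `trigger` callback stands in for whatever actually creates the DAG run, and the SQS polling itself is left out.

```python
import json


def parse_trigger_message(body):
    """Split a JSON message body into (dag_id, conf).

    Assumed layout: {"dag_id": "...", "conf": {...}}
    """
    message = json.loads(body)
    dag_id = message["dag_id"]
    conf = message.get("conf", {})  # input parameters for the DAG run
    return dag_id, conf


def handle_messages(bodies, trigger):
    """Call trigger(dag_id, conf) for each message and return how many were handled."""
    count = 0
    for body in bodies:
        dag_id, conf = parse_trigger_message(body)
        trigger(dag_id, conf)
        count += 1
    return count


# Usage with a fake trigger that just records its arguments
triggered = []
body = json.dumps({
    "dag_id": "postprocessing",
    "conf": {"cad_file": "s3://example-bucket/model.step"},
})
handled = handle_messages([body], lambda dag_id, conf: triggered.append((dag_id, conf)))
print(handled, triggered)
```

Keeping the parsing separate from the trigger call makes the dispatcher testable without a queue or a running Airflow instance.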
  18. Usage at SimScale - Deployment

    • Docker image with Airflow, DAGs, custom code
    • Deployment to AWS ECS with 4 roles
      ◦ 1 webserver, 1 flower, 1 scheduler, N workers
      ◦ Open issue: graceful shutdown of workers
    • Configuration: airflow.cfg + environment variables + parameter store
    • Metadata database: AWS RDS (PostgreSQL)
    • Message broker: AWS ElastiCache Redis
    • Logs to S3 and ELK (ugly)
    • Basic monitoring: scheduler heartbeat, statsd -> prometheus
  19. Tips and Pitfalls

    • Be careful when updating existing/running DAGs
      ◦ Don’t rename DAGs and tasks
    • Use CeleryExecutor for production
      ◦ Or KubernetesExecutor with 1.10?
    • execution_date is part of primary key
    • A running sensor blocks a worker slot
    • Separate business logic from Airflow tasks (unit tests)
    • Make your DAGs and tasks idempotent and deterministic to allow retries
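The last two tips can be illustrated together. In this sketch the business logic is a plain function with no Airflow imports, its output location depends only on the run's identity, and it overwrites rather than appends, so a retry leaves the same result. The function names, the key format, and the dict standing in for external storage are all hypothetical.

```python
from datetime import datetime


def output_key(dag_id, execution_date):
    """Derive the output location only from the run's identity (deterministic)."""
    return f"{dag_id}/{execution_date:%Y-%m-%dT%H-%M-%S}/result.json"


def process(store, dag_id, execution_date, payload):
    """Plain business logic: overwrite, never append, so retries are safe."""
    key = output_key(dag_id, execution_date)
    store[key] = {"total": sum(payload)}
    return key


store = {}  # stand-in for external storage
run_date = datetime(2018, 7, 26)
first = process(store, "demo", run_date, [1, 2, 3])
second = process(store, "demo", run_date, [1, 2, 3])  # simulated retry
print(first == second, len(store))  # True 1
```

A function like `process` can be unit-tested on its own and then wrapped in a thin task, for example as the `python_callable` of a PythonOperator.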