Apache Airflow

Stefan Seelmann

July 26, 2018

Transcript

  1. Agenda
     • Introduction
     • Core concepts
     • Web UI Demo
     • Architecture and Deployment Options
     • Use Cases and Patterns
     • Usage at SimScale
     • Tips and Pitfalls

  2. Apache Airflow Facts
     • Created in 2014 by Maxime Beauchemin at Airbnb
     • Since 2016 at Apache (incubator), Apache License
     • 400+ contributors
     • Used by 150+ companies
     • Current version is 1.9; 1.10 is under vote
     • Documentation: https://airflow.apache.org/
     • Code: https://github.com/apache/incubator-airflow

  3. What is it?
     Airflow is a platform to programmatically author, schedule and monitor workflows*
     • A workflow is a DAG of tasks, defined as Python code
     • The scheduler executes tasks on workers
     • A Web UI to visualize, monitor, and troubleshoot runs
     * https://airflow.apache.org/index.html

  4. Concepts
     • DAG
       ◦ Collection of tasks with their relationships and dependencies
     • Task
       ◦ Instance of an operator
     • Operator and Sensor
       ◦ Python class that defines what to do
       ◦ PythonOperator, BashOperator
       ◦ SimpleHttpOperator, PostgresOperator, DockerOperator
       ◦ S3KeySensor, HdfsSensor

  5. Concepts
     from datetime import datetime
     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator
     from airflow.operators.python_operator import PythonOperator

     # default_args with a start_date, required for task instances to be scheduled
     default_args = {'owner': 'airflow', 'start_date': datetime(2018, 7, 1)}

     dag = DAG(dag_id='demo', default_args=default_args, schedule_interval=None)

     with dag:
         task1 = PythonOperator(task_id='demo_python_task',
                                python_callable=lambda: print("foo"))
         task2 = BashOperator(task_id='demo_bash_task',
                              bash_command='echo "bar"')
         task1 >> task2

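     Such a DAG file goes into the configured DAGs folder; with the Airflow 1.x CLI it can be checked with "airflow list_dags" (does the file parse and register the DAG?), and a single task can be exercised with "airflow test demo demo_bash_task 2018-07-26" without involving the scheduler.
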
  6. Concepts
     • DAG run and task instance
       ◦ Specific run of a DAG and of each of its tasks
       ◦ Identified by dag_id, task_id, and execution_date (see the sketch below)
       ◦ Persisted in the metadata database (tables dag_run and task_instance)
       ◦ Have a mutating state: queued, running, success, failed, skipped, ...
     • Many more:
       ◦ Conf, XCom, Variables, Hooks, SubDAGs, Templating, Trigger rules, SLAs, ...

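     As a minimal sketch of these concepts (continuing the demo DAG file from slide 5; the task names are made up): the task instance key and an XCom handed from one PythonOperator to the next.

        from airflow.operators.python_operator import PythonOperator

        def push(**context):
            ti = context['ti']
            # dag_id, task_id and execution_date identify this task instance
            print(ti.dag_id, ti.task_id, ti.execution_date)
            return 'some value'   # a PythonOperator return value is pushed to XCom

        def pull(**context):
            print(context['ti'].xcom_pull(task_ids='push_task'))

        with dag:   # the DAG defined on slide 5
            push_task = PythonOperator(task_id='push_task', python_callable=push,
                                       provide_context=True)
            pull_task = PythonOperator(task_id='pull_task', python_callable=pull,
                                       provide_context=True)
            push_task >> pull_task
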
  7. Web UI Demo
     • DAGs view
     • Tree view
     • Graph view
     • Gantt chart view
     • Logs

  8. Architecture
     • Scheduler: single instance, multiple threads
     • Webserver
     • Metadata database
     • DAG files
     • Configuration
     • Executor and workers
     https://www.slideshare.net/sumitmaheshwari007/apache-airflow

  9. Architecture - Executors
     • SequentialExecutor
       ◦ For local testing and debugging
     • LocalExecutor
       ◦ Coupled to the scheduler process, vertical scaling
     • CeleryExecutor (example configuration below)
       ◦ Separate worker nodes, complex setup, requires a message queue
     • MesosExecutor (contrib)
     • KubernetesExecutor (since 1.10)

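     To make the CeleryExecutor option concrete, a hypothetical airflow.cfg excerpt (host names and credentials are placeholders; note that 1.9 calls the last key celery_result_backend, while 1.10 renames it to result_backend):

        [core]
        executor = CeleryExecutor
        sql_alchemy_conn = postgresql+psycopg2://airflow:***@db-host/airflow

        [celery]
        broker_url = redis://redis-host:6379/0
        result_backend = db+postgresql://airflow:***@db-host/airflow
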
  10. Architecture - Limit resource usage
     • Global: dag_concurrency, max_active_runs_per_dag
     • Per DAG: max_active_runs, concurrency
     • Per task: task_concurrency, pool, queue (see the sketch below)

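     A sketch of where these knobs live: the global settings sit in the [core] section of airflow.cfg, the rest are keyword arguments on the DAG and on individual operators. The DAG id, command, pool and queue names below are made up, and default_args is reused from the slide 5 example.

        from airflow import DAG
        from airflow.operators.bash_operator import BashOperator

        dag = DAG(dag_id='limited_demo',
                  default_args=default_args,
                  schedule_interval='@daily',
                  max_active_runs=1,          # at most one concurrent run of this DAG
                  concurrency=4)              # at most 4 running task instances of this DAG

        heavy = BashOperator(task_id='heavy_task',
                             bash_command='echo "heavy job"',
                             task_concurrency=1,   # at most 1 instance across all runs
                             pool='heavy_jobs',    # pool must exist (Web UI or CLI)
                             queue='heavy',        # Celery queue, CeleryExecutor only
                             dag=dag)
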
  11. Use Cases and Patterns
     • Use cases:
       ◦ ETL, data pipelines, machine learning
       ◦ Arbitrary workflows
     • Scheduling
       ◦ Periodic: execution_date, catchup, start_date
       ◦ External trigger: schedule_interval=None + DAG run conf (see the sketch below)
     • Execution patterns
       ◦ Processing
       ◦ Orchestration (trigger work on an external system + sensor)

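     A minimal sketch of the external-trigger pattern, again continuing the demo DAG from slide 5 (schedule_interval=None); the conf key is illustrative.

        # Trigger a run from the Airflow 1.x CLI, passing parameters as DAG run conf:
        #   airflow trigger_dag demo -c '{"input_path": "s3://bucket/spec.json"}'

        from airflow.operators.python_operator import PythonOperator

        def process(**context):
            conf = context['dag_run'].conf or {}
            print("processing", conf.get('input_path'))

        process_task = PythonOperator(task_id='process', python_callable=process,
                                      provide_context=True, dag=dag)
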
  12. Security
     • “By default, all gates are opened.”
     • Web UI (example configuration below)
       ◦ Pluggable authentication (LDAP, Kerberos, OAuth)
       ◦ RBAC (since 1.10)
     • REST API (experimental)
       ◦ Deny or Kerberos
     • CLI

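     For illustration, a hypothetical airflow.cfg excerpt that closes some of these gates (option names as documented for Airflow 1.9/1.10):

        [webserver]
        authenticate = True
        # pluggable backends, e.g. password_auth, ldap_auth, kerberos_auth
        auth_backend = airflow.contrib.auth.backends.ldap_auth
        # role-based access control UI, available since 1.10
        rbac = True

        [api]
        # experimental REST API: deny everything, or use kerberos_auth
        auth_backend = airflow.api.auth.backend.deny_all
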
  13. Usage at SimScale - About SimScale
     • Simulation platform for engineers
     • Supported analysis types:
       ◦ CFD (Computational Fluid Dynamics): fluid flow
       ◦ FEA (Finite Element Analysis): stress, deformation
       ◦ Thermal analysis: heat transfer

  14. Example FEA: Stress and Displacement of a Bike Frame
     https://www.simscale.com/projects/jprobst/bike_frame_analysis_1/

  15. Example CFD: Airflow around Singapore
     https://www.simscale.com/projects/Milad_Mafi/airflow_around_singapore/

  16. Usage at SimScale - Workflows
     • Simulations consist of multiple steps
     • Currently a more or less static process:
       ◦ Preparation, run simulation (hours to days), persist results, generate visualization and artifacts for postprocessing
     • In the future more flexibility is required:
       ◦ Started looking into Airflow in March 2018
       ◦ First workflow (generating postprocessing artifacts) is in production
       ◦ Workflow for a new simulation type is in progress

  17. Usage at SimScale - Patterns
     • No periodically scheduled workflows, but event-triggered ones
       ◦ Via AWS SQS; the message contains the dag_id to trigger and all input parameters (pointers to immutable data such as CAD files and the simulation spec)
       ◦ One periodic job polls SQS and triggers DAG runs
     • No actual processing, but orchestration (see the sketch below)
       ◦ Trigger the computation on an external system, use a sensor to monitor progress
     • Some custom operators (JSON, context headers)

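     A sketch of that orchestration pattern (not SimScale code; the status function is a placeholder): the DAG triggers the computation on the external system, then a custom sensor polls until it reports completion.

        from airflow.operators.sensors import BaseSensorOperator  # airflow.sensors.base_sensor_operator in 1.10
        from airflow.utils.decorators import apply_defaults

        def get_computation_status(computation_id):
            """Placeholder: ask the external system's API for the job status."""
            return 'FINISHED'

        class ExternalComputationSensor(BaseSensorOperator):
            """Waits until the external system reports the computation as finished."""

            @apply_defaults
            def __init__(self, computation_id, *args, **kwargs):
                super(ExternalComputationSensor, self).__init__(*args, **kwargs)
                self.computation_id = computation_id

            def poke(self, context):
                # Called every poke_interval until it returns True or the timeout hits
                return get_computation_status(self.computation_id) == 'FINISHED'

     As slide 19 notes, such a sensor occupies a worker slot for as long as it is poking.
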
  18. Usage at SimScale - Deployment
     • Docker image with Airflow, DAGs, and custom code
     • Deployment to AWS ECS with 4 roles
       ◦ 1 webserver, 1 flower, 1 scheduler, N workers
       ◦ Open issue: graceful shutdown of workers
     • Configuration: airflow.cfg + environment variables + parameter store (see the sketch below)
     • Metadata database: AWS RDS (PostgreSQL)
     • Message broker: AWS ElastiCache Redis
     • Logs to S3 and ELK (ugly)
     • Basic monitoring: scheduler heartbeat, statsd -> Prometheus

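     As an illustration of the configuration part: Airflow reads overrides from environment variables named AIRFLOW__<SECTION>__<KEY>, so a container environment might set values like the placeholders below (not the actual SimScale setup):

        export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
        export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:***@rds-host/airflow
        export AIRFLOW__CELERY__BROKER_URL=redis://elasticache-host:6379/0
        export AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://example-bucket/airflow-logs
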
  19. Tips and Pitfalls
     • Be careful when updating existing/running DAGs
       ◦ Don’t rename DAGs and tasks
     • Use the CeleryExecutor for production
       ◦ Or the KubernetesExecutor with 1.10?
     • execution_date is part of the primary key
     • A running sensor blocks a worker slot
     • Separate business logic from Airflow tasks to enable unit tests (see the sketch below)
     • Make your DAGs and tasks idempotent and deterministic to allow retries
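
     A small sketch of the last two tips (names are illustrative): keep the business logic in a plain, deterministic, idempotent function, wrap it in a thin Airflow callable, and unit-test the logic without Airflow.

        def generate_artifacts(spec_url, output_prefix):
            """Pure business logic: same inputs give the same outputs, and a retry
            overwrites the same artifact path instead of creating duplicates."""
            return '%s/report.html' % output_prefix

        def generate_artifacts_callable(**context):
            # Thin Airflow wrapper: only unpacks the DAG run conf
            conf = context['dag_run'].conf
            return generate_artifacts(conf['spec_url'], conf['output_prefix'])

        # Plain unit test, no Airflow required
        def test_generate_artifacts():
            assert generate_artifacts('s3://b/spec.json', 's3://b/out') == 's3://b/out/report.html'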