Upcoming features in Airflow 2

Kaxil Naik
August 04, 2020

Speaker: Kaxil Naik
Conference: PiterPy
Date: 4 August 2020
Title: What's upcoming in Airflow 2?
Features Covered:
- Scheduler HA
- DAG Serialization
- DAG Versioning
- Stable REST API
- Official Docker Image & Helm Chart
- Providers Packages
- and other notable changes

Transcript

  1. What’s upcoming in Airflow 2?
     PiterPy, 4 August 2020
     Kaxil Naik, Senior Data Engineer @ Astronomer
     Twitter: @kaxil
  2. Who am I?
     • Airflow Committer, PMC Member & Release Manager
     • Senior Data Engineer @ Astronomer
       ◦ Part of the Airflow team
       ◦ Work full-time on Airflow
     • Previously worked at DataReply
     • Masters in Data Science & Analytics from Royal Holloway, University of London
     • Twitter: https://twitter.com/kaxil
     • Github: https://github.com/kaxil/
     • LinkedIn: https://www.linkedin.com/in/kaxil/
  3. About Astronomer
     Founded in 2018, Astronomer is the commercial developer of Apache Airflow, the open-source standard for data orchestration.
     • 100+ Enterprise customers around the world
     • Locations: San Francisco, London, New York, Cincinnati, Hyderabad
     • Investors: [logos on slide]
     • 3 of top 10 Airflow committers are Astronomer advisors or employees
     • Mission: To help organizations adopt new standards for data orchestration.
  4. Airflow 2 - Highlights
     • Scheduler High-Availability
     • DAG Serialization
     • DAG Versioning
     • Stable REST API
     • Functional DAGs
     • Official Docker Image & Helm Chart
     • Providers Packages
     • and other notable changes ...
     http://gph.is/1VBGIPv
  5. Scheduler High Availability
     Goals:
     • Performance - reduce task-to-task schedule "lag"
     • Scalability - increase task throughput by horizontal scaling
     • Resiliency - kill a scheduler and have tasks continue to be scheduled
  6. Scheduler High Availability - Design
     • Active-active model. Each scheduler does everything
     • Uses existing database - no new components needed, no extra operational burden
     • Plan to use row-level locks in the DB (SELECT … FOR UPDATE)
     • Will re-evaluate if performance/stress testing shows the need
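The active-active idea can be illustrated without a database. The sketch below is only an in-memory analogy, not Airflow's scheduler code: each hypothetical "row" has its own `threading.Lock`, standing in for the row-level lock that `SELECT … FOR UPDATE` would take, and three identical scheduler threads walk over all rows. The names `rows`, `row_locks`, `queue_events` and `scheduler_loop` are invented for illustration.

```python
import threading

# In-memory stand-in for database rows. Each row gets its own lock,
# playing the role of a row-level lock taken by SELECT ... FOR UPDATE.
rows = {i: {"state": "scheduled", "queued_by": None} for i in range(20)}
row_locks = {i: threading.Lock() for i in rows}
queue_events = []  # one entry per state transition, to detect double-queuing


def scheduler_loop(name):
    """Every scheduler runs the same loop over every row (active-active)."""
    for row_id, row in rows.items():
        with row_locks[row_id]:  # blocks while another scheduler holds this row
            if row["state"] == "scheduled":
                row["state"] = "queued"
                row["queued_by"] = name
                queue_events.append((row_id, name))


schedulers = [
    threading.Thread(target=scheduler_loop, args=(f"scheduler-{n}",))
    for n in range(3)
]
for thread in schedulers:
    thread.start()
for thread in schedulers:
    thread.join()

print(f"{len(queue_events)} rows queued, each exactly once")
```

Because the state check and the update happen while the row's lock is held, no row is queued twice, whichever scheduler reaches it first. A real database would provide the same guarantee through the transaction that holds the row lock.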
  7. Measuring Performance
     The key performance metric is defined as "scheduler lag":
     • Amount of "wasted" time not running tasks
     • ti.start_date - max(t.end_date for t in upstream_tis)
     • Zero is the goal (we'll never get to 0.)
     • Tasks are "echo true" -- tiny but still executing
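As a worked example of the lag formula, here is a minimal sketch that reads the first term as the moment a task instance starts running. `TaskRun` is a hypothetical stand-in for a task instance record, and the timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaskRun:
    """Hypothetical stand-in for a task instance with start/end timestamps."""
    task_id: str
    start_date: datetime
    end_date: datetime


def scheduler_lag(ti, upstream_tis):
    """Time between the last upstream task finishing and this task starting."""
    return ti.start_date - max(t.end_date for t in upstream_tis)


t0 = datetime(2020, 8, 4, 12, 0, 0)
extract_a = TaskRun("extract_a", t0, t0 + timedelta(seconds=1))
extract_b = TaskRun("extract_b", t0, t0 + timedelta(seconds=3))
load = TaskRun("load", t0 + timedelta(seconds=5), t0 + timedelta(seconds=6))

lag = scheduler_lag(load, [extract_a, extract_b])
print(lag.total_seconds())  # 2.0
```

Here the slowest upstream task ends at t0 + 3s and `load` starts at t0 + 5s, so two seconds were spent waiting on the scheduler rather than running tasks.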
  8. Preliminary performance results
     Case: 100 DAG files | 1 DAG per file | 10 Tasks per DAG | 1 run per DAG
     Workers: 4 | Parallelism: 64
     • 1.10.10: 54.17s (σ19.38), Total runtime: 22m22s
     • HA branch - 1 scheduler: 4.39s (σ1.40), Total runtime: 1m10s
     • HA branch - 3 schedulers: 1.96s (σ0.51), Total runtime: 48s
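To put the numbers above in relative terms, the following short calculation derives the improvement factors over 1.10.10. It uses only the figures from the slide; the dictionary layout and helper name are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a runtime given as minutes and seconds into seconds."""
    return minutes * 60 + seconds


# Figures copied from the slide: mean lag in seconds and total runtime.
results = {
    "1.10.10": {"lag_s": 54.17, "runtime_s": to_seconds(22, 22)},
    "HA branch - 1 scheduler": {"lag_s": 4.39, "runtime_s": to_seconds(1, 10)},
    "HA branch - 3 schedulers": {"lag_s": 1.96, "runtime_s": to_seconds(0, 48)},
}

baseline = results["1.10.10"]
speedups = {
    name: {
        "lag": baseline["lag_s"] / r["lag_s"],
        "runtime": baseline["runtime_s"] / r["runtime_s"],
    }
    for name, r in results.items()
    if name != "1.10.10"
}

for name, s in speedups.items():
    print(f"{name}: lag {s['lag']:.1f}x lower, runtime {s['runtime']:.1f}x shorter")
```

With one scheduler the lag is roughly 12x lower and the total runtime about 19x shorter than on 1.10.10; with three schedulers both factors are about 28x.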
  9. DAG Serialization (Tasks Completed)
     • Stateless Webserver: Scheduler parses the DAG files, serializes them in JSON format & saves them in the Metadata DB.
     • Lazy Loading of DAGs: Instead of loading an entire DagBag when the Webserver starts, we only load each DAG on demand. This reduces Webserver startup time and memory use, and the reduction is most notable with a large number of DAGs.
     • Deploying new DAGs to Airflow no longer requires long restarts of the Webserver (if DAGs are baked into the Docker image)
     • Option to use the “JSON” library of your choice for serialization (default is the built-in ‘json’ library)
     • Paves the way for DAG Versioning & Scheduler HA
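The split between a writer that serializes DAGs and a reader that loads them on demand can be sketched with the standard library alone. This is a simplified illustration, not Airflow's implementation: the table layout, the dictionary shape of a "DAG", and the names `write_dag` and `LazyDagStore` are all assumptions made for the example.

```python
import json
import sqlite3

# In-memory database standing in for the metadata DB (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")


def write_dag(dag):
    """Scheduler side: serialize a parsed DAG to JSON and store it."""
    db.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag["dag_id"], json.dumps(dag, sort_keys=True)),
    )
    db.commit()


class LazyDagStore:
    """Webserver side: load a DAG from the DB only when it is first requested."""

    def __init__(self):
        self._cache = {}

    def get(self, dag_id):
        if dag_id not in self._cache:
            row = db.execute(
                "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
            ).fetchone()
            if row is None:
                raise KeyError(dag_id)
            self._cache[dag_id] = json.loads(row[0])
        return self._cache[dag_id]


write_dag({"dag_id": "etl", "tasks": ["extract", "transform", "load"]})
write_dag({"dag_id": "report", "tasks": ["query", "email"]})

store = LazyDagStore()          # starts empty: nothing is loaded up front
print(store.get("etl")["tasks"])
print(sorted(store._cache))     # only the DAG that was requested is in memory
```

The reader never touches DAG files, which is what lets the Webserver stay stateless; it only deserializes the JSON it is asked for.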
  10. DAG Serialization (Tasks In-Progress for Airflow 2.0)
      • Decouple DAG Parsing and Serializing from the scheduling loop.
      • Scheduler will fetch DAGs from the DB
      • DAGs will be parsed, serialized and saved to the DB by a separate component: “Serializer” / “Dag Parser”
      • This should reduce the delay in scheduling tasks when the number of DAGs is large
  11. DAG Versioning
      Current Problem:
      • A change in DAG structure also affects how previous DagRuns are displayed
      • Not possible to view the code or DAG shape associated with an old DagRun
      • Checking logs of a deleted task in the UI is not straightforward
  12. DAG Versioning (Current Problem)
      A new task is shown in Graph View for older DAG Runs too, with “no status”.
  13. DAG Versioning (Proposed Solution - Version Badge)
      • DAG Version Badge:
        ◦ Hash of Serialized DAG
        ◦ Used to identify a DAG Version
        ◦ Links to the associated code in Code View.
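A version identifier derived from the serialized DAG could look like the sketch below. The choice of SHA-256, the key sorting, and the function name `dag_version` are assumptions for illustration; the slide only says the badge is a hash of the serialized DAG.

```python
import hashlib
import json


def dag_version(serialized_dag):
    """Hash a serialized DAG; sorting keys makes the hash independent of key order."""
    payload = json.dumps(serialized_dag, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


v1 = dag_version({"dag_id": "etl", "tasks": ["extract", "load"]})
same = dag_version({"tasks": ["extract", "load"], "dag_id": "etl"})
v2 = dag_version({"dag_id": "etl", "tasks": ["extract", "transform", "load"]})

print(v1 == same)  # True: same structure, same version
print(v1 == v2)    # False: adding a task changes the version
```

Any structural change to the DAG produces a different hash, so the value can serve as a key for looking up the code and graph shape that belonged to a particular DagRun.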
  14. DAG Versioning (Proposed Solution)
      • Graph View shows the actual version that was run.
      • Mixed-Version: If a DAG was changed mid-way, an internal version is calculated that shows the actual version that was run.
      Two different Versions (Mixed Version)
  15. DAG Versioning
      Goal:
      • Scope is limited to making sure the visibility behaviour of Airflow is correct
        ◦ No change in the execution behaviour
        ◦ Execution will continue to be based on the most recent version of the DAG
      • Support for storing multiple versions of Serialized DAGs
      • All the views show the correct DAG associated with that DagRun
      • Baked-In Maintenance DAGs
        ◦ Cleanup old DagRuns & associated Serialized DAGs
  16. Stable REST API
      • Built using Connexion
        ◦ OpenAPI compliant
        ◦ Spec-first approach
        ◦ Maps endpoints to Python functions
      • Migration guide from the old experimental API to the new Stable REST API
      • Integration with the current Flask-AppBuilder permission model
        ◦ Experimental API didn’t have a permission model, it was all or nothing
  17. Clients in multiple languages for the API
      • Using OpenAPI Generator we can generate clients for multiple languages.
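Besides generated clients, an OpenAPI-described HTTP API can be called with any plain HTTP client. The sketch below only builds a request and does not send it. Everything specific in it is an assumption not stated in the talk: the host and port, the `/api/v1/dags` path and `limit` parameter as they appear in the released Airflow 2 API, and basic authentication being enabled on the server.

```python
import base64
import urllib.request

# Hypothetical local deployment; adjust to your own Airflow webserver.
BASE_URL = "http://localhost:8080/api/v1"


def build_list_dags_request(username, password, limit=10):
    """Build (but do not send) a GET request that lists DAGs."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{BASE_URL}/dags?limit={limit}",
        headers={
            "Authorization": f"Basic {token}",  # assumes basic auth is configured
            "Accept": "application/json",
        },
        method="GET",
    )


request = build_list_dags_request("admin", "admin")
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would require a running webserver with matching credentials; a generated client wraps the same kind of call behind typed methods.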
  18. Functional DAGs
      Current Problem:
      • No explicit way of passing data between tasks
        ◦ XComs exist but are hidden in task execution
        ◦ XCom isn’t intuitive for beginners
      • Need for boilerplate code for:
        ◦ Task dependencies
        ◦ Using Python functions in PythonOperator
      • Can’t pass large data between tasks with XCom
  19. Example DAG - the normal way
      • “load” uses the data from the “transform” task, but you still need to define an explicit dependency between them
      • Too much boilerplate code to read data from the previous task using XCom
  20. Example DAG - the functional way
      • Simple and readable code to get the data (without knowledge of XCom and Jinja)
      • Converts a function into an Operator and passes arguments in a more Pythonic way
      • Also assigns the task to the DAG
      • No need to explicitly define task dependencies
  21. Functional DAGs
      • Automatically convert functions into tasks using decorators:
        ◦ @airflow.decorators.task
        ◦ @dag.task
      • Task ids are automatically generated
      • Pluggable XCom Storage Engine:
        ◦ Store and retrieve data in GCS, S3, etc
      • Simplifies writing DAG code and increases readability
      • Task dependencies are automatically set when a task uses data from another task.
      • Backwards compatible
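The mechanism behind "dependencies are set automatically when a task uses another task's data" can be shown with a toy decorator. This is deliberately not Airflow code: `ToyDag` and `Result` are invented for illustration, and unlike a real scheduler this toy runs each function immediately instead of deferring execution.

```python
import functools


class Result:
    """Value returned by a toy task, tagged with the task that produced it."""

    def __init__(self, task_id, value):
        self.task_id = task_id
        self.value = value


class ToyDag:
    """Toy DAG that infers task dependencies from how results are passed."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.upstream = {}  # task_id -> set of upstream task_ids

    def task(self, func):
        task_id = func.__name__  # task id generated from the function name
        self.upstream[task_id] = set()

        @functools.wraps(func)
        def call(*args):
            # An argument that is a Result came from another task: record the edge.
            for arg in args:
                if isinstance(arg, Result):
                    self.upstream[task_id].add(arg.task_id)
            values = [a.value if isinstance(a, Result) else a for a in args]
            return Result(task_id, func(*values))

        return call


dag = ToyDag("etl")


@dag.task
def extract():
    return [1, 2, 3]


@dag.task
def transform(values):
    return sum(values)


@dag.task
def load(total):
    return f"loaded {total}"


result = load(transform(extract()))
print(result.value)   # loaded 6
print(dag.upstream)
```

No dependency is declared anywhere, yet `dag.upstream` ends up recording that `transform` depends on `extract` and `load` on `transform`, purely from the way the function calls are nested.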
  22. Official Container Image
      • An alpha-quality image has been available since Airflow 1.10.10
      • A beta-quality community image has been available since Airflow 1.10.11
      • Available at https://hub.docker.com/r/apache/airflow
        ◦ Run docker pull apache/airflow to pull the latest image
  23. Official Container Image
      • Supported Python Versions: 2.7, 3.5, 3.6, 3.7, 3.8
      • Size: ~ 210 MB (compressed size)
      • Uses Python slim-buster images as the base image by default
  24. Image Customization options
      • Choose Base image (python)
      • Install Airflow from PyPI
      • Install from GitHub branch/tag
      • Install additional extras, python deps, apt dev/runtime deps
      • Choose different UID/GID
      • Choose different AIRFLOW_HOME
      • Choose different HOME dir
  25. Helm Chart
      • Donated by Astronomer
        ◦ Battle-tested & used by hundreds of production deployments run by Astronomer
      • Uses official Airflow Docker Image
      • Currently used in Airflow CI to run Kubernetes tests
      • Check https://github.com/apache/airflow/tree/master/chart for details
  26. Helm Chart
      • Supports KEDA (Kubernetes-based Event Driven Autoscaling)
      • Supports Sequential, Local, Celery and Kubernetes Executors
      • Several options to mount DAGs:
        ◦ using a git-sync sidecar without persistence
        ◦ from a PVC
      • Liveness Probe restarts the Scheduler on heartbeat failure
  27. Providers Package
      • As part of AIP-21 all the contents (hooks/operators/sensors) from airflow/contrib directory were grouped into different providers and moved to airflow/providers packages.
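For DAG authors the move shows up as changed import paths. The sketch below rewrites a few old `airflow.contrib` imports to their provider locations. The three mappings are examples as they appear in released Airflow 2, not a complete or authoritative list, and `MOVED_MODULES` and `rewrite_import` are hypothetical helper names.

```python
# A few example module moves from airflow/contrib to airflow/providers.
# Illustrative subset only; consult the providers documentation for the rest.
MOVED_MODULES = {
    "airflow.contrib.hooks.ssh_hook": "airflow.providers.ssh.hooks.ssh",
    "airflow.contrib.operators.ssh_operator": "airflow.providers.ssh.operators.ssh",
    "airflow.contrib.operators.slack_webhook_operator": (
        "airflow.providers.slack.operators.slack_webhook"
    ),
}


def rewrite_import(line):
    """Replace an old contrib module path in an import line, if one is known."""
    for old, new in MOVED_MODULES.items():
        if old in line:
            return line.replace(old, new)
    return line  # unknown or already-updated imports are left untouched


old_line = "from airflow.contrib.hooks.ssh_hook import SSHHook"
print(rewrite_import(old_line))
# from airflow.providers.ssh.hooks.ssh import SSHHook
```

The class names in these examples stay the same and only the module path changes, which is why a simple path substitution is enough here; some other classes were also renamed during the move and need manual attention.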
  28. Backport Providers Package
      • Bring Airflow 2.0 providers to 1.10.*
      • Packages per-provider
      • 58 packages
      • Python 3.6+ only
      • Automatically tested on CI
      • Release cadence separate from Airflow
  29. Other notable changes
      • Python 3 only
        ✔ Python 2.7 unsupported upstream since Jan 1, 2020
      • "RBAC" UI is now the only UI
        ✔ Was a config option before, now the only option. Charts/data profiling removed due to security risks
      • Improvements to SubDags
        ✔ Previously ran as a backfill job and didn’t always respect Pools / Concurrency limits
      • CLI Refactor
        ✔ Commands are now grouped functionally
  30. How to upgrade to 2.0 safely
      • Install the latest 1.10 release
      • Run airflow upgrade-check (doesn't exist yet, see #8765)
      • Fix any warnings
      • Upgrade Airflow
  31. Links
      • Airflow Summit Talks:
        ◦ Keynote: Future of Airflow by Kaxil, Ash, Jarek, Kamil, Tomek & Daniel
        ◦ AIP-31: Airflow functional DAG definition by Gerard Casas Saez
        ◦ Production Docker image for Apache Airflow by Jarek Potiuk
      • AIPs (Airflow Improvement Proposals):
        ◦ AIP-15 Support Multiple-Schedulers for HA & Better Scheduling Performance
        ◦ AIP-21 Changes in import paths
        ◦ AIP-24 DAG Serialization
        ◦ AIP-32 Airflow REST API
        ◦ AIP-36 DAG Versioning
  32. Links
      • Airflow
        ◦ Repo: https://github.com/apache/airflow
        ◦ Website: https://airflow.apache.org/
        ◦ Blog: https://airflow.apache.org/blog/
        ◦ Documentation: https://airflow.apache.org/docs/stable/
        ◦ Slack: https://s.apache.org/airflow-slack
        ◦ Twitter: https://twitter.com/apacheairflow
      • Contact Me:
        ◦ Twitter: https://twitter.com/kaxil
        ◦ Github: https://github.com/kaxil/
        ◦ LinkedIn: https://www.linkedin.com/in/kaxil/