What’s upcoming in Airflow 2? PiterPy 4 August 2020 Kaxil Naik Senior Data Engineer @ Astronomer Twitter: @kaxil

Who am I? ● Airflow Committer, PMC Member & Release Manager ● Senior Data Engineer @ Astronomer ○ Part of the Airflow team ○ Work full-time on Airflow ● Previously worked at DataReply ● Masters in Data Science & Analytics from Royal Holloway, University of London ● Twitter: @kaxil

About Astronomer Founded in 2018, Astronomer is the commercial developer of Apache Airflow, the open-source standard for data orchestration. 100+ Enterprise customers around the world Locations San Francisco London New York Cincinnati Hyderabad Investors 3 of top 10 Airflow committers are Astronomer advisors or employees Mission To help organizations adopt new standards for data orchestration.

Airflow 2 - Highlights ● Scheduler High-Availability ● DAG Serialization ● DAG Versioning ● Stable REST API ● Functional DAGs ● Official Docker Image & Helm Chart ● Providers Packages ● and other notable changes ...

High Availability

Scheduler High Availability Goals: ● Performance - reduce task-to-task schedule "lag" ● Scalability - increase task throughput by horizontal scaling ● Resiliency - kill a scheduler and have tasks continue to be scheduled

Scheduler High Availability - Design ● Active-active model. Each scheduler does everything ● Uses existing database - no new components needed, no extra operational burden ● Plan to use row-level-locks in the DB (SELECT … FOR UPDATE) ● Will re-evaluate if performance/stress testing show the need
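
The active-active idea above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Airflow's scheduler code: the `Row` class, the `scheduler_loop` function and the in-memory list are hypothetical stand-ins for task-instance rows in the metadata DB, and a per-row `threading.Lock` plays the role that `SELECT … FOR UPDATE` would play in a real database.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Row:
    """Hypothetical stand-in for one task-instance row in the metadata DB."""
    task_id: str
    state: str = "scheduled"
    claims: int = 0  # how many schedulers moved this row to "queued"
    lock: threading.Lock = field(default_factory=threading.Lock)


def scheduler_loop(rows):
    """Every scheduler runs the same loop over the same rows (active-active)."""
    for row in rows:
        # Holding the row lock mimics SELECT ... FOR UPDATE on that row:
        # other schedulers wait, then see the updated state and move on.
        with row.lock:
            if row.state == "scheduled":
                row.state = "queued"
                row.claims += 1


def run_schedulers(n_schedulers=3, n_tasks=50):
    rows = [Row(task_id=f"task_{i}") for i in range(n_tasks)]
    threads = [
        threading.Thread(target=scheduler_loop, args=(rows,))
        for _ in range(n_schedulers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return rows


rows = run_schedulers()
print(all(row.claims == 1 for row in rows))
```

Because the state check and the update happen while the row is locked, each row is queued once no matter how many schedulers run the loop. That is the property the row-level locks are meant to provide.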

Example HA configuration

Measuring Performance The key performance metric is what we define as "scheduler lag": ● Amount of "wasted" time not running tasks ● ti.start_date - max(t.end_date for t in upstream_tis) ● Zero is the goal (though we'll never actually reach 0) ● Tasks are "echo true" -- tiny but still executing
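
As a worked example of the lag formula, here is a minimal sketch. It assumes the timestamp on the left of the formula is the task instance's start time; the `TI` dataclass is a hypothetical stand-in for Airflow's task instance, reduced to the fields the formula needs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TI:
    """Hypothetical, minimal task instance: just the fields the formula uses."""
    task_id: str
    start_date: datetime
    end_date: datetime


def scheduler_lag(ti, upstream_tis):
    """Time between the last upstream task finishing and this task starting."""
    latest_upstream_end = max(t.end_date for t in upstream_tis)
    return ti.start_date - latest_upstream_end


t0 = datetime(2020, 8, 4, 12, 0, 0)
extract = TI("extract", t0, t0 + timedelta(seconds=2))
transform = TI("transform", t0, t0 + timedelta(seconds=5))
# "load" could have started at t0 + 5s but only started at t0 + 9s.
load = TI("load", t0 + timedelta(seconds=9), t0 + timedelta(seconds=10))

lag = scheduler_lag(load, [extract, transform])
print(lag.total_seconds())  # 4.0
```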

Preliminary performance results Case: 100 DAG files | 1 DAG per file | 10 Tasks per DAG | 1 run per DAG Workers: 4 | Parallelism: 64 ● 1.10.10: 54.17s (σ19.38) | Total runtime: 22m22s ● HA branch - 1 scheduler: 4.39s (σ1.40) | Total runtime: 1m10s ● HA branch - 3 schedulers: 1.96s (σ0.51) | Total runtime: 48s

DAG Serialization

Serialized DAG Representation

Dag Serialization (Tasks Completed) ● Stateless Webserver: Scheduler parses the DAG files, serializes them in JSON format & saves them in the Metadata DB. ● Lazy Loading of DAGs: Instead of loading an entire DagBag when the Webserver starts, we only load each DAG on demand. This helps reduce Webserver startup time and memory. The reduction in time is most notable with a large number of DAGs. ● Deploying new DAGs to Airflow no longer requires long restarts of the webserver (if DAGs are baked into the Docker image) ● Feature to use the “JSON” library of choice for Serialization (default is the inbuilt ‘json’ library) ● Paves the way for DAG Versioning & Scheduler HA
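
The serialize-then-lazy-load flow above can be sketched in a few lines. This is an illustration only: the dictionary layout, `serialize_dag` and `LazyDagStore` are hypothetical and much simpler than Airflow's real serialized format, and a plain dict stands in for the metadata DB table.

```python
import json


def serialize_dag(dag_id, tasks):
    """Scheduler side: turn a parsed DAG into a JSON string for the DB.

    `tasks` maps each task id to the list of its downstream task ids.
    """
    return json.dumps({"dag_id": dag_id, "tasks": tasks}, sort_keys=True)


class LazyDagStore:
    """Webserver side: deserialize a DAG only when it is first requested."""

    def __init__(self, serialized_rows):
        self._rows = serialized_rows  # stands in for the metadata DB table
        self._cache = {}

    def get_dag(self, dag_id):
        if dag_id not in self._cache:
            self._cache[dag_id] = json.loads(self._rows[dag_id])
        return self._cache[dag_id]

    @property
    def loaded(self):
        return sorted(self._cache)


db = {
    "etl": serialize_dag("etl", {"extract": ["transform"], "transform": []}),
    "report": serialize_dag("report", {"build": []}),
}
store = LazyDagStore(db)
loaded_at_startup = store.loaded   # nothing is loaded when the store is created
etl = store.get_dag("etl")         # only this DAG is deserialized
print(loaded_at_startup, store.loaded)
```

The webserver-side class never touches DAG files or user code; it only reads JSON, which is what makes it stateless with respect to the DAG folder.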

Dag Serialization (Tasks In-Progress for Airflow 2.0) ● Decouple DAG Parsing and Serializing from the scheduling loop. ● Scheduler will fetch DAGs from DB ● DAG will be parsed, serialized and saved to DB by a separate component “Serializer”/ “Dag Parser” ● This should reduce the delay in Scheduling tasks when the number of DAGs is large

DAG Versioning

Dag Versioning Current Problem: ● Change in DAG structure affects viewing previous DagRuns too ● Not possible to view the code or DAG Shape associated with an old DagRun ● Checking logs of a deleted task in the UI is not straight-forward

Dag Versioning (Current Problem) New task is shown in Graph View for older DAG Runs too with “no status”.

Dag Versioning (Proposed Solution - Version Badge) ● DAG Version Badge: ○ Hash of Serialized DAG ○ Use to identify a DAG Version ○ Links to associated Code in Code View.
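
A version badge derived from a hash of the serialized DAG could look roughly like the sketch below. The choice of SHA-256, the 8-character truncation and the `dag_version` helper are assumptions made for illustration; the slides only say the badge is a hash of the serialized DAG.

```python
import hashlib
import json


def dag_version(serialized_dag):
    """Return a short, stable identifier for one serialized DAG structure."""
    # Canonical JSON (sorted keys, fixed separators) so that the same
    # structure always produces the same hash.
    canonical = json.dumps(serialized_dag, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


v1 = dag_version({"dag_id": "etl", "tasks": ["extract", "load"]})
same_structure = dag_version({"tasks": ["extract", "load"], "dag_id": "etl"})
new_task_added = dag_version({"dag_id": "etl", "tasks": ["extract", "transform", "load"]})

print(v1 == same_structure, v1 == new_task_added)
```

Key order does not affect the badge, while adding a task produces a different one, so a DagRun can be tied to the structure it actually ran with.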

Dag Versioning (Proposed Solution) ● Graph View shows the actual version that was run. ● Mixed-Version: If a DAG was changed mid-way, an internal version is calculated that shows the actual version that was run. Two different Versions (Mixed Version)

Dag Versioning Goal: ● Scope is limited to making sure the visibility behaviour of Airflow is correct ○ No change in the execution behaviour ○ Execution will continue to be based on the most recent version of the DAG ● Support for storing multiple versions of Serialized DAGs ● All the views show the correct DAG associated with that DagRun ● Baked-In Maintenance DAGs ○ Cleanup old DagRuns & associated Serialized DAGs

API: follows the OpenAPI 3.0 specification

Stable REST API ● Built using Connexion ○ OpenAPI compliant ○ Spec-first approach ○ Maps endpoints to Python functions ● Migration guide from the old experimental API to the new Stable REST API ● Integration with the current Flask-Appbuilder Permission model ○ Experimental API didn’t have a permission model, it was all or nothing
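
To make "spec-first" and "maps endpoints to Python functions" concrete, here is a toy dispatcher. It is not Connexion's implementation or Airflow's actual spec: the `SPEC` fragment, the handler functions, the `HANDLERS` registry and the sample data are all hypothetical. The point is only that the spec is written first and each `operationId` in it names the function that serves the endpoint.

```python
# A tiny, hypothetical OpenAPI-shaped spec: paths -> methods -> operationId.
SPEC = {
    "openapi": "3.0.3",
    "paths": {
        "/dags": {"get": {"operationId": "get_dags"}},
        "/dags/{dag_id}": {"get": {"operationId": "get_dag"}},
    },
}

DAGS = {"example_etl": {"dag_id": "example_etl", "is_paused": False}}


def get_dags():
    return {"dags": list(DAGS.values()), "total_entries": len(DAGS)}


def get_dag(dag_id):
    return DAGS[dag_id]


HANDLERS = {"get_dags": get_dags, "get_dag": get_dag}


def match(template, path):
    """Return path parameters if `path` fits `template`, else None."""
    template_parts = template.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(template_parts) != len(path_parts):
        return None
    params = {}
    for expected, actual in zip(template_parts, path_parts):
        if expected.startswith("{") and expected.endswith("}"):
            params[expected[1:-1]] = actual
        elif expected != actual:
            return None
    return params


def dispatch(method, path):
    """Look the request up in the spec and call the function it names."""
    for template, operations in SPEC["paths"].items():
        params = match(template, path)
        if params is not None and method in operations:
            handler = HANDLERS[operations[method]["operationId"]]
            return handler(**params)
    raise LookupError(f"{method.upper()} {path} is not in the spec")


print(dispatch("get", "/dags/example_etl"))
```

Because the spec is the source of truth, the same file can also drive request validation, documentation and client generation.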

Clients in multiple languages for the API ● Using OpenAPI Generator we can generate clients for multiple languages.

Functional DAGs

Functional DAGs Current Problem: ● No explicit way of passing data between tasks ○ XCom exists but is hidden in task execution ○ XCom isn’t intuitive for beginners ● Boilerplate code is needed for: ○ Task dependencies ○ Using a Python function in PythonOperator ● Can’t pass large data between tasks with XCom

Example DAG - the normal way ● “load” uses the data from the “transform” task, but you still need to define an explicit dependency between them ● Too much boilerplate code to read data from the previous task using XCom

Example DAG - the functional way ● Simple and readable code to get the data (without knowledge of XCom and Jinja) ● Converts a function into an Operator and passes arguments in a more Pythonic way ● Also assigns the task to the DAG ● No need to explicitly define Task Dependencies

Functional DAGs ● Automatically convert Functions into a Task using decorators: ○ @airflow.decorators.task ○ @dag.task ● Task ids are automatically generated ● Pluggable XCom Storage Engine: ○ Store and retrieve data in GCS, S3, etc ● Simplifies writing DAG Code and increases readability ● Task dependencies are automatically set when a task uses data from another task. ● Backwards compatible
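
The mechanism behind "dependencies are set automatically when a task uses data from another task" can be shown with a toy model. This is not Airflow code: `ToyDag`, `TaskOutput` and the `run` method are hypothetical and exist only to illustrate how a decorator can turn function calls into tasks, derive task ids from function names, and record a dependency whenever one task's output is passed to another.

```python
class TaskOutput:
    """Placeholder for the value an upstream task will produce at run time."""

    def __init__(self, task_id):
        self.task_id = task_id


class ToyDag:
    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.upstream = {}     # task_id -> set of upstream task ids
        self._callables = {}   # task_id -> original function
        self._args = {}        # task_id -> arguments given at definition time

    def task(self, fn):
        """Decorator: calling the function registers a task instead of running it."""
        def wrapper(*args):
            task_id = fn.__name__  # task id generated from the function name
            self.upstream[task_id] = {
                arg.task_id for arg in args if isinstance(arg, TaskOutput)
            }
            self._callables[task_id] = fn
            self._args[task_id] = args
            return TaskOutput(task_id)
        return wrapper

    def run(self):
        """Run tasks in registration order, which is already a valid order
        because a task can only consume outputs of tasks registered before it."""
        results = {}
        for task_id, fn in self._callables.items():
            args = [
                results[arg.task_id] if isinstance(arg, TaskOutput) else arg
                for arg in self._args[task_id]
            ]
            results[task_id] = fn(*args)
        return results


dag = ToyDag("toy_etl")


@dag.task
def extract():
    return [1, 2, 3]


@dag.task
def transform(values):
    return sum(values)


@dag.task
def load(total):
    return f"loaded total={total}"


load(transform(extract()))  # no explicit dependency declarations
results = dag.run()
print(dag.upstream)
print(results["load"])
```

In this sketch the returned placeholder plays the part that XCom plays in Airflow: the function bodies never mention it, yet the data and the dependency both travel through it.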

Official Docker Image & Helm Chart

Official Container Image ● Alpha-quality image was available since Airflow 1.10.10 ● Beta-quality community image available since Airflow 1.10.11 ● Available at ○ Run docker pull apache/airflow to pull the latest image

Official Container Image ● Supported Python Versions: 2.7, 3.5, 3.6, 3.7, 3.8 ● Size: ~ 210 MB (compressed size) ● Uses python slim-buster images as the base image by default

Image Customization options ● Choose Base image (python) ● Install Airflow from PyPI ● Install from GitHub branch/tag ● Install additional extras, python deps, apt dev/runtime deps ● Choose different UID/GID ● Choose different AIRFLOW_HOME ● Choose different HOME dir

Image Customization options Details:

Helm Chart ● Donated by Astronomer ○ Battle-tested & used by hundreds of production deployments run by Astronomer ● Uses official Airflow Docker Image ● Currently used in Airflow CI to run Kubernetes tests ● Check for details

Helm Chart ● Supports KEDA (Kubernetes-based Event Driven Autoscaling) ● Supports Sequential, Local, Celery and Kubernetes Executor ● Several options to mount DAGs: ○ using git-sync side-car without Persistence ○ from a PVC ● Liveness Probe restarts Scheduler on heartbeat failure

Providers Package

Providers Package ● As part of AIP-21 all the contents (hooks/operators/sensors) from airflow/contrib directory were grouped into different providers and moved to airflow/providers packages.
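
In practice the move means import paths change. The sketch below rewrites two old `airflow.contrib` paths to their provider locations; the two mappings are given as examples to the best of my knowledge and should be checked against the documentation of the release in use, and the `RENAMES` table and `new_import_line` helper are hypothetical, not part of Airflow.

```python
# Old contrib path -> new providers path (illustrative subset only).
RENAMES = {
    "airflow.contrib.operators.ssh_operator.SSHOperator":
        "airflow.providers.ssh.operators.ssh.SSHOperator",
    "airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook":
        "airflow.providers.google.cloud.hooks.gcs.GCSHook",
}


def new_import_line(old_path):
    """Return the import statement that replaces an old contrib import."""
    new_path = RENAMES[old_path]
    module, _, name = new_path.rpartition(".")
    return f"from {module} import {name}"


line = new_import_line("airflow.contrib.operators.ssh_operator.SSHOperator")
print(line)  # from airflow.providers.ssh.operators.ssh import SSHOperator
```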

Backport Providers Package ● Bring Airflow 2.0 providers to 1.10.* ● Packages per-provider ● 58 packages ● Python 3.6+ only ● Automatically tested on CI ● Released on a separate cadence from Airflow

Other Notable Changes ...

Other notable changes ● Python 3 only ✔ Python 2.7 unsupported upstream since Jan 1, 2020 ● "RBAC" UI is now the only UI ✔ Was a config option before, now the only option. Charts/data profiling removed due to security risks ● Improvements to SubDags ✔ Previously ran as a backfill job and didn’t always respect Pools / Concurrency limits ● CLI Refactor ✔ Commands are now grouped functionally

Road to Airflow 2.0

When will Airflow 2.0 be available?

How to upgrade to 2.0 safely ● Install the latest 1.10 release ● Run airflow upgrade-check (doesn't exist, yet #8765) ● Fix any warnings ● Upgrade Airflow

Links / References

Links ● Airflow Summit Talks: ○ Keynote: Future of Airflow by Kaxil, Ash, Jarek, Kamil, Tomek & Daniel ○ AIP-31: Airflow functional DAG definition by Gerard Casas Saez ○ Production Docker image for Apache Airflow by Jarek Potiuk ● AIPs (Airflow Improvement Proposals): ○ AIP-15 Support Multiple-Schedulers for HA & Better Scheduling Performance ○ AIP-21 Changes in import paths ○ AIP-24 DAG Serialization ○ AIP-32 Airflow REST API ○ AIP-36 DAG Versioning

Links ● Contact Me: ○ Twitter: @kaxil

