Slide 1

What’s upcoming in Airflow 2? PiterPy, 4 August 2020. Kaxil Naik, Senior Data Engineer @ Astronomer. Twitter: @kaxil

Slide 2

Who am I? ● Airflow Committer, PMC Member & Release Manager ● Senior Data Engineer @ Astronomer ○ Part of the Airflow team ○ Work full-time on Airflow ● Previously worked at DataReply ● Master’s in Data Science & Analytics from Royal Holloway, University of London ● Twitter: https://twitter.com/kaxil ● Github: https://github.com/kaxil/ ● LinkedIn: https://www.linkedin.com/in/kaxil/

Slide 3

About Astronomer ● Founded in 2018, Astronomer is the commercial developer of Apache Airflow, the open-source standard for data orchestration ● 100+ enterprise customers around the world ● Locations: San Francisco, London, New York, Cincinnati, Hyderabad ● 3 of the top 10 Airflow committers are Astronomer advisors or employees ● Mission: to help organizations adopt new standards for data orchestration

Slide 4

Airflow 2 - Highlights ● Scheduler High Availability ● DAG Serialization ● DAG Versioning ● Stable REST API ● Functional DAGs ● Official Docker Image & Helm Chart ● Providers Packages ● and other notable changes ...

Slide 5

High Availability

Slide 6

Scheduler High Availability Goals: ● Performance - reduce task-to-task schedule "lag" ● Scalability - increase task throughput by horizontal scaling ● Resiliency - kill a scheduler and have tasks continue to be scheduled

Slide 7

Scheduler High Availability - Design ● Active-active model: each scheduler does everything ● Uses the existing database - no new components needed, no extra operational burden ● Plan to use row-level locks in the DB (SELECT … FOR UPDATE) ● Will re-evaluate if performance/stress testing shows the need
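The claim-a-batch-under-a-lock pattern the design describes can be sketched in plain Python. This is an illustration of the idea only, not Airflow's internals: the database row lock (SELECT … FOR UPDATE) is stood in for by an in-process mutex, and all names are invented.

```python
import threading

queued = [f"ti{i}" for i in range(6)]  # queued task instances, shared state
lock = threading.Lock()                # plays the role of SELECT ... FOR UPDATE
claimed = {}                           # task instance -> scheduler that claimed it

def scheduler_loop(name, batch_size=2):
    """One active-active scheduler: repeatedly claim a batch atomically."""
    while True:
        with lock:                     # critical section: no two schedulers
            if not queued:             # can claim the same task instance
                return
            batch = queued[:batch_size]
            del queued[:batch_size]
            for ti in batch:
                claimed[ti] = name

threads = [threading.Thread(target=scheduler_loop, args=(f"sched-{n}",))
           for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every task instance was claimed exactly once, by exactly one scheduler.
assert len(claimed) == 6 and queued == []
```

Because every claim happens inside the critical section, killing any one "scheduler" thread leaves the rest able to drain the queue, which is the resiliency goal from the previous slide.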

Slide 8

Example HA configuration

Slide 9

Measuring Performance The key metric we define is "scheduler lag": ● The amount of "wasted" time not running tasks ● ti.start_date - max(t.end_date for t in upstream_tis) ● Zero is the goal (we'll never get to 0) ● Tasks are "echo true" -- tiny, but still actually executed
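The lag definition above translates directly into a few lines of Python (timestamps here are made up for illustration):

```python
from datetime import datetime, timedelta

def scheduler_lag(ti_start_date, upstream_end_dates):
    """Scheduler lag: time between the last upstream task finishing
    and this task instance starting."""
    return ti_start_date - max(upstream_end_dates)

# Hypothetical run: upstreams finish at t0 and t0+5s, task starts at t0+7s.
t0 = datetime(2020, 8, 4, 12, 0, 0)
lag = scheduler_lag(t0 + timedelta(seconds=7),
                    [t0, t0 + timedelta(seconds=5)])
assert lag == timedelta(seconds=2)   # 2 seconds of "wasted" time
```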

Slide 10

Preliminary performance results
Case: 100 DAG files | 1 DAG per file | 10 tasks per DAG | 1 run per DAG
Workers: 4 | Parallelism: 64
● 1.10.10: lag 54.17s (σ 19.38), total runtime 22m22s
● HA branch, 1 scheduler: lag 4.39s (σ 1.40), total runtime 1m10s
● HA branch, 3 schedulers: lag 1.96s (σ 0.51), total runtime 48s

Slide 11

DAG Serialization

Slide 12

DAG Serialization

Slide 13

Serialized DAG Representation

Slide 14

DAG Serialization (Tasks Completed) ● Stateless Webserver: the Scheduler parses the DAG files, serializes them in JSON format & saves them in the metadata DB ● Lazy loading of DAGs: instead of loading the entire DagBag when the Webserver starts, each DAG is loaded on demand. This reduces Webserver startup time and memory usage, which is notable with a large number of DAGs ● Deploying new DAGs to Airflow no longer requires long restarts of the Webserver (if DAGs are baked into the Docker image) ● Option to use a JSON library of choice for serialization (default is the built-in json library) ● Paves the way for DAG Versioning & Scheduler HA
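The mechanism can be sketched with the standard json module. The field names below are illustrative, not the actual serialized-DAG schema: the scheduler stores a JSON form of the parsed DAG in the metadata DB, and the webserver only ever reads that JSON back, so it never has to import DAG files itself.

```python
import json

# A toy "parsed DAG" (schema is invented for illustration).
dag = {
    "dag_id": "example_etl",
    "schedule_interval": "@daily",
    "tasks": [
        {"task_id": "extract", "operator": "PythonOperator"},
        {"task_id": "load", "operator": "PythonOperator"},
    ],
}

serialized = json.dumps(dag, sort_keys=True)  # what the scheduler writes to the DB
restored = json.loads(serialized)             # what the stateless webserver loads
assert restored == dag                        # round-trips losslessly
```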

Slide 15

DAG Serialization (Tasks In-Progress for Airflow 2.0) ● Decouple DAG parsing and serializing from the scheduling loop ● The Scheduler will fetch DAGs from the DB ● DAGs will be parsed, serialized and saved to the DB by a separate component, the "Serializer" / "DAG Parser" ● This should reduce the delay in scheduling tasks when the number of DAGs is large

Slide 16

DAG Versioning

Slide 17

DAG Versioning Current problems: ● A change in DAG structure affects viewing previous DagRuns too ● It is not possible to view the code or DAG shape associated with an old DagRun ● Checking the logs of a deleted task in the UI is not straightforward

Slide 18

Dag Versioning (Current Problem)

Slide 19

Dag Versioning (Current Problem) A new task is shown in the Graph View for older DAG Runs too, with “no status”.

Slide 20

Dag Versioning (Proposed Solution - Version Badge) ● DAG Version Badge: ○ A hash of the serialized DAG ○ Used to identify a DAG version ○ Links to the associated code in the Code View
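The "hash of the serialized DAG" idea can be sketched as follows. The hash algorithm and badge length here are assumptions for illustration, not what Airflow will necessarily use:

```python
import hashlib
import json

def dag_version(serialized_dag: dict) -> str:
    """Illustrative version badge: hash of the canonical JSON form of a
    serialized DAG, so any structural change produces a new version."""
    canonical = json.dumps(serialized_dag, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:8]

v1 = dag_version({"dag_id": "etl", "tasks": ["extract", "load"]})
v2 = dag_version({"dag_id": "etl", "tasks": ["extract", "transform", "load"]})
assert v1 != v2   # adding a task changes the badge
```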

Slide 21

Dag Versioning (Proposed Solution) ● The Graph View shows the actual version that was run ● Mixed version: if a DAG changed mid-run, an internal "mixed" version is calculated so the view still reflects what actually ran. Two different versions (mixed version)

Slide 22

Dag Versioning Goals: ● Scope is limited to making sure the visibility behaviour of Airflow is correct ○ No change in execution behaviour ○ Execution will continue to be based on the most recent version of the DAG ● Support for storing multiple versions of serialized DAGs ● All views show the correct DAG associated with that DagRun ● Baked-in maintenance DAGs ○ Clean up old DagRuns & associated serialized DAGs

Slide 23

Stable REST API

Slide 24

API: follows the OpenAPI 3.0 specification

Slide 25

Stable REST API ● Built using Connexion ○ OpenAPI-compliant ○ Spec-first approach ○ Maps endpoints to Python functions ● Migration guide from the old experimental API to the new stable REST API ● Integration with the current Flask-AppBuilder permission model ○ The experimental API didn't have a permission model; it was all or nothing
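"Spec-first" with Connexion means the endpoints are declared in an OpenAPI document and each operation is mapped to a Python handler via its operationId. A minimal illustrative fragment (the path, module and function names below are hypothetical, not Airflow's actual spec):

```yaml
openapi: 3.0.0
info:
  title: Example API
  version: "1.0"
paths:
  /dags:
    get:
      # Connexion imports and calls this Python function for the request
      operationId: my_service.handlers.list_dags
      responses:
        "200":
          description: List of DAGs
```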

Slide 26

Clients in multiple languages for the API ● Using OpenAPI Generator, clients can be generated for multiple languages from the same spec

Slide 27

Functional DAGs

Slide 28

Functional DAGs Current problems: ● No explicit way of passing data between tasks ○ XCom exists, but it is hidden in task execution ○ XCom isn't intuitive for beginners ● Boilerplate code needed for: ○ Task dependencies ○ Using a Python function in PythonOperator ● Can't pass large data between tasks with XCom

Slide 29

Example DAG - the normal way "load" uses the data from the "transform" task, but you still need to define an explicit dependency between them. Too much boilerplate code to read data from the previous task using XCom.

Slide 30

Example DAG - the functional way Simple and readable code to get the data (no knowledge of XCom or Jinja needed). The decorator converts the function into an Operator, lets you pass arguments in a more Pythonic way, and assigns the task to the DAG. No need to explicitly define task dependencies.

Slide 31

Functional DAGs ● Automatically convert functions into tasks using decorators: ○ @airflow.decorators.task ○ @dag.task ● Task ids are automatically generated ● Pluggable XCom storage engine: ○ Store and retrieve data in GCS, S3, etc. ● Simplifies writing DAG code and increases readability ● Task dependencies are automatically set when a task uses data from another task ● Backwards compatible
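The "dependencies set automatically when one task uses another's data" mechanic can be sketched in plain Python. This is an illustration of the idea only, not Airflow's decorator API: calling a decorated function returns a handle, and passing that handle into another decorated function records the dependency.

```python
deps = {}  # task_id -> set of upstream task_ids (toy dependency graph)

class TaskHandle:
    """Stands in for the object a decorated task call would return."""
    def __init__(self, task_id):
        self.task_id = task_id

def task(fn):
    """Toy version of a functional-DAG decorator: task ids come from the
    function name, and any TaskHandle argument becomes an upstream dep."""
    def wrapper(*args):
        task_id = fn.__name__
        deps[task_id] = {a.task_id for a in args if isinstance(a, TaskHandle)}
        return TaskHandle(task_id)
    return wrapper

@task
def extract():
    return {"rows": 100}

@task
def transform(data):
    return data

@task
def load(data):
    return data

load(transform(extract()))                  # reads like plain Python ...
assert deps["transform"] == {"extract"}     # ... but the graph is recorded
assert deps["load"] == {"transform"}
```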

Slide 32

Official Docker Image & Helm Chart

Slide 33

Official Container Image ● An alpha-quality image has been available since Airflow 1.10.10 ● A beta-quality community image has been available since Airflow 1.10.11 ● Available at https://hub.docker.com/r/apache/airflow ○ Run docker pull apache/airflow to pull the latest image

Slide 34

Official Container Image ● Supported Python versions: 2.7, 3.5, 3.6, 3.7, 3.8 ● Size: ~210 MB (compressed) ● Uses Python slim-buster images as the base image by default

Slide 35

Image Customization options ● Choose Base image (python) ● Install Airflow from PyPI ● Install from GitHub branch/tag ● Install additional extras, python deps, apt dev/runtime deps ● Choose different UID/GID ● Choose different AIRFLOW_HOME ● Choose different HOME dir
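Besides the build args on the official Dockerfile, the simplest customization is extending the image in your own Dockerfile. A minimal sketch (the extra dependency is just an example; see the IMAGES.rst link on the next slide for the full set of build options):

```dockerfile
# Hypothetical example: add extra Python dependencies on top of the
# official image rather than rebuilding it from source.
FROM apache/airflow:1.10.11
RUN pip install --no-cache-dir --user requests
```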

Slide 36

Image Customization options Details: https://github.com/apache/airflow/blob/master/IMAGES.rst#ci-images

Slide 37

Helm Chart ● Donated by Astronomer ○ Battle-tested & used by hundreds of production deployments run by Astronomer ● Uses official Airflow Docker Image ● Currently used in Airflow CI to run Kubernetes tests ● Check https://github.com/apache/airflow/tree/master/chart for details

Slide 38

Helm Chart ● Supports KEDA (Kubernetes-based Event-Driven Autoscaling) ● Supports the Sequential, Local, Celery and Kubernetes Executors ● Several options to mount DAGs: ○ using a git-sync sidecar without persistence ○ from a PVC ● Liveness probe restarts the Scheduler on heartbeat failure
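As a sketch, choosing the executor and the git-sync DAG mount are values overrides. The key names below follow the chart in the apache/airflow repo at the time of writing, but treat them as assumptions and check the chart's own documentation; the repo URL is a placeholder:

```yaml
# values.yaml overrides (illustrative)
executor: CeleryExecutor
dags:
  gitSync:
    enabled: true
    repo: https://github.com/example/my-dags.git
    branch: master
```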

Slide 39

Providers Package

Slide 40

Providers Package ● As part of AIP-21, all the contents (hooks/operators/sensors) of the airflow/contrib directory were grouped into different providers and moved to airflow/providers packages.

Slide 41

Backport Providers Package ● Bring Airflow 2.0 providers to 1.10.* ● One package per provider ● 58 packages ● Python 3.6+ only ● Automatically tested on CI ● Released on a separate cadence from Airflow

Slide 42

Other Notable Changes ...

Slide 43

Other notable changes ● Python 3 only ○ Python 2.7 has been unsupported upstream since Jan 1, 2020 ● The "RBAC" UI is now the only UI ○ Previously a config option, now the only option; the charts / data-profiling views were removed due to security risks ● Improvements to SubDAGs ○ SubDAGs ran as a backfill job and didn't always respect Pools / concurrency limits ● CLI refactor ○ Commands are now grouped functionally

Slide 44

Road to Airflow 2.0

Slide 45

When will Airflow 2.0 be available?

Slide 46

How to upgrade to 2.0 safely ● Install the latest 1.10 release ● Run airflow upgrade-check (doesn't exist yet; see #8765) ● Fix any warnings ● Upgrade Airflow

Slide 47

Links / References

Slide 48

Links ● Airflow Summit Talks: ○ Keynote: Future of Airflow by Kaxil, Ash, Jarek, Kamil, Tomek & Daniel ○ AIP-31: Airflow functional DAG definition by Gerard Casas Saez ○ Production Docker image for Apache Airflow by Jarek Potiuk ● AIPs (Airflow Improvement Proposals): ○ AIP-15 Support Multiple-Schedulers for HA & Better Scheduling Performance ○ AIP-21 Changes in import paths ○ AIP-24 DAG Serialization ○ AIP-32 Airflow REST API ○ AIP-36 DAG Versioning

Slide 49

Links ● Airflow ○ Repo: https://github.com/apache/airflow ○ Website: https://airflow.apache.org/ ○ Blog: https://airflow.apache.org/blog/ ○ Documentation: https://airflow.apache.org/docs/stable/ ○ Slack: https://s.apache.org/airflow-slack ○ Twitter: https://twitter.com/apacheairflow ● Contact Me: ○ Twitter: https://twitter.com/kaxil ○ Github: https://github.com/kaxil/ ○ LinkedIn: https://www.linkedin.com/in/kaxil/

Slide 50

Thank You!