Slide 1

Slide 1 text

Airflow Best Practices & Roadmap to Airflow 2.0 Kaxil Naik - Airflow PMC, Core Committer & Release Manager. Senior Data Engineer @ Astronomer.io

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

What’s new in Airflow 1.10.8 / 1.10.9?

Slide 4

Slide 4 text

Add tags to DAGs and use them for filtering

Slide 5

Slide 5 text

Add tags to DAGs and use them for filtering

Slide 6

Slide 6 text

Allow passing conf in “Add DAG Run” view

Slide 7

Slide 7 text

Allow DAGs to run for future execution dates ● Only works for DAGs with no schedule_interval ● Useful for companies working across several time zones and relying on external triggers ● Enable this feature with the following environment variable: AIRFLOW__SCHEDULER__ALLOW_TRIGGER_IN_FUTURE=True

Slide 8

Slide 8 text

And much more… ● Several bug fixes ● Documentation improvements ● Complete changelog at https://github.com/apache/airflow/blob/1.10.8/CHANGELOG.txt

Slide 9

Slide 9 text

Tips & Best Practices

Slide 10

Slide 10 text

Writing DAGs

Slide 11

Slide 11 text

Use DAG as a Context Manager ● Use a context manager to assign tasks to a particular DAG.

Slide 12

Slide 12 text

DAG without a context manager
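
The code on this slide is an image; below is a minimal sketch of what it illustrates (DAG and task names are illustrative, not from the talk).

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Without a context manager, the DAG has to be passed to every task explicitly.
dag = DAG("example_dag", start_date=datetime(2020, 1, 1), schedule_interval="@daily")

task_a = DummyOperator(task_id="task_a", dag=dag)
task_b = DummyOperator(task_id="task_b", dag=dag)
task_a >> task_b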

Slide 13

Slide 13 text

DAG with a context manager
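
Again a sketch, this time using the context-manager form, with the same illustrative names as above.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Tasks created inside the "with" block are assigned to the DAG automatically,
# so there is no need to pass dag=dag to each operator.
with DAG("example_dag", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    task_a = DummyOperator(task_id="task_a")
    task_b = DummyOperator(task_id="task_b")
    task_a >> task_b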

Slide 14

Slide 14 text

Using a list to set task dependencies

Slide 15

Slide 15 text

Using a list to set task dependencies ● The normal way vs. being a pro! (sketched below)
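
The slide contrasts the two styles as code screenshots; a minimal sketch of the idea (task names are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG("list_dependencies_example", start_date=datetime(2020, 1, 1)) as dag:
    extract = DummyOperator(task_id="extract")
    transform_a = DummyOperator(task_id="transform_a")
    transform_b = DummyOperator(task_id="transform_b")
    transform_c = DummyOperator(task_id="transform_c")

    # The normal way: one dependency statement per downstream task.
    #   extract >> transform_a
    #   extract >> transform_b
    #   extract >> transform_c

    # Being a pro: the same fan-out expressed with a single list.
    extract >> [transform_a, transform_b, transform_c]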

Slide 16

Slide 16 text

Use default_args to avoid repeating arguments ● Airflow allows passing a dictionary of arguments that will be available to all the tasks in that DAG.

Slide 17

Slide 17 text

Use default_args to avoid repeating arguments
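
A minimal sketch of the pattern (argument values are illustrative):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Defined once, applied to every task in the DAG.
default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG("default_args_example",
         default_args=default_args,
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily") as dag:
    # Neither task repeats the retry settings; both inherit them from default_args.
    extract = DummyOperator(task_id="extract")
    load = DummyOperator(task_id="load")
    extract >> load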

Slide 18

Slide 18 text

The “params” argument ● “params” is a dictionary of DAG-level parameters made accessible in templates.

Slide 19

Slide 19 text

The “params” argument ● These params can be overridden at the task level. ● Ideal for writing parameterized DAGs.
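
A minimal sketch of a parameterized DAG using “params” (names and values are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG("params_example",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily",
         # DAG-level params, accessible in templates as {{ params.<key> }}.
         params={"table": "events", "env": "dev"}) as dag:

    # Uses the DAG-level values as-is.
    load_dev = BashOperator(
        task_id="load_dev",
        bash_command="echo loading {{ params.table }} into {{ params.env }}",
    )

    # Overrides "env" at the task level; "table" still comes from the DAG.
    load_prod = BashOperator(
        task_id="load_prod",
        bash_command="echo loading {{ params.table }} into {{ params.env }}",
        params={"env": "prod"},
    )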

Slide 20

Slide 20 text

Store Sensitive data in Connections ● Don’t put passwords in your DAG files! ● Use Airflow Connections to store any kind of sensitive data such as passwords, private keys, etc. ● Airflow stores the connection data in the Airflow metadata DB ● If you install the “crypto” package (“pip install apache-airflow[crypto]”), the password field in Connections will be encrypted in the DB too.

Slide 21

Slide 21 text

Store Sensitive data in Connections
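
A sketch of reading credentials from a Connection instead of hard-coding them; the conn_id “my_postgres” is illustrative and would be created in the UI, via the CLI, or via an environment variable.

from airflow.hooks.base_hook import BaseHook

# The credentials live in the Airflow Connection, not in the DAG file.
conn = BaseHook.get_connection("my_postgres")
host, login = conn.host, conn.login  # conn.password is available here too, but never appears in code

In practice most hooks also accept a conn_id directly, so you rarely need to touch the connection object yourself.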

Slide 22

Slide 22 text

Restrict the number of Airflow variables in your DAG ● Any call to Variables means a connection to the metadata DB. ● Your DAG files are parsed every X seconds; using a large number of variables in your DAG may end up saturating the number of allowed connections to your database.

Slide 23

Slide 23 text

Restrict the number of Airflow variables in your DAG ● Use Environment Variables instead ● Or have a single Airflow variable per DAG and store all the values as JSON

Slide 24

Slide 24 text

Restrict the number of Airflow variables in your DAG

Slide 25

Slide 25 text

Restrict the number of Airflow variables in your DAG ● Access them by deserializing JSON
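
A sketch of the single-Variable-per-DAG pattern; the Variable name and keys are illustrative.

from airflow.models import Variable

# One Variable, stored as JSON, e.g. set in the UI as:
#   my_dag_config = {"source_bucket": "raw-data", "target_table": "events", "batch_size": 500}
config = Variable.get("my_dag_config", deserialize_json=True)

source_bucket = config["source_bucket"]
batch_size = config["batch_size"]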

Slide 26

Slide 26 text

Avoid code outside of an operator in your DAG files ● Airflow will parse your DAG files over and over (and more often than your schedule interval), and any code at the top level of the file will run on every parse. ● This can cause the Scheduler to be slow, and hence tasks might end up being delayed.
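
A sketch of the difference (helper names are hypothetical):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Bad: anything at module level runs on every scheduler parse, e.g.
#   rows = expensive_database_query()

def _do_work(**context):
    # Better: expensive work lives inside the callable and only runs
    # when the task itself executes.
    pass

with DAG("top_level_code_example", start_date=datetime(2020, 1, 1)) as dag:
    do_work = PythonOperator(task_id="do_work",
                             python_callable=_do_work,
                             provide_context=True)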

Slide 27

Slide 27 text

Stop using Python 2 ● Python 2 reached end of life in January 2020 ● We have dropped Python 2 support on the Airflow master branch ● Airflow 1.10.* is the last series to support Python 2

Slide 28

Slide 28 text

Use the Flask-AppBuilder based UI ● Enabled using “rbac=True” under “[webserver]” ● Airflow ships with a set of roles by default: Admin, User, Op, Viewer, and Public ● Creating custom roles is possible ● DAG-level access control: users can declare read or write permissions inside the DAG file, as shown below
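
A minimal sketch, assuming the access_control argument available in recent 1.10 releases; the role name is illustrative.

from datetime import datetime
from airflow import DAG

# Grant the custom "team-a" role read-only access to this DAG only.
dag = DAG(
    "access_control_example",
    start_date=datetime(2020, 1, 1),
    access_control={"team-a": {"can_dag_read"}},
)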

Slide 29

Slide 29 text

Use the Flask-AppBuilder based UI ● The Flask-AppBuilder based UI will be the default UI from Airflow 2.0 ● The old Flask-Admin based UI will be removed in 2.0; it has already been removed on the Airflow master branch

Slide 30

Slide 30 text

Configuring Airflow for Production

Slide 31

Slide 31 text

Pick an Executor ● SequentialExecutor - Runs tasks sequentially ● LocalExecutor - Runs tasks in parallel on the same machine using subprocesses ● CeleryExecutor - Runs tasks in parallel across different worker machines ● KubernetesExecutor - Runs tasks on separate Kubernetes pods
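
The executor is selected with the “executor” option under “[core]”, for example via an environment variable (LocalExecutor chosen here purely as an illustration): AIRFLOW__CORE__EXECUTOR=LocalExecutor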

Slide 32

Slide 32 text

Set configs using environment variables ● Config: AIRFLOW__${SECTION}__${NAME} Example: AIRFLOW__CORE__SQL_ALCHEMY_CONN ● Connections: AIRFLOW_CONN_${CONN_ID} Example: AIRFLOW_CONN_BIGQUERY_PROD
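
For example (values are illustrative; connections set this way are expressed as URIs): AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@dbhost:5432/airflow and AIRFLOW_CONN_MY_POSTGRES=postgres://user:password@dbhost:5432/mydb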

Slide 33

Slide 33 text

Apply migrations using “airflow upgradedb” ● Run “airflow upgradedb” instead of “airflow initdb” on your PROD cluster ● “initdb” creates example connections in addition to applying migrations ● “upgradedb” only applies migrations

Slide 34

Slide 34 text

Enforce Policies ● To define a policy, add an airflow_local_settings module to your PYTHONPATH that defines this policy function. ● It receives a TaskInstance object and can alter it where needed. ● Example usages: ○ Enforce a specific queue (say the spark queue) for tasks using the SparkOperator to make sure that these task instances get wired to the right workers ○ Force all task instances running on an execution_date older than a week to run in a backfill pool.
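
A minimal sketch of such a module, assuming the policy(task) hook available in the 1.10 series; the operator class name and queue are illustrative.

# airflow_local_settings.py -- must be importable from the PYTHONPATH

def policy(task):
    # Route all Spark tasks to a dedicated queue so they land on the right workers.
    if task.__class__.__name__ == "SparkSubmitOperator":
        task.queue = "spark"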

Slide 35

Slide 35 text

Airflow 2.0 Roadmap

Slide 36

Slide 36 text

Airflow 2.0 Roadmap ● DAG Serialization ● Revamped real-time UI ● Production-grade modern API ● Official Docker Image & Helm chart ● Scheduler Improvements ● Data Lineage

Slide 37

Slide 37 text

DAG Serialization

Slide 38

Slide 38 text

DAG Serialization ● Make the Webserver stateless ● DAGs are parsed by the Scheduler and stored in the DB, from where the Webserver reads them ● Phase-1 implemented and released in Airflow >= 1.10.7 ● For Airflow 2.0 we want the Scheduler to read from the DB as well, and pass on the responsibility of parsing DAGs and saving them to the DB to a “Serializer” or some other component.
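
Phase-1 can be switched on in existing deployments; a sketch, assuming Airflow >= 1.10.7: AIRFLOW__CORE__STORE_SERIALIZED_DAGS=True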

Slide 39

Slide 39 text

Revamped real-time UI ● No more refreshing the page manually to check the status! ● Modern design ● Planning to use React to build the UI ● Use APIs for communication, not direct DB/file access

Slide 40

Slide 40 text

Production-grade modern API ● The API has been experimental for a long time ● The CLI & webserver should use the API instead of duplicating code ● Better authentication/authorization ● Conform to OpenAPI standards

Slide 41

Slide 41 text

Official Docker Image & Helm chart ● Currently the popular solutions are the “puckel-airflow” Docker image and the stable Airflow chart in the Helm repo. ● However, we want an official image and Helm chart that support all features and are maintained by the project.

Slide 42

Slide 42 text

Thanks

Slide 43

Slide 43 text

We are Hiring! Visit https://careers.astronomer.io/