Slide 1

Slide 1 text

Airflow: Deferrable “Async” Operators
How you can save tons of money!!
Kaxil Naik
Open Source Summit 2022

Slide 2

Slide 2 text

Who am I?
● Committer & PMC Member of Apache Airflow
● Director of Airflow Engineering @ Astronomer
@kaxil

Slide 3

Slide 3 text

What is Apache Airflow?

Slide 4

Slide 4 text

A platform to programmatically author, schedule, and monitor workflows

Slide 5

Slide 5 text

Example DAG

Slide 6

Slide 6 text

● Why deferrable operators? The problem with current operators & sensors
● What are they? And how do they work?
● Available async operators: how to find & use them

Slide 7

Slide 7 text

Why deferrable operators?

Slide 8

Slide 8 text

[Diagram: a timeline (Time axis) with Scheduler and Worker lanes]

Slide 9

Slide 9 text

Typical Operator
[Diagram: on the Time axis, a Worker submits a Spark job, then polls the Spark cluster until job completion.]

Slide 10

Slide 10 text

Typical Sensor
[Diagram: on the Time axis, a Worker waits for files to arrive in S3 until they have arrived.]

Slide 11

Slide 11 text

What a waste!
[Diagram: a Sensor waits for files to arrive in S3, and an Operator submits a Spark job and polls the Spark cluster until job completion. The waiting and polling periods on the Worker are labeled "Wasted resources".]

Slide 13

Slide 13 text

Multiple Sensors
[Diagram: six tasks on the Time axis, each holding a worker slot until done: polling for Spark job completion (twice), waiting for files to arrive in S3 (twice), polling for BigQuery job completion, and waiting for files to arrive in GCS.]
Imagine the cost! Multiple worker slots blocked!!

Slide 14

Slide 14 text

What are deferrable operators?

Slide 15

Slide 15 text

[Diagram: a timeline (Time axis) with Scheduler, Worker, and Triggerer lanes]

Slide 16

Slide 16 text

Async Operator
[Diagram: on the Time axis, the Worker submits the Spark job and frees its slot, the Triggerer polls the Spark cluster, and a Worker handles job completion.]

Slide 17

Slide 17 text

Worker Slot (Sync vs Async)
[Diagram: one worker slot over time. Sync Operator row: Task 1, Task 2. Async Operator row: Task 1, Task 2, Task 3, Task 2, Task 1, Task 4.]

Slide 18

Slide 18 text

Trigger - a new concept
[Diagram: Async Operator timeline. The Worker submits the Spark job and frees its slot, the Triggerer polls the Spark cluster, and a Worker handles job completion.]

Slide 19

Slide 19 text

● Task runs on the Worker, then “defers” itself
  Example: submits an API call and stores the job_id in the DB
● Runs on the Triggerer
  Polls asynchronously until the criteria are met, then stores the response in the DB
● Back to the Worker
  Shows the response in the logs and sets the Task state
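The three phases above can be modeled in plain Python. This is a minimal sketch, not Airflow's implementation: every class and function name below (TaskDeferred, JobDoneTrigger, SubmitJobOperator, run_task, finish_job_later) is hypothetical, a dict stands in for the DB and the external system, and a single coroutine plays both Worker and Triggerer.

```python
import asyncio


class TaskDeferred(Exception):
    """Hypothetical signal: the task wants to pause until `trigger` fires."""

    def __init__(self, trigger, method_name):
        super().__init__("task deferred")
        self.trigger = trigger
        self.method_name = method_name


class JobDoneTrigger:
    """Hypothetical trigger: polls a fake job store until the job is done."""

    def __init__(self, job_id, job_store):
        self.job_id = job_id
        self.job_store = job_store

    async def run(self):
        # Async polling: the event loop can run other triggers between checks.
        while self.job_store[self.job_id] != "done":
            await asyncio.sleep(0.01)
        return {"job_id": self.job_id, "status": "done"}


class SubmitJobOperator:
    """Hypothetical operator that follows the three phases on the slide."""

    def __init__(self, job_store):
        self.job_store = job_store

    def execute(self):
        # Phase 1 (Worker): submit the job, remember its id, then defer.
        job_id = "job-1"
        self.job_store[job_id] = "running"
        raise TaskDeferred(JobDoneTrigger(job_id, self.job_store), "execute_complete")

    def execute_complete(self, event):
        # Phase 3 (Worker): report the trigger's response and finish.
        return f"{event['job_id']} finished with status {event['status']}"


async def finish_job_later(job_store, job_id, delay):
    # Stands in for the external system completing the job.
    await asyncio.sleep(delay)
    job_store[job_id] = "done"


async def run_task(operator):
    try:
        operator.execute()
    except TaskDeferred as deferred:
        # Phase 2 (Triggerer): no worker slot is needed while this awaits.
        completion = asyncio.create_task(
            finish_job_later(operator.job_store, deferred.trigger.job_id, 0.05)
        )
        event = await deferred.trigger.run()
        await completion
        # Resume the task at the method it named when deferring.
        return getattr(operator, deferred.method_name)(event)


result = asyncio.run(run_task(SubmitJobOperator({})))
print(result)
```

The point of the sketch is the hand-off: the operator names the method to resume at, and only the small event payload travels from the trigger back to the task.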

Slide 20

Slide 20 text

A Trigger is different from an Operator

Slide 21

Slide 21 text

● Must be asynchronous & quick
  So the Triggerer can run thousands of them per CPU core
● Should not persist state
  So we can shuffle them around between Triggerers as needed
● Must support running multiple copies of itself
  For reliability during network partitions
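Why "asynchronous & quick" matters can be seen with nothing but the standard library. The sketch below is an illustration under assumptions, not the Triggerer itself: wait_trigger and run_all are hypothetical names, and asyncio.sleep stands in for a non-blocking poll of an external system.

```python
import asyncio
import time


async def wait_trigger(trigger_id, delay):
    # Non-blocking wait: control returns to the event loop while sleeping.
    await asyncio.sleep(delay)
    return trigger_id


async def run_all(count, delay):
    # All waits share one event loop in one thread.
    return await asyncio.gather(*(wait_trigger(i, delay) for i in range(count)))


start = time.perf_counter()
fired = asyncio.run(run_all(count=1000, delay=0.2))
elapsed = time.perf_counter() - start

print(f"{len(fired)} triggers fired in {elapsed:.2f}s")
```

Run one after another, the same 1000 waits would take about 200 seconds. A single blocking call such as time.sleep inside wait_trigger would stall every other wait on the loop, which is why a trigger must never block.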

Slide 22

Slide 22 text

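import asyncio
import datetime

from airflow.triggers.base import BaseTrigger, TriggerEvent
from airflow.utils import timezone


class DateTimeTrigger(BaseTrigger):
    def __init__(self, moment: datetime.datetime):
        super().__init__()
        self.moment = moment

    def serialize(self):
        # Classpath and kwargs used to re-create this trigger on a Triggerer
        return ("mymodule.DateTimeTrigger", {"moment": self.moment})

    async def run(self):
        # Sleep without blocking the event loop until the moment has passed
        while self.moment > timezone.utcnow():
            await asyncio.sleep(1)
        # Fire the event that resumes the deferred task
        yield TriggerEvent(self.moment)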
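The serialize() method is what lets a trigger be moved between Triggerers or run as several copies: it returns a classpath plus the keyword arguments needed to construct the trigger again. A minimal sketch of that round trip follows. MomentTrigger, TRIGGER_CLASSES, and rebuild are hypothetical names, and the registry lookup is an assumption made to keep the example self-contained, not how Airflow resolves classpaths.

```python
import datetime


class MomentTrigger:
    """Hypothetical stand-in for a trigger; not Airflow's BaseTrigger."""

    def __init__(self, moment):
        self.moment = moment

    def serialize(self):
        # Classpath plus the keyword arguments needed to call __init__ again.
        return ("sketch.MomentTrigger", {"moment": self.moment})


# Hypothetical registry; a real system would import the class from its path.
TRIGGER_CLASSES = {"sketch.MomentTrigger": MomentTrigger}


def rebuild(serialized):
    classpath, kwargs = serialized
    return TRIGGER_CLASSES[classpath](**kwargs)


original = MomentTrigger(datetime.datetime(2022, 1, 1, tzinfo=datetime.timezone.utc))

# Two independent copies, as if two Triggerers each picked up the same trigger.
copy_a = rebuild(original.serialize())
copy_b = rebuild(original.serialize())

print(copy_a.moment == copy_b.moment, copy_a is copy_b)
```

Because everything a copy needs comes from the serialized kwargs, nothing stored on the instance at runtime has to survive a move.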

Slide 23

Slide 23 text

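from datetime import timedelta

from airflow.sensors.base import BaseSensorOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitOneHourSensor(BaseSensorOperator):
    def execute(self, context):
        # Hand the wait over to the Triggerer and free the worker slot
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(hours=1)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # We have no more work to do here. Mark as complete.
        return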

Slide 24

Slide 24 text

Available Deferrable Operators

Slide 25

Slide 25 text

Core Airflow & Providers
https://github.com/apache/airflow/

Astronomer Providers
https://github.com/astronomer/astronomer-providers
50+ async operators

Slide 26

Slide 26 text

Astronomer Providers
https://github.com/astronomer/astronomer-providers
● 50+ async operators
● Built and maintained with ❤ by Astronomer
● Apache 2 licensed & open source
● Drop-in replacements for “sync” Operators
● OpenLineage support

Slide 27

Slide 27 text

Astronomer Providers
https://github.com/astronomer/astronomer-providers
Async Operators available for:
● AWS, Google Cloud, Microsoft Azure
● Databricks, Snowflake
● Kubernetes, Apache Livy, Apache Hive
● HTTP, Filesystem

from astronomer.providers.amazon.aws.sensors.s3 import S3KeySensorAsync as S3KeySensor

Slide 28

Slide 28 text

from astronomer.providers.amazon.aws.sensors.s3 import S3KeySensorAsync as S3KeySensor

waiting_for_s3_key = S3KeySensor(
    task_id="waiting_for_s3_key",
    bucket_key="sample_key.txt",
    wildcard_match=False,
    bucket_name="sample-bucket",
)

Slide 29

Slide 29 text

Caveats when building one

Slide 30

Slide 30 text

● More cost benefits when polling takes longer
  We've seen over a 90% reduction in resources for a 10 min wait
● Not everything can be deferred
  It must be an external event/system with a unique identifier
● Triggerer logs are not visible in the Webserver/UI
  Only task logs from workers are displayed in the UI
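The first caveat is simple arithmetic, sketched below under stated assumptions: a sync task holds its worker slot for the whole work-plus-wait period, a deferred task holds it only for the work, and the Triggerer's own cost is ignored. The function name and the 30-second work figure are hypothetical, and the numbers are illustrative only; they are not the measurement quoted on the slide.

```python
def slot_time_freed(work_seconds, wait_seconds):
    """Fraction of worker-slot time freed if the wait is deferred.

    Assumes a sync task holds its slot for work + wait, while a
    deferred task holds it only for the work part.
    """
    sync_total = work_seconds + wait_seconds
    deferred_total = work_seconds
    return (sync_total - deferred_total) / sync_total


# Hypothetical task: 30 seconds of real work on the worker.
for wait in (60, 600, 3600):
    print(f"wait {wait:>4}s -> {slot_time_freed(30, wait):.0%} of slot time freed")
```

The freed fraction grows with the wait, so short waits gain little from being deferred while long ones gain almost everything.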

Slide 31

Slide 31 text

Thank You @kaxil