Airflow: Save Tons of Money by Using Deferrable Operators

Kaxil Naik
September 14, 2022

This talk is from Open Source Summit 2022

Apache Airflow 2.2 introduced deferrable tasks, which use Python's async features.

Airflow sensors and other poll-based operators can be hugely optimized to save tons of money by freeing up worker slots while they poll.
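
The idea of many waits sharing one process can be illustrated without Airflow at all. The sketch below is a minimal, plain-`asyncio` simulation and not Airflow code: `wait_until_ready`, `wait_for_all` and `demo` are hypothetical names, and the "external system" is just a timer. It only shows why waiting asynchronously is cheap: a thousand simulated pollers run on a single thread, because each one yields control while it sleeps.

```python
import asyncio
import time


async def wait_until_ready(name, ready_after, poke_interval=0.01):
    """Poll a simulated external system until it reports ready."""
    deadline = time.monotonic() + ready_after
    while time.monotonic() < deadline:
        # Awaiting hands control back to the event loop, so other
        # waiters make progress on the same thread in the meantime.
        await asyncio.sleep(poke_interval)
    return name


async def wait_for_all(count, ready_after):
    """Run `count` simulated waits concurrently on one event loop."""
    waits = [wait_until_ready(f"job-{i}", ready_after) for i in range(count)]
    return await asyncio.gather(*waits)


def demo(count=1000, ready_after=0.05):
    """Return (number of finished waits, wall-clock seconds taken)."""
    start = time.monotonic()
    finished = asyncio.run(wait_for_all(count, ready_after))
    return len(finished), time.monotonic() - start


if __name__ == "__main__":
    finished, elapsed = demo()
    print(f"{finished} waits finished in {elapsed:.2f}s on a single thread")
```

Run sequentially with blocking sleeps, the same thousand waits would take roughly `count * ready_after` seconds and would each hold a thread (or, in Airflow terms, a worker slot) for the whole time.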

This session will cover the following topics:

- Introduction to the concept of deferrable operators
- Why do we need them?
- When to use them?
- How does it work?
- Writing custom deferrable operators & sensors

Transcript

  1. Airflow: Deferrable “Async” Operators. How you can save tons of money!! Kaxil Naik, Open Source Summit 2022
  2. Who am I? • Committer & PMC Member of Apache Airflow • Director of Airflow Engineering @ Astronomer • @kaxil
  3. • Why deferrable operators? The problem around current operators & sensors • What are they? And how do they work? • Available async operators: how to find & use the available async operators
  4. [Diagram: timeline across Scheduler and Worker] Sensor: wait for files to arrive in S3 ... Arrived! Operator: submit Spark Job, poll Spark cluster, Job Completion. Both waits are marked “Wasted resources”. What a waste!
  5. [Same diagram as slide 4]
  6. Multiple Sensors [Diagram: timeline of concurrent waits, each ending in “Done!”]: polling for Spark Job completion (twice), wait for files to arrive in S3 (twice), polling for BigQuery Job completion, wait for files to arrive in GCS. Imagine the cost! Multiple worker slots blocked!!
  7. [Diagram: timeline across Scheduler, Worker and Triggerer] Async Operator: submit Spark Job (Worker), free slot, poll Spark cluster (Triggerer), Job Completion.
  8. [Diagram: Worker Slot (Sync vs Async)] Sync Operator: Task 1, Task 2. Async Operator: Task 1, Task 2, Task 3, Task 2, Task 1, Task 4.
  9. [Same diagram as slide 7] Trigger: a new concept
  10. • Task runs on the Worker, then “defers” itself. Example: submit an API call and store the job_id in the DB
    • Runs on the Triggerer: async polling until the criteria are met, then stores the response in the DB
    • Back to the Worker: to show the response in the logs and set the Task state
  11. Triggers:
    • Must be asynchronous & quick, so the Triggerer can run thousands of them per CPU core
    • Should not persist state, so we can shuffle them around between Triggerers as needed
    • Must support running multiple copies of themselves, for reliability during network partitions
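  12. A trigger that fires once a given moment has passed:

        import asyncio
        import datetime

        from airflow.triggers.base import BaseTrigger, TriggerEvent
        from airflow.utils import timezone


        class DateTimeTrigger(BaseTrigger):
            def __init__(self, moment: datetime.datetime):
                super().__init__()
                self.moment = moment

            def serialize(self):
                # Classpath plus kwargs, so a Triggerer can re-create the trigger.
                return ("mymodule.DateTimeTrigger", {"moment": self.moment})

            async def run(self):
                # Sleep without blocking the event loop until the moment has passed.
                while self.moment > timezone.utcnow():
                    await asyncio.sleep(1)
                yield TriggerEvent(self.moment)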
  13. Astronomer Providers (https://github.com/astronomer/astronomer-providers) • 50+ async operators • Built and maintained with ❤ by Astronomer • Apache 2 licensed & open source • Drop-in replacements for “sync” operators • OpenLineage support
  14. Astronomer Providers (https://github.com/astronomer/astronomer-providers) Async operators available for: • AWS, Google Cloud, Microsoft Azure • Databricks, Snowflake • Kubernetes, Apache Livy, Apache Hive • HTTP, Filesystem

        from astronomer.providers.amazon.aws.sensors.s3 import S3KeySensorAsync as S3KeySensor
  15. • More cost benefits when polling takes a longer time: we've seen over a 90% reduction in resources for a 10 min wait
    • Not everything can be deferred: it must be an external event/system with a unique identifier
    • Triggerer logs are not visible in the Webserver/UI: only task logs from workers are displayed in the UI