
Posedio
October 24, 2024

Solving Multi-Tenant Challenges: Apache Airflow and Cloud Composer in Action

Join us to explore the world of cloud data platforms and see how Apache Airflow, an open-source workflow orchestrator, fits into the broader ecosystem. We'll look at the challenges we encountered while managing Airflow in a multi-tenant environment and share how Cloud Composer, Google's managed Apache Airflow solution, helped us overcome some of those hurdles. Finally, we will briefly touch upon some of Airflow's most exciting features, which allowed our tenants to create and schedule workflows efficiently and in a standardized way.

Transcript

  1. WHO AM I? (Tamas)
    • Studied in Budapest & Munich
    • Settled in Vienna
    • Backend developer -> Platform engineer
    • Data Platform 2022+
    • https://speakerdeck.com/posedio/
    • Endurance sports
    • For more, catch me at the Vienna Marathon
    • Owner of two dogs
  2. AGENDA
    1. Airflow 101
    2. Do I need an orchestrator?
    3. Airflow’s architecture (“the cost”)
  3. AGENDA
    1. Airflow 101
    2. Do I need an orchestrator?
    3. Airflow’s architecture (“the cost”)
    4. Cloud Composer
  4. AGENDA
    1. Airflow 101
    2. Do I need an orchestrator?
    3. Airflow’s architecture (“the cost”)
    4. Cloud Composer
    5. Multi-tenancy in Airflow
  5. AGENDA
    1. Airflow 101
    2. Do I need an orchestrator?
    3. Airflow’s architecture (“the cost”)
    4. Cloud Composer
    5. Multi-tenancy in Airflow
  6. APACHE AIRFLOW
    • 2015: Airbnb needed a tool to author, iterate on, and monitor batch data pipelines (Article)
    • 2016: Airflow joined the Apache Software Foundation as an incubating project
    • 2019: Airflow graduated to a top-level Apache project
    • 2020+: Airflow 2.0 released, growing community
  7. ASTRONOMER
    • 2018: SaaS offering for Airflow called Astro
    • Solid documentation on Airflow
    • Integration of 3rd-party tools (dbt ~ data build tool)
    • Enterprise-tier features for Airflow (multi-tenancy with workspaces etc.)
  8. SESAME INC.
    • Inputs:
      • Recipe
      • Ingredients (~stock level)
    • Processes:
      • Configure mixer
      • Configure oven
      • Configure packaging
      • Start production
    • 23:00
  9. AIRFLOW
    • DAG (directed acyclic graph)
    • Tasks:
      • Fetch recipe and stock
      • Configure mixer
      • Configure oven
      • Start production
      • Monitor packaging
  10. AIRFLOW
    • DAG (directed acyclic graph)
    • Tasks:
      • Fetch recipe and stock
      • Configure mixer
      • Configure oven
      • Start production
      • Monitor packaging
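
The DAG on this slide can be sketched in plain Python to show what "directed acyclic graph" buys you: a run order derived from declared dependencies. This is only a model of the idea using the standard library, not Airflow's API, and the dependency edges between the tasks are an assumption, since the slide lists the tasks but not their wiring.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it may start.
# ASSUMPTION: these edges are illustrative; the slide only names the tasks.
dependencies = {
    "fetch_recipe_and_stock": set(),
    "configure_mixer": {"fetch_recipe_and_stock"},
    "configure_oven": {"fetch_recipe_and_stock"},
    "start_production": {"configure_mixer", "configure_oven"},
    "monitor_packaging": {"start_production"},
}


def execution_order(deps):
    """Return one valid run order; graphlib raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the mixer and the oven depend only on the fetch step, an orchestrator is free to configure them in parallel, which a single sequential script cannot express.
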
  11. CRON
    • 0 8 * * * /usr/bin/python start.py
    • Steps:
      • Fetch recipe and stock levels
      • Configure mixer
      • Configure oven
      • Start production
      • Monitor packaging
  12. CATCHUP
    • The machine didn’t work over the weekend => the Sat, Sun, (Mon) cookies are missing.
    • Cron: ???
    • Airflow: catchup
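
The catchup idea can be modelled as "find every scheduled run between the last success and now, then execute each one". In Airflow this behaviour is governed by the DAG's `catchup` setting; the sketch below is a plain-Python illustration under the assumption of a daily schedule, and the function name and dates are hypothetical.

```python
from datetime import date, timedelta


def missed_runs(last_successful_run: date, today: date) -> list[date]:
    """List every daily run date after the last success, up to and including today."""
    days = (today - last_successful_run).days
    return [last_successful_run + timedelta(days=n) for n in range(1, days + 1)]


# Hypothetical scenario: the last batch ran on Friday 2024-10-18 and the
# machine comes back on Monday 2024-10-21.
pending = missed_runs(date(2024, 10, 18), date(2024, 10, 21))
for run_date in pending:
    print(f"producing cookies for {run_date.isoformat()}")
```

With cron, the missed invocations are simply gone, and someone has to work out the missing dates and re-run the script by hand.
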
  13. BACKFILL
    • The cookies went bad while sitting in the storage room. Can we re-run the batch?
    • Cron: ???
    • Airflow: backfill
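
Re-running a past batch is only safe when each run is keyed by its date and overwrites its own output. The following is a minimal sketch of that property in plain Python, not Airflow's backfill mechanism; the `batches` store, the `attempt` counter, and the dates are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical store: exactly one batch per production date.
batches: dict[date, str] = {}


def produce(run_date: date, attempt: int) -> None:
    """Idempotent per-date task: re-running a date overwrites that date's batch."""
    batches[run_date] = f"batch-{run_date.isoformat()}-attempt-{attempt}"


def backfill(start: date, end: date, attempt: int) -> None:
    """Re-run every date in the inclusive range [start, end]."""
    current = start
    while current <= end:
        produce(current, attempt)
        current += timedelta(days=1)


# Regular production for three days.
backfill(date(2024, 10, 14), date(2024, 10, 16), attempt=1)
# The batch from the 15th went bad: re-run only that date.
backfill(date(2024, 10, 15), date(2024, 10, 15), attempt=2)
print(batches)
```
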
  14. SENSORS
    • Check at a fixed interval whether a condition is met
      • timeout
      • poke_interval
    • Types:
      • GCS (files)
      • SQL
      • …
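
What a sensor does with `poke_interval` and `timeout` can be shown as a small polling loop. This is a simplified plain-Python model rather than Airflow's sensor implementation: `wait_for` and `file_has_arrived` are hypothetical names, and the sketch returns `False` on timeout for brevity, whereas an Airflow sensor fails the task instead.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Poll `condition` every `poke_interval` seconds; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        # Stop if the next poke would land after the deadline.
        if time.monotonic() + poke_interval > deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical condition: the file "arrives" on the third poke.
pokes = {"count": 0}


def file_has_arrived() -> bool:
    pokes["count"] += 1
    return pokes["count"] >= 3


arrived = wait_for(file_has_arrived, poke_interval=0.01, timeout=1.0)
print(arrived, pokes["count"])
```
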
  15. CRON VS AIRFLOW
    • Job monitoring (UI vs cron log)
    • Job overview (graph vs 1000s of lines)
    • Business logic is decoupled from scheduling
    • Efficient remote execution (K8s Pods)
    • Comes at the price of having to manage Airflow…
  16. LET’S ONBOARD!
    • Guy with a shady use case
    • Business Intelligence
    • Finance
    • I don’t even know who you are…
    • Data Analysts
    • Her friends…
  17. LIMITING ACCESS: PER-FOLDER ROLE REGISTRATION
    • rbac_autoregister_per_folder_roles -> True
    • Default role: UserNoDags
    • Airflow now generates roles: finance, da
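
The effect of per-folder role registration can be pictured as a mapping from a DAG file's location to a role name. The sketch below assumes that roles are named after the first-level subfolder of the DAGs folder, which is consistent with the `finance` and `da` roles on the slide; `folder_role`, the `dags` root, and the file names are hypothetical and do not reproduce how Cloud Composer implements the feature.

```python
from pathlib import PurePosixPath
from typing import Optional


def folder_role(dag_file: str, dags_root: str = "dags") -> Optional[str]:
    """Return the first-level subfolder under `dags_root`, or None for top-level files."""
    relative = PurePosixPath(dag_file).relative_to(dags_root)
    # A file directly in the DAGs root has a single path part and no folder role.
    return relative.parts[0] if len(relative.parts) > 1 else None


print(folder_role("dags/finance/monthly_report.py"))
print(folder_role("dags/da/exploration/adhoc.py"))
print(folder_role("dags/shared.py"))
```

Under this model a user holding only the `finance` role sees the DAGs below `dags/finance/`, while a newly registered user with the default `UserNoDags` role sees none.
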
  18. LIMITING ACCESS
    1. Ensure the DAG lands in the correct folder
    2. (Optionally) simulate DAG parsing & loading
    3. Cluster Policies ~> validate and/or mutate
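
A cluster policy is a hook that sees every DAG at load time and may reject or modify it. In Airflow such a hook (for example `dag_policy`) lives in `airflow_local_settings.py` and signals a rejection with `AirflowClusterPolicyViolation`. To stay runnable without Airflow, the sketch below uses stand-ins: `DagInfo`, `PolicyViolation`, the `/dags` root, and the "DAG id must start with the tenant name" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

DAGS_ROOT = PurePosixPath("/dags")  # ASSUMPTION: location of the DAG folder


class PolicyViolation(Exception):
    """Stand-in for the exception Airflow uses to reject a DAG."""


@dataclass
class DagInfo:
    """Hypothetical stand-in for the DAG object a policy receives."""
    dag_id: str
    fileloc: str
    tags: set[str] = field(default_factory=set)


def dag_policy(dag: DagInfo) -> None:
    """Validate that the DAG belongs to a tenant folder, then tag it with the tenant."""
    relative = PurePosixPath(dag.fileloc).relative_to(DAGS_ROOT)
    if len(relative.parts) < 2:
        raise PolicyViolation(f"{dag.dag_id}: DAG file must live in a tenant folder")
    tenant = relative.parts[0]
    # Validate: a hypothetical naming rule that ties the DAG id to its tenant.
    if not dag.dag_id.startswith(f"{tenant}_"):
        raise PolicyViolation(f"{dag.dag_id}: DAG id must start with '{tenant}_'")
    # Mutate: make the owning tenant visible on the DAG itself.
    dag.tags.add(tenant)


accepted = DagInfo("finance_monthly_report", "/dags/finance/monthly_report.py")
dag_policy(accepted)
print(accepted.tags)
```
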
  19. WRAPPING UP!
    • If you find yourself hand-crafting orchestration logic, consider frameworks (Airflow, Dagster, Prefect)
    • To reduce OpEx, consider managed solutions (Cloud Composer)
    • Look into the K8sPodOperator (KubernetesPodOperator)
    • Avoid YAML-hell
    • Use hooks and cherish the joy of Python