Solving Multi-Tenant Challenges: Apache Airflow and Cloud Composer in Action

Posedio

October 24, 2024

Transcript

  1. Slide 2: WHO AM I?

    • Tamas
    • Studied in Budapest & Munich
    • Settled in Vienna
    • Backend developer -> Platform engineer
    • Data Platform 2022+
    • https://speakerdeck.com/posedio/
    • Endurance sports
    • For more, catch me @ Vienna Marathon
    • Owner of two dogs
  2. Slide 5: AGENDA

    1. Airflow 101
    2. Do I need an orchestrator?
    3. Airflow’s architecture (“the cost”)
  3. Slide 6: AGENDA

    1. Airflow 101
    2. Do I need an orchestrator?
    3. Airflow’s architecture (“the cost”)
    4. Cloud Composer
  4. Slide 7: AGENDA

    1. Airflow 101
    2. Do I need an orchestrator?
    3. Airflow’s architecture (“the cost”)
    4. Cloud Composer
    5. Multi-tenancy in Airflow
  6. Slide 10: APACHE AIRFLOW

    • 2015: Airbnb needed a tool to author, iterate on, and monitor batch data pipelines (Article)
    • 2016: Airflow joined the Apache Software Foundation (as an incubating project)
    • 2019: Airflow graduated to a top-level Apache project
    • 2020+: Airflow 2.0 released, growing community
  7. Slide 12: ASTRONOMER

    • 2018: SaaS offering for Airflow called Astro
    • Solid documentation on Airflow
    • Integration of 3rd-party tools (dbt ~ data build tool)
    • Enterprise-tier features for Airflow (multi-tenancy with workspaces etc.)
  8. Slide 17: SESAME INC.

    • Inputs:
        • Recipe
        • Ingredients (~Stock level)
    • Processes
        • Configure mixer
        • Configure oven
        • Configure packaging
        • Start production
    • 23:00
  9. Slide 19: AIRFLOW

    • DAG (directed acyclic graph)
    • Tasks:
        • Fetch recipe and stock
        • Configure mixer
        • Configure oven
        • Start production
        • Monitor packaging
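
The task list above can be sketched as a dependency graph. This is plain standard-library Python, not Airflow's DAG API, and the dependency edges are an assumption (the slide only lists the tasks, not which one waits for which). It shows what "directed acyclic" buys you: a valid run order can always be derived, and a cycle is rejected up front.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# The edges are assumed for illustration: mixer and oven are configured
# independently once recipe and stock are known.
COOKIE_DAG = {
    "fetch_recipe_and_stock": set(),
    "configure_mixer": {"fetch_recipe_and_stock"},
    "configure_oven": {"fetch_recipe_and_stock"},
    "start_production": {"configure_mixer", "configure_oven"},
    "monitor_packaging": {"start_production"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(COOKIE_DAG)
print(order)
```

In a real Airflow DAG the same edges would be declared between operators; the scheduler then decides which tasks are ready to run.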
  11. Slide 21: CRON

    • 0 8 * * * /usr/bin/python start.py
    • Steps:
        • Fetch recipe and stock levels
        • Configure mixer
        • Configure oven
        • Start production
        • Monitor packaging
  12. Slide 22: CATCHUP

    • The machine didn’t run over the weekend => the Sat, Sun, (Mon) cookies are missing.
    • Cron: ???
    • Airflow:
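
A minimal sketch of what catching up means, in plain Python rather than Airflow's scheduler: given the last successful daily run, list every scheduled date that was skipped. The function name and the dates are hypothetical. In Airflow itself this behaviour is controlled by the DAG's `catchup` argument together with its start date and schedule.

```python
from datetime import date, timedelta


def missed_daily_runs(last_successful: date, today: date) -> list[date]:
    """Daily logical dates after `last_successful`, up to and including `today`."""
    days = (today - last_successful).days
    return [last_successful + timedelta(days=n) for n in range(1, days + 1)]


# Hypothetical dates: last good run on a Friday, machine back on Monday.
friday = date(2024, 10, 18)
monday = date(2024, 10, 21)
for logical_date in missed_daily_runs(friday, monday):
    print("catching up run for", logical_date.isoformat())
```

Plain cron has no notion of the runs it missed, which is why the slide leaves its answer as "???".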
  13. Slide 23: BACKFILL

    • The cookies went bad while sitting in the storage room. Can we re-run the batch?
    • Cron: ???
    • Airflow:
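
Re-running a past batch only works cleanly if each run depends on its logical date and not on the current time. The sketch below illustrates that idea in plain Python; `bake_batch`, `backfill`, and the `storage` dict are hypothetical stand-ins, not Airflow code. Airflow 2 exposes the same operation through its `airflow dags backfill` CLI command with a start and end date.

```python
from datetime import date, timedelta


def bake_batch(logical_date: date) -> str:
    """Hypothetical task: derives its output from the logical date, not from 'now'."""
    return f"cookies-{logical_date.isoformat()}"


def backfill(start: date, end: date, storage: dict) -> dict:
    """Re-run every daily batch in [start, end], overwriting earlier results."""
    current = start
    while current <= end:
        # Overwriting by date keeps a repeated backfill idempotent.
        storage[current] = bake_batch(current)
        current += timedelta(days=1)
    return storage


storage = {date(2024, 10, 19): "spoiled", date(2024, 10, 20): "spoiled"}
backfill(date(2024, 10, 19), date(2024, 10, 20), storage)
print(storage)
```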
  14. Slide 25: SENSORS

    • Check at a fixed interval whether a condition is met.
    • timeout
    • poke_interval
    • Types
        • GCS (files)
        • SQL
        • …
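
The two parameters on the slide can be illustrated with a small polling loop. This is a standard-library sketch of the idea, not Airflow's sensor implementation; `wait_for` and `SensorTimeout` are hypothetical names. Airflow's own sensors accept `poke_interval` and `timeout` arguments with the same meaning.

```python
import time
from typing import Callable


class SensorTimeout(Exception):
    """Raised when the condition is not met in time (hypothetical name)."""


def wait_for(condition: Callable[[], bool], poke_interval: float, timeout: float) -> int:
    """Poke `condition` every `poke_interval` seconds; return the number of pokes."""
    started = time.monotonic()
    pokes = 0
    while True:
        pokes += 1
        if condition():
            return pokes
        if time.monotonic() - started >= timeout:
            raise SensorTimeout(f"condition not met within {timeout}s")
        time.sleep(poke_interval)


# Hypothetical condition: the awaited file "appears" on the third check.
attempts = iter([False, False, True])
pokes_needed = wait_for(lambda: next(attempts), poke_interval=0.01, timeout=1.0)
print(pokes_needed)
```

A short `poke_interval` reacts faster but checks the external system more often; the `timeout` keeps a sensor from waiting forever.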
  15. Slide 36: CRON VS AIRFLOW

    • Job monitoring (UI vs cron log)
    • Job overview (graph vs 1000s of lines)
    • Business logic decoupled from scheduling
    • Efficient remote execution (K8s Pods)
    • Comes at the price of having to manage Airflow…
  16. Slide 42: LET’S ONBOARD!

    • Guy with a shady use-case
    • Business Intelligence
    • Finance
    • I don’t even know who you are…
    • Data Analysts
    • Her friends…
  17. Slide 44: LIMITING ACCESS: PER-FOLDER ROLE REGISTRATION

    • rbac_autoregister_per_folder_roles -> True
    • Default role: UserNoDags
    • Airflow now generates roles: finance, da
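
The mapping behind per-folder roles can be sketched as follows: a DAG file's first subfolder under the DAGs directory names the role that may see it. This is an illustration of the idea only, not Cloud Composer's implementation. The helper names, the default DAGs path, and the choice to hide top-level DAGs from folder roles are all assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed location of the DAGs folder on the Airflow workers.
DAGS_ROOT = "/home/airflow/gcs/dags"


def folder_role(dag_file: str, dags_root: str = DAGS_ROOT) -> Optional[str]:
    """Return the first subfolder under `dags_root`, or None for top-level DAGs."""
    relative = PurePosixPath(dag_file).relative_to(dags_root)
    return relative.parts[0] if len(relative.parts) > 1 else None


def can_see(user_roles: set, dag_file: str) -> bool:
    """A user sees a DAG only if they hold the role named after its folder."""
    role = folder_role(dag_file)
    return role is not None and role in user_roles


print(folder_role(f"{DAGS_ROOT}/finance/monthly_report.py"))
print(can_see({"da"}, f"{DAGS_ROOT}/finance/monthly_report.py"))
```

With the default role set to UserNoDags, a newly registered user would see nothing until one of the folder roles is granted.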
  18. Slide 55: LIMITING ACCESS

    1. Ensure DAG lands in correct folder
    2. (Optionally) simulate DAG parsing & loading
    3. Cluster Policies ~> Validate and/or mutate
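
A rough sketch of what "validate and/or mutate" could look like for a DAG-level policy. In Airflow, a cluster policy is a function such as `dag_policy(dag)` defined in `airflow_local_settings.py`, which can raise `AirflowClusterPolicyViolation`. To keep the example runnable without Airflow, `DagInfo` and `PolicyViolation` below are hypothetical stand-ins, and the folder check and tenant tag are assumed rules, not ones taken from the talk.

```python
from dataclasses import dataclass, field


class PolicyViolation(Exception):
    """Stand-in for the error a real cluster policy would raise."""


@dataclass
class DagInfo:
    """Hypothetical stand-in for the DAG attributes a policy inspects."""
    dag_id: str
    fileloc: str
    tags: set = field(default_factory=set)


ALLOWED_FOLDERS = {"finance", "da"}    # tenant folders named on the earlier slide
DAGS_ROOT = "/home/airflow/gcs/dags/"  # assumed DAGs folder


def dag_policy(dag: DagInfo) -> None:
    """Validate the DAG's folder, then mutate it by tagging it with its tenant."""
    if not dag.fileloc.startswith(DAGS_ROOT):
        raise PolicyViolation(f"{dag.dag_id}: file is outside {DAGS_ROOT}")
    tenant = dag.fileloc[len(DAGS_ROOT):].split("/")[0]
    if tenant not in ALLOWED_FOLDERS:
        raise PolicyViolation(f"{dag.dag_id}: '{tenant}' is not a tenant folder")
    dag.tags.add(tenant)  # mutation: every accepted DAG carries its tenant tag


report = DagInfo("monthly_report", DAGS_ROOT + "finance/monthly_report.py")
dag_policy(report)
print(report.tags)
```

Because a policy runs when DAGs are loaded, a rejected DAG never reaches the scheduler, which makes it a natural last check after the folder and parsing steps.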
  19. Slide 56: WRAPPING UP!

    • If you find yourself hand-crafting orchestration logic, consider frameworks (Airflow, Dagster, Prefect)
    • To reduce OpEx, consider managed solutions (Cloud Composer)
    • Look into the K8sPodOperator (KubernetesPodOperator)
    • Avoid YAML-hell
    • Use hooks and cherish the joy of Python