Lost Jobs, Zombie Tasks and AirFlow Nightmares: A debugging Deep Dive

Posedio

March 12, 2025

Transcript

  1. Do it RIGHT. TAMAS, MEET THE AUDIENCE!

    Have you ever…
    • Heard of Apache Airflow?
    • Worked with Apache Airflow?
    • Managed Airflow instances and infrastructure?
    • Dug deep into the system, spending hours or even days tracking down issues?
  6. LET’S EMBARK ON A JOURNEY!

    • Brief introduction to Apache Airflow
    • Airflow’s Architecture
    • Execution environments
    • Managed solutions
    • Outage & root cause analysis
  7. APACHE AIRFLOW: Key Concepts

    • Tasks
      o Single unit of work (e.g., extract/transform/load data)
      o Can be chained (task_A >> task_B)
    • Directed Acyclic Graph (DAG)
      o Scheduled chain of tasks, with (optional) inputs.
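
The chaining syntax and the "acyclic graph" idea can be illustrated without Airflow installed. The following is a minimal sketch, not Airflow's implementation: the `Task` class and `execution_order` helper are hypothetical stand-ins, and the ordering relies on Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter


class Task:
    """Hypothetical stand-in for an Airflow task; not the real operator class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()  # task_ids that must finish before this task

    def __rshift__(self, other):
        # task_a >> task_b records task_a as an upstream dependency of task_b.
        other.upstream.add(self.task_id)
        return other  # returning the right-hand side allows a >> b >> c


def execution_order(tasks):
    """Return task_ids in an order that respects every dependency."""
    graph = {task.task_id: task.upstream for task in tasks}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


extract = Task("extract")
transform = Task("transform")
load = Task("load")
extract >> transform >> load

order = execution_order([extract, transform, load])
print(order)
```

Because the graph must be acyclic, a valid execution order always exists; a cycle would make the sorter raise instead of returning an order.
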
  8. AIRFLOW’S ARCHITECTURE: Overview

    • User
    • DAG folder
    • Scheduler & Executor
    • Web server
    • Metadata database
  9. AIRFLOW’S ARCHITECTURE: 1. Scheduler (I.)

    • What happens when you upload your DAG?
    [Diagram: a DAG containing Task 1 and Task 2]
  10. AIRFLOW’S ARCHITECTURE: 3. Scheduler (II.)

    • What happens when you click “Start”?
      o The scheduler creates a DagRun object.
      o It analyses task dependencies and creates TaskInstance(s) in the scheduled state.
      o It serializes the TaskInstance(s) and submits them to the Executor.
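
The three scheduler steps can be sketched in plain Python. This is an illustration under stated assumptions, not Airflow code: the `DagRun` and `TaskInstance` dataclasses only borrow the names from the slide, `RecordingExecutor` and `trigger` are hypothetical, JSON stands in for whatever serialization is really used, and flipping the state to `queued` at hand-off is a simplification made for this sketch.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskInstance:
    dag_id: str
    task_id: str
    run_id: str
    state: str = "scheduled"  # every new TaskInstance starts out scheduled


@dataclass
class DagRun:
    dag_id: str
    run_id: str
    task_instances: list = field(default_factory=list)


class RecordingExecutor:
    """Hypothetical executor that only records what it was handed."""

    def __init__(self):
        self.queued_tasks = {}

    def queue(self, payload):
        ti = json.loads(payload)
        self.queued_tasks[(ti["dag_id"], ti["task_id"], ti["run_id"])] = ti


def trigger(dag_id, run_id, dependencies, executor):
    """dependencies maps each task_id to the set of its upstream task_ids."""
    run = DagRun(dag_id, run_id)  # step 1: create the DagRun
    for task_id in dependencies:  # step 2: one TaskInstance per task
        run.task_instances.append(TaskInstance(dag_id, task_id, run_id))
    for ti in run.task_instances:  # step 3: serialize and submit ready tasks
        if not dependencies[ti.task_id]:
            ti.state = "queued"
            executor.queue(json.dumps(asdict(ti)))
    return run


executor = RecordingExecutor()
run = trigger("myDAG", "manual__1", {"task_1": set(), "task_2": {"task_1"}}, executor)
print([(ti.task_id, ti.state) for ti in run.task_instances])
```

Only `task_1` is handed over here, since `task_2` still waits on its upstream task and stays in the scheduled state.
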
  11. AIRFLOW’S ARCHITECTURE: 4. Metadata database (SQLite, Postgres etc.)

    • dag table
    • dag_run table
    • task_instance table
  12. AIRFLOW’S ARCHITECTURE: 5. Executor

    • An executor in Airflow is an interface that determines how and where tasks will be executed, serving as the bridge between the scheduler and the actual task execution environment.
  13. AIRFLOW’S ARCHITECTURE: Let’s get some fresh air. (Recap)

    • The user uploads the DAG to the file system.
    • Scheduler
      o Parses and serializes the DAG, and persists it in the DB.
      o Identifies DAGs to be executed, breaks them down into TaskInstances, and queues them for execution.
      o Orders the executor to execute the tasks.
    • Monitor execution via the web server.
  14. AIRFLOW’S ARCHITECTURE: 5. Executors – Take your pick

    • An executor in Airflow is an interface that determines how and where tasks will be executed.
  15. AIRFLOW’S ARCHITECTURE: 5. Executor

    • An executor in Airflow is an interface that determines how and where tasks will be executed.
    • An Airflow executor must implement these key components:
      1. Task Queue Management: Accept and manage task instances from the scheduler
      2. Task Execution: Run tasks either directly or by delegating to workers
      3. State Tracking: Update and report the state of tasks back to the scheduler
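
The three responsibilities map naturally onto an abstract interface. The sketch below is an assumption-laden illustration: the method names `enqueue`, `run_pending` and `drain_states` are invented for this example and are not Airflow's executor API, and `InlineExecutor` simply runs each task in the calling process.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical interface; method names are illustrative, not Airflow's."""

    @abstractmethod
    def enqueue(self, key, task):
        """Task queue management: accept a task from the scheduler."""

    @abstractmethod
    def run_pending(self):
        """Task execution: run tasks directly or delegate them to workers."""

    @abstractmethod
    def drain_states(self):
        """State tracking: report finished task states back to the scheduler."""


class InlineExecutor(Executor):
    """Runs every task in the calling process, one after another."""

    def __init__(self):
        self._pending = []
        self._states = {}

    def enqueue(self, key, task):
        self._pending.append((key, task))

    def run_pending(self):
        while self._pending:
            key, task = self._pending.pop(0)
            try:
                task()
                self._states[key] = "success"
            except Exception:
                self._states[key] = "failed"

    def drain_states(self):
        # Hand over the collected states and start with an empty record.
        states, self._states = self._states, {}
        return states


executor = InlineExecutor()
executor.enqueue("ok", lambda: None)
executor.enqueue("boom", lambda: 1 / 0)
executor.run_pending()
states = executor.drain_states()
print(states)
```

Swapping `InlineExecutor` for an implementation that delegates to worker processes or a remote queue would leave the scheduler-facing calls unchanged, which is the point of treating the executor as an interface.
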
  16. AIRFLOW’S ARCHITECTURE: 5. Executor: LocalExecutor

    • Maintains a pool of worker processes. The “dirty work” is executed on these.
    • The scheduler pushes tasks to the queued_tasks dictionary.
    • The executor then periodically checks it and processes the tasks, which are moved to the running set.
    • Results are written to the event_buffer.
    • The scheduler fetches task states from the event_buffer. The buffer is then cleared.
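
The queued_tasks, running and event_buffer hand-off described above can be mimicked with the standard library. This is a toy sketch rather than the LocalExecutor source: `ToyLocalExecutor` and its method names are assumptions, a thread pool stands in for the pool of worker processes, and `running` is a dictionary of futures instead of a set so that results can be collected.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyLocalExecutor:
    """Toy model of the flow on the slide; not Airflow's LocalExecutor."""

    def __init__(self, parallelism=2):
        self.pool = ThreadPoolExecutor(max_workers=parallelism)
        self.queued_tasks = {}  # key -> callable, filled by the scheduler
        self.running = {}       # key -> Future for in-flight work
        self.event_buffer = {}  # key -> final state, read by the scheduler

    def queue_task(self, key, fn):
        self.queued_tasks[key] = fn

    def heartbeat(self):
        # Periodic check: move queued work onto the pool ...
        for key in list(self.queued_tasks):
            self.running[key] = self.pool.submit(self.queued_tasks.pop(key))
        # ... and record the outcome of anything that has finished.
        for key, future in list(self.running.items()):
            if future.done():
                del self.running[key]
                failed = future.exception() is not None
                self.event_buffer[key] = "failed" if failed else "success"

    def get_event_buffer(self):
        # The scheduler reads the buffer, which is cleared afterwards.
        events, self.event_buffer = self.event_buffer, {}
        return events

    def shutdown(self):
        self.pool.shutdown(wait=True)


executor = ToyLocalExecutor()
executor.queue_task("task_1", lambda: "done")
executor.heartbeat()  # submits task_1 to the pool
executor.shutdown()   # waits until the worker has finished
executor.heartbeat()  # collects the result into the event buffer
events = executor.get_event_buffer()
print(events)
```
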
  17. AIRFLOW’S ARCHITECTURE: 5. Executor: CeleryExecutor

    • Celery is a distributed task queue, essentially an abstraction layer that implements message broker functionality on top of Redis.
    • Redis is “split” into two parts
      o QueueBroker: This is where tasks wait to be picked up by workers
      o ResultBackend: This allows tasks to return values
    • The scheduler periodically queries the ResultBackend for the status of the task.
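
The split between a queue broker and a result backend can be simulated in-process. The sketch below does not use Celery or Redis: a `queue.Queue` stands in for the broker, a dictionary for the result backend, and `worker` and `poll` are hypothetical helpers. Only the state labels (`PENDING`, `SUCCESS`, `FAILURE`) are borrowed from Celery's vocabulary.

```python
import queue
import threading

broker = queue.Queue()   # stands in for the Redis-backed queue broker
result_backend = {}      # stands in for the Redis-backed result backend
lock = threading.Lock()  # guards result_backend across threads


def worker():
    """Pick tasks off the broker and store each outcome in the backend."""
    while True:
        item = broker.get()
        if item is None:  # sentinel value: stop the worker
            break
        task_id, fn, args = item
        try:
            outcome = ("SUCCESS", fn(*args))
        except Exception as exc:
            outcome = ("FAILURE", repr(exc))
        with lock:
            result_backend[task_id] = outcome


def poll(task_id):
    """What the scheduler does periodically: ask the backend for a status."""
    with lock:
        return result_backend.get(task_id, ("PENDING", None))


broker.put(("add-1", lambda a, b: a + b, (2, 3)))
before = poll("add-1")  # no worker has picked the task up yet

thread = threading.Thread(target=worker)
thread.start()
broker.put(None)
thread.join()

after = poll("add-1")
print(before, after)
```

The scheduler never talks to the worker directly in this model; it only sees what the worker wrote to the result backend, so a worker that dies before writing leaves the task looking unfinished.
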
  18. AIRFLOW’S ARCHITECTURE: Recap

    • The user uploads the DAG to the file system.
    • Scheduler
      o Parses and serializes the DAG, and persists it in the DB.
      o Identifies DAGs to be executed, breaks them down into TaskInstances, and queues them for execution.
    • The Celery queue is backed by Redis (QueueBroker and ResultBackend).
    • The scheduler monitors the ResultBackend and, as tasks are executed, persists data to the metadata database.
  19. EXECUTION ENVIRONMENTS: Bash & Python Operators

    • BashOperator: Executes bash command/script…
    • PythonOperator: Executes Python script…
  20. KUBERNETES

    “Give a platform engineer Kubernetes, and suddenly, every problem starts looking like a cluster.”
  21. EXECUTION ENVIRONMENTS: KubernetesPodOperator

    • KubernetesPodOperator: Executes a Docker image on a K8s cluster.
      o When (Docker) images are executed as K8s pods, the pod is created on the cluster.
      o The worker pod (celery-worker) then monitors the executed (workload) pod.
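
The "worker monitors the workload pod" step amounts to a polling loop. The sketch below is a simplified illustration and does not call the Kubernetes API or the operator itself: `wait_for_pod` and its parameters are hypothetical, `read_phase` is any callable returning the pod's current phase, and the `TimedOut` label is invented for this example. The phase names themselves are standard Kubernetes pod phases.

```python
import time

TERMINAL_PHASES = {"Succeeded", "Failed"}  # Kubernetes pod phases that end a run


def wait_for_pod(read_phase, max_polls=10, interval=0.0):
    """Poll read_phase() until the pod finishes or max_polls is reached."""
    for _ in range(max_polls):
        phase = read_phase()
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(interval)
    return "TimedOut"  # hypothetical label, not a Kubernetes phase


# Simulated phase history of one workload pod.
phases = iter(["Pending", "Running", "Running", "Succeeded"])
final_phase = wait_for_pod(lambda: next(phases))
print(final_phase)
```

The task's fate in Airflow depends on this monitoring loop staying alive for as long as the workload pod runs.
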
  22. GOOGLE CLOUD COMPOSER: With GKE Autopilot for execution

    • GKE Autopilot enables dynamic resource allocation
      o Users can request resources for their workloads.
    • Machine types
      o High
      o Mega
      o Ultra
      o (8 GPU, 1128 GB vRAM)
      o (2952 GB RAM, 224 vCPU)
  24. ANOTHER DAY, ANOTHER FEATURE RELEASE

    • Context
      o We (the platform team) provide DAG templates to the data-analyst teams.
      o Added a minor feature (“GCS Export”) using a plain PythonOperator.
    • Rolled out to “test team” ☑
    • Rolled out to all teams…
  25. INVESTIGATION: A zombie task is…

    • A TaskInstance stuck in the running state while the associated job is inactive.
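
The definition can be turned into a small check: a task counts as a zombie when its state says running but its job has stopped reporting in. This is a sketch under assumptions, not Airflow's detection logic: `find_zombies` is hypothetical, "inactive" is modelled as a stale heartbeat timestamp, the five-minute timeout is an illustrative value, and the sample task instances are invented.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(minutes=5)  # assumed threshold, for illustration only


def find_zombies(task_instances, now, timeout=HEARTBEAT_TIMEOUT):
    """Return task_ids that claim to be running but stopped heartbeating."""
    return [
        ti["task_id"]
        for ti in task_instances
        if ti["state"] == "running" and now - ti["last_heartbeat"] > timeout
    ]


# Invented sample data; timestamps are relative to a fixed "now".
now = datetime(2024, 11, 17, 18, 30, tzinfo=timezone.utc)
task_instances = [
    {"task_id": "healthy", "state": "running",
     "last_heartbeat": now - timedelta(seconds=20)},
    {"task_id": "stale", "state": "running",
     "last_heartbeat": now - timedelta(minutes=11)},
    {"task_id": "finished", "state": "success",
     "last_heartbeat": now - timedelta(hours=1)},
]
zombies = find_zombies(task_instances, now)
print(zombies)
```

A finished task with an old heartbeat is not a zombie; only the combination of a running state and a silent job qualifies.
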
  26. INVESTIGATION: Symptoms & Traces

    • I. Job got executed, but turned into a zombie
      o The Kubernetes logs were present, but in Airflow they were cut in half.
      o After 11 minutes, Airflow marks the task as failed (instead of the usual 3-4 min execution time).
  27. INVESTIGATION: Symptoms & Traces

    • II. Job started from Airflow’s perspective (scheduler queued it), but then completely disappeared from all systems*

    Running <TaskInstance: myDAG.dbt_test scheduled__2024-11-17T18:00:00+00:00 [queued]> on host airflow-worker-l4b9h

    * In hindsight, hacking myself into the DB and checking the tables would have been interesting.
  28. INVESTIGATION: Symptoms & Traces

    • The issue occurred on our Cloud Composer instance.
    • Investigation revealed the existence of a worker-controller.
  29. INVESTIGATION: The worker controller

    • The worker controller:
      o Evaluates the worker’s health. If it is overutilized (CPU/RAM), spins up another one.
    • However, the worker must answer a riddle…
  31. RTFM

    One might find that a thorough study of the documentation would greatly illuminate the matter at hand.
  32. IT WAS A TOUGH DAY…

    • Tools come at a cost - and that is complexity.
      o Want to learn more about the Celery Executor?
      o Read the source code
      o S. PARK (Korean dude) blog post*
      o Learn more about the worker-controller?
      o Learn more about worker health checks?

    * Hint: It’s in Korean…
  33. IT WAS A TOUGH DAY…

    • In the right hands, they can shine
      o Composer + GKE Autopilot
    • Consider adding the official documentation to your bedside reading list.