Apache Airflow: Workflow orchestration without turning cron into spaghetti

Transcript

  1. 0 Apache Airflow: Workflow orchestration without turning cron into spaghetti

    2026-06-05, 148th NearMe Tech Study Session, Cyan Chen
  2. 1 The problem

    1. Pod churn leads to memory leaks on Kubernetes nodes.
    2. The YAML settings of Argo Workflows are hard to control.
    3. We want to support future data pipeline and ML use cases.
  3. 2 What Airflow is

    Airflow is a workflow orchestrator. It lets us define workflows as DAGs: directed acyclic graphs.
    A DAG says:
    • What tasks exist.
    • What order they run in.
    • When they are scheduled.
    • What happens on failure.
    • How to retry, backfill, observe, and operate them.
    Airflow tasks are arranged into DAGs with upstream and downstream dependencies; a task is the basic unit of execution.
    https://airflow.apache.org/docs/apache-airflow/stable/index.html
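The ordering that a DAG encodes can be illustrated without Airflow itself. The sketch below uses Python's standard-library `graphlib` to derive a valid run order from upstream dependencies. The task names (`extract`, `transform`, `load`, `notify`) are hypothetical, and this is only a model of the idea, not a description of how Airflow's scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of upstream tasks it waits for.
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(upstream)
print(order)
```

Because the graph must be acyclic, a dependency loop such as `a -> b -> a` is rejected instead of producing an order, which is the property the "A" in DAG guarantees.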
  4. 3 Core components

    DAG file
      ↓
    parsed by DAG processor / scheduler
      ↓
    creates DAG runs
      ↓
    contains Task instances
      ↓
    executed by Executor / workers / Kubernetes pods
    https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html#airflow-components
  5. 5 XCom: small data between tasks

    XCom is Airflow's built-in way for tasks to pass small pieces of data to each other.
    Tasks are isolated, so XCom is mainly for sharing metadata like IDs, paths, counts, flags, or status.
    Many TaskFlow / operator return values are automatically stored as XComs under the default key return_value.
    XCom is not for large data such as files, logs, big JSON, or dataframes, because the default backend stores it in the Airflow metadata database.
    Takeaway: Use XCom for lightweight task coordination, not as a data transport layer.
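The "pass a reference, not the payload" pattern can be sketched in plain Python. In Airflow the two functions below would typically be `@task` functions, and the returned dict would be the value stored under `return_value`; here they are ordinary functions so the sketch runs without Airflow. The function names, the file name `rows.json`, and the use of a local temporary directory as stand-in storage are all assumptions for illustration (tasks on different workers would need shared storage instead).

```python
import json
import tempfile
from pathlib import Path


def extract(rows, out_dir):
    """Write the bulky payload to storage; return only small metadata."""
    path = Path(out_dir) / "rows.json"
    path.write_text(json.dumps(rows))
    # A small dict like this is the kind of value that suits XCom.
    return {"path": str(path), "row_count": len(rows)}


def load(meta):
    """Downstream step: follow the reference to read the payload itself."""
    rows = json.loads(Path(meta["path"]).read_text())
    if len(rows) != meta["row_count"]:
        raise ValueError("row count does not match the metadata")
    return len(rows)


with tempfile.TemporaryDirectory() as tmp:
    meta = extract([{"id": i} for i in range(1000)], tmp)
    loaded = load(meta)
    print(meta["row_count"], loaded)
```

Only the path and the count cross the task boundary, so the metadata database never has to hold the 1000 rows.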
  6. 6 Scheduling and backfill

    Airflow cares about time.
    A DAG can run:
    • on a cron schedule
    • on a preset like @daily
    • manually
    • from asset/data dependencies
    • through backfill
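Backfill means creating runs for past schedule intervals. As a rough model, assuming a `@daily` schedule, a backfill over a date range yields one run per day. The helper name `daily_logical_dates` is hypothetical, and the inclusive end date is an assumption of this sketch rather than a statement about Airflow's exact boundary behaviour.

```python
from datetime import date, timedelta


def daily_logical_dates(start, end):
    """Yield one date per day in [start, end], both ends inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# One entry per run that a daily backfill over this range would need.
runs = list(daily_logical_dates(date(2026, 1, 1), date(2026, 1, 7)))
print(len(runs), runs[0], runs[-1])
```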
  7. 7 Dynamic task mapping

    Runtime fan-out
    Sometimes we do not know the number of tasks until runtime.
    ArgoWorkflow API
  8. 8 Operators vs TaskFlow

    Two ways to define work
    TaskFlow: good for Python-native logic.
    Operators: good for predefined external actions.
    Use TaskFlow when the task is naturally Python code.
    Use operators when the task is really “ask another system to do something.”
  9. 9 Hands on

    # Install `uv`: https://docs.astral.sh/uv/getting-started/installation/
    mkdir -p ~/playground/airflow/dags
    mkdir -p ~/playground/airflow/.airflow
    cd ~/playground/airflow || exit
    export AIRFLOW_HOME="$PWD/.airflow"
    export AIRFLOW__CORE__DAGS_FOLDER="$PWD/dags"
    export AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_PASSWORDS_FILE="$PWD/.airflow/passwords.json"
    echo '{"admin": "admin"}' > "$PWD/.airflow/passwords.json"
    uvx apache-airflow standalone
  10. 10 Test DAG

    from airflow.sdk import dag, task
    import pendulum


    @dag(
        dag_id="hello_airflow",
        start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
        schedule=None,  # no schedule: the DAG only runs when triggered
        catchup=False,
    )
    def hello_airflow():
        @task
        def say_hello():
            print("Hello from Airflow!")

        say_hello()


    # Calling the decorated function registers the DAG.
    hello_airflow()

    open http://localhost:8080