Apache Airflow: Synchronizing Datasets across Multiple instances

Seb

September 04, 2025
Transcript

  1. Who Am I?
     • Data Engineer @ Numberly
     • Airflow user since 2018
     • French & Mexican
     • EuroPython 2023 Speaker
       ◦ Orchestrating Python Workflows in Apache Airflow
     • Lived in Taiwan for ~5 years
  2. Open Source & Community
     • Apache Foundation project → License & Management
     • Active developers
     • Active community: Slack, Stack Overflow, Reddit, etc.
  3. Open Source & Community
     • Apache Foundation project → License & Management
     • Active developers
     • Active community: Slack, Stack Overflow, Reddit, etc.
     • Global event → Airflow Summit
  4. Fixed Schedule
     Using crontab notation: Minute Hour DayOfMonth Month DayOfWeek
     Example: 0 8 1 * * → On the first day of each month at 8:00 AM
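
To make the five-field notation concrete, here is a minimal sketch that checks whether a point in time matches a crontab expression. It is a simplification, not a full cron parser: each field must be `*` or a single integer, and the function name `cron_matches` is made up for this illustration.

```python
from datetime import datetime

FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def cron_matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` matches a five-field cron expression.

    Simplification: every field must be `*` or a single integer
    (no ranges, lists or steps).
    """
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, got {len(fields)}")

    actual = (
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        (moment.weekday() + 1) % 7,  # cron counts Sunday as 0, Python counts Monday as 0
    )
    return all(f == "*" or int(f) == value for f, value in zip(fields, actual))


# The slide's example: first day of each month at 8:00 AM.
first_of_month = cron_matches("0 8 1 * *", datetime(2025, 9, 1, 8, 0))
second_of_month = cron_matches("0 8 1 * *", datetime(2025, 9, 2, 8, 0))
```

Here `first_of_month` is true and `second_of_month` is false, because only the day-of-month field differs.
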
  5. Fixed Schedule
     • Most commonly used
     • Easy to use
     • Requires extra features or workarounds for some use cases:
       ◦ Daylight Saving Time ✅
       ◦ UTC ✅
       ◦ Holidays ≈
     • Not dynamic
  6. Event Driven Scheduling
     • Execution that follows external events
     • More complex to set up
     • 1 to 1 execution
     • Difficult issues:
       ◦ Order & race conditions
       ◦ Ensuring deliverability
       ◦ Duplicate events
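
One common way to cope with the duplicate-event issue is to make the handler idempotent. The sketch below is only an illustration under assumptions: it supposes every event carries a unique `event_id` (a hypothetical field), and it keeps the seen ids in memory, where a real system would persist them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical unique id assigned by the event source
    payload: str


@dataclass
class IdempotentHandler:
    seen_ids: set = field(default_factory=set)
    processed: list = field(default_factory=list)

    def handle(self, event: Event) -> bool:
        """Process the event once; return False for a redelivered duplicate."""
        if event.event_id in self.seen_ids:
            return False
        self.seen_ids.add(event.event_id)
        self.processed.append(event.payload)
        return True


handler = IdempotentHandler()
deliveries = (
    Event("a1", "file landed"),
    Event("a1", "file landed"),  # same event delivered twice
    Event("b2", "table updated"),
)
results = [handler.handle(event) for event in deliveries]
```

The second delivery of `a1` is ignored, so two payloads are processed for three deliveries. Ordering and delivery guarantees are not addressed by this sketch.
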
  7. Data-Aware Scheduling
     • Execution that follows changes in the data source
     • Execute if the Data Asset has changed
     • Similar to Event Driven but with lower constraints
       ◦ 1 to 1 execution not guaranteed
       ◦ Order of events shouldn’t be important
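
The relaxed constraints can be sketched as a trigger that only remembers which assets have changed since the last run. This is a toy model, not Airflow's scheduler: the class name `DataAwareTrigger` and the URIs are hypothetical, and the "run once every upstream asset has changed" rule is an assumption chosen for the illustration.

```python
class DataAwareTrigger:
    """Run a consumer once every upstream data asset has changed since its last run."""

    def __init__(self, upstream_uris):
        self.upstream_uris = set(upstream_uris)
        self.changed = set()  # assets updated since the last consumer run
        self.runs = 0

    def record_update(self, uri: str) -> bool:
        """Record a change; return True if this update triggered a run."""
        if uri not in self.upstream_uris:
            return False
        self.changed.add(uri)  # repeated updates collapse into a single flag
        if self.changed == self.upstream_uris:
            self.runs += 1
            self.changed.clear()
            return True
        return False


trigger = DataAwareTrigger({"s3://bucket/orders", "s3://bucket/customers"})
for uri in ("s3://bucket/orders", "s3://bucket/orders", "s3://bucket/customers"):
    trigger.record_update(uri)
```

Three updates produce a single run, which is what "1 to 1 execution not guaranteed" means in practice, and delivering the same updates in a different order gives the same run count.
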
  8. Airflow Datasets (Data Assets)
     • It is just a string
     • Represents a Data Asset
     • Maintaining the link is YOUR responsibility
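
Because the reference is only a string, nothing checks that it points at real data or that producer and consumer spell it the same way. The sketch below uses a stand-in `DataAsset` dataclass (hypothetical, not Airflow's own class) and made-up URIs to show the consequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    """Stand-in for a dataset reference: nothing more than a URI string."""

    uri: str


produced = DataAsset("s3://bucket/orders.parquet")  # declared by the producing DAG
consumed = DataAsset("s3://bucket/orders.parquet")  # declared by the consuming DAG
mistyped = DataAsset("s3://bucket/order.parquet")   # typo: silently a different asset

same_asset = produced == consumed
linked_by_typo = produced == mistyped
```

The producer and consumer are linked only because their strings match exactly; the mistyped URI is treated as a separate asset and raises no error.
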
  9. When is it needed?
     • Machine performance
       ◦ High CPU and RAM usage
       ◦ Scheduler heartbeat failures
       ◦ dag_processing.total_parse_time is high
     • Slow scheduling
       ◦ Many tasks in “queued” or “scheduled” state
       ◦ Long queue length
       ◦ Delayed DAG runs
     • Database limitations
  10. Airflow Remote Executors
      Pros:
      • Robust: decoupling workers from the scheduler process
      • Effective: worker parallelization
      • Available: low-latency workers always running
  11. Separate Instances
      Pros:
      • Maximum scalability
      • Separation of concerns
      • Better access control
      Just have 2 Airflows
  12. Separate Instances
      Cons:
      • Even more work
      • Even more resources
      • Redundant operations
      Just have 2 Airflows?
  13. Working Environment
      Older instance
      • Airflow 2.4
      • 1000+ DAGs
      • Frozen
      Younger instance
      • Airflow 2.10
      • Active development
  14. The Hacky Solution
      • 1 DAG per instance
        ◦ Checks for new Data Asset Events
        ◦ Pushes to other instances via API
      • Requires complex error handling
      • N*(N-1) messages exchanged
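
The N*(N-1) figure grows quickly with the number of instances. The sketch below compares it with a hub layout such as the Kafka-based design in the later slides; the 2*N count for the hub (one producer link and one consumer link per instance) is an assumption made for this comparison, not a number from the talk.

```python
def full_mesh_links(instances: int) -> int:
    """Directed pushes when every instance notifies every other one: N*(N-1)."""
    return instances * (instances - 1)


def hub_links(instances: int) -> int:
    """Assumed hub layout: one producer and one consumer link per instance: 2*N."""
    return 2 * instances


# Map each instance count to (full mesh, hub) link counts.
comparison = {n: (full_mesh_links(n), hub_links(n)) for n in (2, 3, 5, 10)}
```

With two instances the mesh is cheaper (2 links against 4); the two layouts break even at three instances, and the hub needs fewer links beyond that (90 against 20 at ten instances).
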
  15. Implementation Candidates
      • Using custom "high-frequency" DAGs
      • Using Airflow Listeners
      • Using an external service via the Airflow API
  17. Implementation details
      • Single source of truth (Kafka)
      • Separation of concerns (Consumer/Producer)
      • Saved offsets
        ◦ Producer: Airflow Variable
        ◦ Consumer: Kafka Offset
  18. Implementation details
      • Single source of truth (Kafka)
      • Separation of concerns (Consumer/Producer)
      • Saved offsets
        ◦ Producer: Airflow Variable
        ◦ Consumer: Kafka Offset
      • Polling done via API
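
The producer/consumer split with saved offsets can be sketched without any external service. Everything here is a stand-in chosen for illustration: a plain list plays the Kafka topic, a dict plays the Airflow Variable store, the event dictionaries imitate what polling the API might return, and events are assumed to arrive sorted by ascending `id`.

```python
from dataclasses import dataclass, field


@dataclass
class Producer:
    """Polls one instance for new dataset events and appends them to a shared log."""

    variables: dict = field(default_factory=dict)  # stand-in for an Airflow Variable store

    def poll(self, instance_events: list, log: list) -> int:
        last_id = self.variables.get("last_event_id", 0)
        new_events = [e for e in instance_events if e["id"] > last_id]
        for event in new_events:
            log.append(event["dataset_uri"])
            self.variables["last_event_id"] = event["id"]  # save progress after each publish
        return len(new_events)


@dataclass
class Consumer:
    """Reads the shared log starting from its own saved offset."""

    offset: int = 0  # stand-in for a committed Kafka consumer offset

    def consume(self, log: list) -> list:
        batch = log[self.offset:]
        self.offset = len(log)
        return batch


log = []
events = [
    {"id": 1, "dataset_uri": "s3://bucket/orders"},
    {"id": 2, "dataset_uri": "s3://bucket/customers"},
]
producer, consumer = Producer(), Consumer()
first_poll = producer.poll(events, log)
second_poll = producer.poll(events, log)  # nothing new, so nothing is re-published
received = consumer.consume(log)
```

Because each side tracks its own position, a restart of either one resumes from the saved offset instead of replaying or losing events.
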
  19. Airflow API issues
      • Airflow 2.4 API does NOT have a POST endpoint for Dataset Events
      • Airflow 2.9 API Dataset Events POST endpoint does NOT create new Datasets
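
For instances that do expose the endpoint, pushing an event is a single HTTP call. The sketch below only builds the request and does not send it. The path `/api/v1/datasets/events`, the `dataset_uri` and `extra` body fields, and basic authentication are assumptions based on the Airflow 2.9+ stable REST API and should be checked against the API reference for the version in use; the host, credentials and `extra` content are made up.

```python
import base64
import json
import urllib.request


def build_dataset_event_request(
    base_url: str, dataset_uri: str, user: str, password: str
) -> urllib.request.Request:
    """Build (but do not send) a request that records a dataset event."""
    body = json.dumps(
        {"dataset_uri": dataset_uri, "extra": {"source": "sync-service"}}
    ).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/datasets/events",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


request = build_dataset_event_request(
    "http://airflow.example.com:8080", "s3://bucket/orders", "sync", "secret"
)
```

Since the endpoint does not create new Datasets, the URI in the body has to match one the receiving instance already knows about.
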
  20. Airflow parsed objects vs created objects
      • Parsed objects
        ◦ DAGs, Tasks, Datasets…
      • Created objects
        ◦ DAG runs, Dataset Events, XCom…
      What happens when your files and your database don’t agree?
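
One way to see where files and database disagree is to compare the two sets of dataset URIs directly. This is a hypothetical helper, not an Airflow feature: `find_drift` and both input sets are invented for the illustration, and collecting the real URIs from DAG files and from the metadata database is left out.

```python
def find_drift(parsed_uris: set, event_uris: set) -> dict:
    """Compare dataset URIs declared in DAG files with URIs referenced by stored events."""
    return {
        "events_without_definition": sorted(event_uris - parsed_uris),
        "definitions_without_events": sorted(parsed_uris - event_uris),
    }


parsed = {"s3://bucket/orders", "s3://bucket/customers"}  # hypothetical: found by parsing files
events = {"s3://bucket/orders", "s3://bucket/invoices"}   # hypothetical: rows in the database
drift = find_drift(parsed, events)
```

An entry under `events_without_definition` is the problematic case for synchronization: an event arrives for a dataset that no parsed file declares.
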
  21. DatasetAlias (Asset Alias)
      Dataset alternative with the following characteristics:
      • Dynamic
      • Can create multiple Datasets at once
      Use if you don’t know which assets will be used at design time
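
The idea of an alias resolved at run time can be modelled in a few lines. This toy `AliasRegistry` is not Airflow's `DatasetAlias` class and makes no claim about how Airflow stores aliases; the alias name and URIs are hypothetical.

```python
from collections import defaultdict


class AliasRegistry:
    """Toy model of an alias that is resolved to concrete datasets at run time."""

    def __init__(self):
        self.datasets = set()             # every concrete dataset URI known so far
        self.resolved = defaultdict(set)  # alias name -> URIs attached during task runs

    def attach(self, alias: str, *uris: str) -> None:
        """Attach one or more URIs to an alias, creating unknown datasets on the fly."""
        for uri in uris:
            self.datasets.add(uri)
            self.resolved[alias].add(uri)


registry = AliasRegistry()
# The alias name is fixed at design time; the URIs are only known when the task runs.
registry.attach(
    "incoming-files",
    "s3://bucket/2025-09-04/a.csv",
    "s3://bucket/2025-09-04/b.csv",
)
```

A single call registers two datasets that did not exist beforehand, which is the property that makes an alias useful when the receiving side cannot declare every dataset in advance.
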
  22. External Service consumer: 2.10, 2.4
      • Airflow API
        ◦ POST dataset/events
      • DatasetAlias DAG
      • Custom DAG
  23. New functionalities
      • New API Endpoint: Materialize asset, solves Asset pushing issues
      • Event Driven Scheduling: New polling mechanism for scheduling
        ◦ AssetWatcher: monitors external event source
      • Cool New UI