Slide 1

Slide 1 text

Orchestrating the Future: Navigating Today's Data Workflow Challenges with Airflow and Beyond
Budapest Data + ML Forum, June 2024

Slide 2

Slide 2 text

Kaxil Naik
Apache Airflow Committer & PMC Member
Senior Director of Engineering @ Astronomer
@kaxil

Slide 3

Slide 3 text

Agenda
● Orchestrator – The What & Why?
● What is Apache Airflow?
  ○ Why is Airflow the Industry Standard for Data Professionals?
  ○ Evolution of Airflow
● Today’s Data Workflow Challenges
  ○ How Airflow addresses them – Real world case studies
● The Future of Airflow

Slide 4

Slide 4 text

Orchestrator: The What & Why?

Slide 5

Slide 5 text

What is Orchestration? Who is an Orchestrator?

Slide 6

Slide 6 text

Why Orchestration?

Slide 7

Slide 7 text

Orchestration in Engineering!
● Workflow Orchestrator: Automates and manages interconnected tasks across various systems to streamline complex business processes. E.g., running a Bash script every day to update packages on a laptop.
● Data Orchestrator: Automates and manages interconnected tasks that deal with data across various systems to streamline complex business processes. E.g., ETL for a BI dashboard.
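The core of "interconnected tasks" is that each task may only start once the tasks it depends on have finished. A minimal sketch of that idea, using only the Python standard library rather than Airflow itself; the task names for the BI dashboard ETL are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical ETL tasks for a BI dashboard.
# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}


def run_order(deps):
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

An orchestrator adds scheduling, retries, logging and alerting on top of this ordering; the sketch only covers the dependency part.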

Slide 8

Slide 8 text

What is Apache Airflow?

Slide 9

Slide 9 text

What is Apache Airflow?
A Workflow Orchestrator, most commonly used for Data Orchestration.
Official Definition: A platform to programmatically author, schedule and monitor workflows.

Slide 10

Slide 10 text

Key Features of Airflow
● Python Native: The language of data professionals (Data Engineers & Scientists). DAGs are defined in code, allowing more flexibility & observability of code changes when used with git.
● Pluggable Compute: GPUs, Kubernetes, EC2, VMs, etc.
● Integrates with Toolkit: All data sources, all Python libraries, TensorFlow, SageMaker, MLflow, Spark, Ray, etc.
● Common Interface: Between Data Engineering, Data Science, ML Engineering and Operations.
● Data Agnostic: But data aware.
● Cloud Native: But cloud neutral.
● Monitoring & Alerting: Built-in features for logging, monitoring and alerting to external systems.
● Extensible: Standardize custom operators and templates for common DS tasks across the organization.

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Example DAG

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Why is Airflow the Industry Standard for Data Professionals?

Slide 15

Slide 15 text

The Community
● 25M Monthly Downloads
● 2.9K Contributors
● 35K GitHub Stars
● 47K Slack Community

Slide 16

Slide 16 text

Under …

Slide 17

Slide 17 text

Governed by
● 62 Committers
● 33 PMC (Project Management Committee) Members

Slide 18

Slide 18 text

Integrations And ……

Slide 19

Slide 19 text

90+ Providers

Slide 20

Slide 20 text

Docker Image
docker pull apache/airflow

Slide 21

Slide 21 text

Helm Chart
helm repo add apache-airflow https://airflow.apache.org/
helm install my-airflow apache-airflow/airflow

Slide 22

Slide 22 text

Conference & Meetups
Attendees:
● Online Edition (2020-2022): 10k
● In-person (2023+): 500+
15 Local Groups across the globe with 11k members

Slide 23

Slide 23 text

Managed Airflow Vendors

Slide 24

Slide 24 text

Airflow Survey and State of Apache Airflow report
Infographic: https://airflow.apache.org/survey/
Report: https://www.astronomer.io/state-of-airflow/

Slide 25

Slide 25 text

Use cases for Airflow
Source: 2023 Apache Airflow Survey, n=797
● Ingestion and ETL/ELT related to analytics: 90%
● Ingestion and ETL/ELT related to business operations: 68%
● Training, serving, or generally managing MLOps: 28%
● Spinning up and spinning down infrastructure: 13%
● Other: 3%
90% of respondents use Apache Airflow for ingestion and ETL/ELT tasks associated with analytics, followed by 68% for business operations. Additionally, there’s growing adoption for MLOps (28%) and infrastructure management (13%), highlighting its versatility across various data workflow tasks.

Slide 26

Slide 26 text

The Evolution of Airflow

Slide 27

Slide 27 text

Timeline: Major Milestones
● 2014 Oct: Created at Airbnb
● 2015 June: Open Sourced
● 2016 March: Donated to the Apache Software Foundation (ASF) as an Incubating project
● 2018 Dec: Graduated as a top-level project
● 2020 July: First Airflow Summit
● 2020 Dec: Airflow 2.0 released
● 2025 Mar-Apr (Planned): Airflow 3.0 release

Slide 28

Slide 28 text

Timeline: 2.x Minor Releases
● 2.1: 2021-05
● 2.2: 2021-11
● 2.3: 2022-05
● 2.4: 2022-09
● 2.5: 2022-11
● 2.6: 2023-04
● 2.7: 2023-08
● 2.8: 2023-12
● 2.9: 2024-04

Slide 29

Slide 29 text

Code Contributions & downloads continue to grow!
Downloads: 500K / month
Downloads: 25M / month

Slide 30

Slide 30 text

Today’s Data Workflow Challenges

Slide 31

Slide 31 text

Today’s Data Workflow Challenges
● Increasing Data Volumes: Businesses generate more data than ever. Handling this data & its quality is critical.
● Need for near Real-time Processing: Data workflows are being used to drive critical business decisions in near real-time and hence require reliability & performance guarantees.
● Complexity in Data Workflows: Modern workflows need to handle data from multiple sources, which requires managing complex dependencies & dynamic schedules.
● Intelligent Infrastructure: Infrastructure must be elastic & flexible to optimize for modern workloads.

Slide 32

Slide 32 text

Today’s Data Workflow Challenges
● Additional Interfaces: Net-new teams, from ML to AI, want to get the best out of Airflow without learning a new framework.
● Licensing & Security in OSS: OSS projects owned by a single company have changed licenses too often in the recent past.
● Platform Governance: Visibility, auditability, & lineage across a data platform are need-to-haves.
● Cost Reduction: Tight budgets have pushed teams to utilize resources efficiently to drive operational costs down.

Slide 33

Slide 33 text

How does Airflow address these challenges?

Slide 34

Slide 34 text

Case Study: Texas Rangers
Company: A professional baseball team in Major League Baseball (MLB), based in Arlington, Texas. The Rangers won their first World Series championship in 2023.
Goal: Use data to gain an unfair advantage, Moneyball style!
Data to be collected: Real-time game data streaming, comprehensive player health reporting, predictive analytics of everything from pitch spin to hit trajectory, and more.
Challenge: Scalability issues caused by the volume and unprecedented rate of data, plus an infrastructure bottleneck in their live game analytics pipeline. This impacted the timely delivery of analytics to their team and affected their competitive edge.

Slide 35

Slide 35 text

Case Study: Texas Rangers
Solution: Use Airflow’s worker queues to create dedicated worker pools for CPU-intensive tasks while other tasks run on cheaper workers. With Data-aware Scheduling, their DAGs start when data becomes available instead of on a time-based schedule.
Result:
● Improved Scalability: Using worker queues, DAG completion time reduced by 80% (from 20 mins to 3 mins).
● Increased Efficiency: Optimizing compute resources allowed processing of 4 additional DAGs in parallel, enabling immediate post-game analytics delivery for a competitive edge.
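The Data-aware Scheduling idea above can be illustrated with a toy model: a DAG declares the datasets it needs and is triggered once every one of them has been updated, rather than at a fixed time. This is a plain-Python sketch, not Airflow's API or the Rangers' actual setup; the class, DAG name and dataset names are all hypothetical.

```python
class DatasetScheduler:
    """Toy scheduler that triggers a DAG once all of its datasets are updated."""

    def __init__(self):
        self.required = {}  # DAG name -> datasets it waits for
        self.updated = {}   # DAG name -> datasets updated since its last run
        self.runs = []      # history of triggered DAG names

    def register(self, dag, datasets):
        self.required[dag] = set(datasets)
        self.updated[dag] = set()

    def dataset_updated(self, dataset):
        """Record an update and return the DAGs it triggers, if any."""
        triggered = []
        for dag, needed in self.required.items():
            if dataset not in needed:
                continue
            self.updated[dag].add(dataset)
            if self.updated[dag] == needed:
                triggered.append(dag)
                self.runs.append(dag)
                self.updated[dag] = set()  # wait for fresh data before the next run
        return triggered


scheduler = DatasetScheduler()
scheduler.register("post_game_analytics", ["pitch_data", "player_health"])

first = scheduler.dataset_updated("pitch_data")      # still waiting on player_health
second = scheduler.dataset_updated("player_health")  # all inputs ready, DAG triggers
print(first, second)
```

The benefit over a time-based schedule is that the downstream DAG neither runs early on stale data nor sits idle after the data has already arrived.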

Slide 36

Slide 36 text

Case Study: Bloomberg
Company: Bloomberg is a leading source for financial & economic data: equities, bonds, indices, mortgages, currencies, etc. Founded in 1981 with subscribers in 170+ countries.
Goal: Deliver a diverse array of information, news & analytics to facilitate decision-making.
Challenge: Maintaining custom pipelines for diverse datasets of different domains is expensive & time consuming. Their engineers lacked the domain knowledge to aggregate data into client insights & their domain experts lacked the skills to maintain data pipelines in Production.

Slide 37

Slide 37 text

Case Study: Bloomberg
Solution: Configuration-driven ETL platform leveraging Airflow & dynamic DAGs. User-defined configs are translated into Dynamic DAGs that determine tasks & their dependencies, with success/failure actions.
Result: The Data Platform team now supports 1600+ DAGs, 700+ datasets, 200+ users, 11 different product teams, and 10k+ weekly file ingestions.
Source: https://airflowsummit.org/sessions/2023/airflow-at-bloomberg-leveraging-dynamic-dags-for-data-ingestion/
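A configuration-driven approach of this kind can be sketched as a function that turns a user-defined config into a validated task graph. The config schema, field names and task names below are hypothetical and are not Bloomberg's format; the sketch uses only the standard library and stops short of creating real Airflow DAG objects.

```python
from graphlib import TopologicalSorter

# Hypothetical user-defined config, as a domain expert might write it.
CONFIG = {
    "dag_id": "ingest_bond_prices",
    "tasks": [
        {"id": "download", "depends_on": []},
        {"id": "validate", "depends_on": ["download"]},
        {"id": "publish", "depends_on": ["validate"], "on_failure": "notify_owner"},
    ],
}


def build_dag(config):
    """Translate a config into a dependency graph plus a valid run order."""
    task_ids = {task["id"] for task in config["tasks"]}
    graph = {}
    for task in config["tasks"]:
        unknown = set(task["depends_on"]) - task_ids
        if unknown:
            raise ValueError(f"{task['id']} depends on unknown tasks: {sorted(unknown)}")
        graph[task["id"]] = set(task["depends_on"])
    # static_order() raises graphlib.CycleError if the config contains a cycle.
    order = list(TopologicalSorter(graph).static_order())
    failure_actions = {
        task["id"]: task["on_failure"] for task in config["tasks"] if "on_failure" in task
    }
    return {
        "dag_id": config["dag_id"],
        "graph": graph,
        "order": order,
        "failure_actions": failure_actions,
    }


dag = build_dag(CONFIG)
print(dag["dag_id"], dag["order"])
```

Validating the config up front means domain experts get an immediate error for a misspelled dependency instead of a failing pipeline in production.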

Slide 38

Slide 38 text

Case Study: FanDuel
Company: FanDuel Group is a sports betting company that lives on data, with approximately 17 million customers.
Goal: Business growth led to higher daily data volumes, which fueled demand for new sources and richer analytics.
Challenge: The 2022 NFL season was fast approaching and FanDuel wanted a robust data architecture in anticipation of the company’s busiest time in terms of daily volume of data.

Slide 39

Slide 39 text

Case Study: FanDuel
Solution: They worked with the Astro professional services team to replace Operators with more efficient Deferrable Operators, along with Astro’s auto-scaling features.
Result: The average number of running worker nodes decreased by 35%, resulting in immediate infrastructure cost savings, & average tasks per worker increased by 305%.
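Deferrable operators save resources because a task that is only waiting on an external system releases its worker slot and hands the wait to Airflow's triggerer, which runs many waits concurrently on an asyncio event loop. The sketch below imitates that effect with plain asyncio; it is not Airflow code, and the job names and wait times are hypothetical.

```python
import asyncio
import time


async def wait_for_external_job(job_id, seconds):
    """Stand-in for waiting on an external system without blocking a worker."""
    await asyncio.sleep(seconds)
    return f"{job_id}: done"


async def run_deferred(jobs):
    """Run all waits concurrently on a single event loop."""
    return await asyncio.gather(
        *(wait_for_external_job(job_id, seconds) for job_id, seconds in jobs)
    )


# 50 hypothetical jobs that each wait 0.1 s.
# Run one after another on a blocking worker, they would need about 5 s in total.
jobs = [(f"job_{i}", 0.1) for i in range(50)]

start = time.perf_counter()
results = asyncio.run(run_deferred(jobs))
elapsed = time.perf_counter() - start

print(len(results), f"{elapsed:.2f}s")
```

Because the waits overlap, one process covers work that would otherwise keep many worker slots busy doing nothing, which is the mechanism behind fewer worker nodes and more tasks per worker.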

Slide 40

Slide 40 text

Other Interesting Case Studies
● Grindr has saved $600,000 in Snowflake costs by monitoring their Snowflake usage across the organization with Airflow.
● Condé Nast has reduced costs by 54% by using deferrable operators.
● Airline: A tool powered by Airflow, built by Astronomer’s Customer Reliability Engineering (CRE) team, that monitors Airflow deployments and sends alerts proactively when issues arise.

Slide 41

Slide 41 text

Other Interesting Case Studies
● King uses ‘data reliability engineering as code’ tools such as SodaCore within Airflow pipelines to detect, diagnose and inform about data issues to create coverage, improve quality & accuracy and help eliminate data downtime.
● Laurel.ai: A pioneering AI company that automates time and billing for professional services. Uses multiple domain-specific LLMs to create billing timesheets from users’ footprints across their workflows & tools (Zoom, MS Teams, etc.). Airflow orchestrates their entire GenAI lifecycle: data extraction, model tuning & feedback loops.
● Ask Astro: An end-to-end example of a Q&A LLM application used to answer questions about Apache Airflow and Astronomer.

Slide 42

Slide 42 text

The Future of Apache Airflow

Slide 43

Slide 43 text

Airflow 3
Make Airflow the foundation for Data, ML, and Gen AI orchestration for the next 5 years.
1. Enable secure remote task execution across network boundaries.
2. Integrate data awareness needed for governance and compliance.
3. Enable non-Python tasks, for integration with any language.
4. Enable versioning of DAGs and Datasets.
5. Single-command local install for learning and experimentation.

Slide 44

Slide 44 text

Thank You
A friendly reminder to RSVP to Airflow Summit 2024:
● Celebrating 10 Years of Airflow
● Sept. 10th-12th
● The Westin St. Francis
● San Francisco, CA
Airflow Summit Discount Code: 15DISC_MEETUP
@kaxil