Slide 2

Slide 2 text

WHO AM I?
• Studied in Budapest & Munich
• Settled in Vienna
• Backend developer -> Platform engineer
• Data Platform 2022+
• https://speakerdeck.com/posedio/
• Endurance sports
• For more, catch me @ Vienna Marathon
• Owner of two dogs
Tamas

Slide 7

Slide 7 text

AGENDA
1. Airflow 101
2. Do I need an orchestrator?
3. Airflow’s architecture (“the cost”)
4. Cloud Composer
5. Multi-tenancy in Airflow

Slide 9

Slide 9 text

1. AIRFLOW

Slide 10

Slide 10 text

APACHE AIRFLOW
• 2015: Airbnb needed a tool to author, iterate on, and monitor batch data pipelines (Article)
• 2016: Airflow joined the Apache Incubator
• 2019: Airflow graduated to an Apache top-level project
• 2020+: Airflow 2.0 released, growing community

Slide 11

Slide 11 text

AIRFLOW SUMMIT 2024
• 3 days, ~200 talks

Slide 12

Slide 12 text

ASTRONOMER
• 2018: SaaS offering for Airflow called Astro
• Solid documentation on Airflow
• Integration of 3rd-party tools (dbt ~ data build tool)
• Enterprise-tier features for Airflow (multi-tenancy with workspaces etc.)

Slide 13

Slide 13 text

AIRFLOW UI: DAG OVERVIEW

Slide 14

Slide 14 text

AIRFLOW UI: GRAPH

Slide 15

Slide 15 text

AIRFLOW UI: AUDIT LOG

Slide 16

Slide 16 text

2. DO I REALLY NEED IT?

Slide 17

Slide 17 text

SESAME INC.
• Inputs:
  • Recipe
  • Ingredients (~stock level)
• Processes:
  • Configure mixer
  • Configure oven
  • Configure packaging
• Start production:
  • 23:00

Slide 18

Slide 18 text

SESAME INC.

Slide 19

Slide 19 text

AIRFLOW
• DAG (directed acyclic graph)
• Tasks:
  • Fetch recipe and stock
  • Configure mixer
  • Configure oven
  • Start production
  • Monitor packaging
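
The task list above can be read as a small dependency graph. The sketch below is not Airflow code; it uses Python's standard `graphlib` to show what "directed acyclic graph" buys you: an execution order, and groups of tasks that may run in parallel. The edges between tasks are an assumption, since the slide only names the tasks.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# The dependency edges are assumed; the slide lists the tasks only.
cookie_dag = {
    "fetch_recipe_and_stock": set(),
    "configure_mixer": {"fetch_recipe_and_stock"},
    "configure_oven": {"fetch_recipe_and_stock"},
    "start_production": {"configure_mixer", "configure_oven"},
    "monitor_packaging": {"start_production"},
}


def execution_waves(graph):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(cookie_dag)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Under these assumed edges, the mixer and the oven end up in the same wave, so an orchestrator could configure both at the same time.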

Slide 21

Slide 21 text

CRON
• 0 8 * * * /usr/bin/python start.py
• Steps:
  • Fetch recipe and stock levels
  • Configure mixer
  • Configure oven
  • Start production
  • Monitor packaging

Slide 22

Slide 22 text

CATCHUP
• The machine didn’t work over the weekend => the Sat, Sun, (Mon) cookies are missing.
• Cron: ???
• Airflow:
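
To make the catchup idea concrete: a scheduler that remembers its last successful run can enumerate every interval it missed, which plain cron cannot do. The sketch below is a plain-Python illustration, not Airflow's scheduler; the function name and the dates are hypothetical. In Airflow this behaviour is controlled by the DAG's `catchup` argument.

```python
from datetime import date, timedelta


def missed_daily_runs(last_successful_run, today):
    """Return every daily logical date after the last run, up to and including today."""
    missed = []
    day = last_successful_run + timedelta(days=1)
    while day <= today:
        missed.append(day)
        day += timedelta(days=1)
    return missed


# Hypothetical dates: last batch on Friday 2024-11-15, machine back on Monday 2024-11-18.
missed = missed_daily_runs(date(2024, 11, 15), date(2024, 11, 18))
print([day.isoformat() for day in missed])
```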

Slide 23

Slide 23 text

BACKFILL
• The cookies have gone bad while stored in the storage room. Can we re-run the batch?
• Cron: ???
• Airflow:
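
A backfill re-runs a task for a chosen range of past dates. It only works cleanly if each run is keyed by its logical date, so that a re-run replaces the earlier result instead of adding a second one. The sketch below illustrates that property in plain Python; `bake_batch`, `backfill` and `storage_room` are hypothetical names, not Airflow APIs. Airflow 2 exposes the operation on the command line as `airflow dags backfill`.

```python
from datetime import date, timedelta


def bake_batch(logical_date):
    """Hypothetical task: produce the batch that belongs to one logical date."""
    return f"cookies-{logical_date.isoformat()}"


def backfill(task, start, end, store):
    """Re-run `task` for every day in [start, end], overwriting earlier results."""
    day = start
    while day <= end:
        # Keyed by logical date, so a re-run replaces the old batch.
        store[day] = task(day)
        day += timedelta(days=1)
    return store


storage_room = {date(2024, 11, 16): "spoiled", date(2024, 11, 17): "spoiled"}
backfill(bake_batch, date(2024, 11, 16), date(2024, 11, 17), storage_room)
print(storage_room)
```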

Slide 24

Slide 24 text

LINKEDIN: AIRFLOW BACKFILL PLUGIN

Slide 25

Slide 25 text

SENSORS
• Check whether a condition is met, at a specific interval.
  • timeout
  • poke_interval
• Types:
  • GCS (files)
  • SQL
  • …
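
The two sensor parameters on this slide can be illustrated with a small polling loop. This is a sketch of the idea only, not Airflow's sensor implementation: `wait_for`, `SensorTimeout` and `FakeClock` are hypothetical names, and the fake clock exists just so the example finishes instantly.

```python
import time


class SensorTimeout(Exception):
    """Raised when the condition is not met before the timeout (stand-in name)."""


def wait_for(condition, poke_interval, timeout, clock=time.monotonic, sleep=time.sleep):
    """Check `condition` every `poke_interval` seconds; give up once `timeout` has elapsed."""
    started = clock()
    pokes = 0
    while True:
        pokes += 1
        if condition():
            return pokes
        if clock() - started >= timeout:
            raise SensorTimeout(f"condition not met within {timeout}s")
        sleep(poke_interval)


class FakeClock:
    """Simulated time, so the usage example does not really sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
file_arrives_at = 25.0  # hypothetical: the awaited file shows up after 25 seconds

pokes = wait_for(
    condition=lambda: clock.now >= file_arrives_at,
    poke_interval=10,
    timeout=60,
    clock=clock.time,
    sleep=clock.sleep,
)
print(pokes)
```

A shorter `poke_interval` notices the file sooner but checks more often; the `timeout` bounds how long the task can sit waiting.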

Slide 26

Slide 26 text

3. WHAT’S THE COST?

Slide 27

Slide 27 text

AIRFLOW ARCHITECTURE

Slide 28

Slide 28 text

AIRFLOW ARCHITECTURE: SCHEDULER

Slide 30

Slide 30 text

AIRFLOW ARCHITECTURE

Slide 31

Slide 31 text

AIRFLOW EXECUTION ENVIRONMENT
• Local machine:
  • BashOperator
  • PythonOperator
  • PythonVirtualenvOperator
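
The operators listed above all run on the worker machine itself, but with different degrees of isolation: a Python callable can run inside the worker's own interpreter, while a shell command or a virtualenv-based task runs in a separate process. The sketch below shows that distinction with the standard library only; it does not use Airflow, and both helper names are hypothetical.

```python
import subprocess
import sys


def run_in_process(func, *args):
    """Call a Python function inside the current interpreter."""
    return func(*args)


def run_in_subprocess(code):
    """Run Python source in a fresh interpreter process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    return result.stdout.strip()


in_process = run_in_process(lambda grams: grams * 2, 250)
isolated = run_in_subprocess("print(250 * 2)")
print(in_process, isolated)
```

Either way, the work consumes CPU and memory on the worker, which is the motivation for the remote options on the following slide.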

Slide 34

Slide 34 text

AIRFLOW EXECUTION ENVIRONMENT
• Remote:
  • Celery
  • KubernetesPodOperator
  • GKEStartPodOperator

Slide 36

Slide 36 text

CRON VS AIRFLOW
• Job monitoring (UI vs cron log)
• Job overview (graph vs 1000s of lines)
• Business logic decoupled from scheduling
• Efficient remote execution (K8s Pods)
• Comes at the price of having to manage Airflow…

Slide 37

Slide 37 text

4. CLOUD COMPOSER

Slide 38

Slide 38 text

COMPOSER V2

Slide 39

Slide 39 text

COMPOSER V3 ~ “SERVERLESS”

Slide 41

Slide 41 text

5. MULTI-TENANCY

Slide 42

Slide 42 text

LET’S ONBOARD!
• Guy with a shady use-case
• Business Intelligence
• Finance
• I don’t even know who you are…
• Data Analysts
• Her friends…

Slide 43

Slide 43 text

LIMITING ACCESS: VISIBILITY & EXECUTION

Slide 44

Slide 44 text

LIMITING ACCESS: PER-FOLDER ROLE REGISTRATION
• rbac_autoregister_per_folder_roles -> True
• Default role: UserNoDags
• Airflow now generates roles: finance, da
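
The effect of per-folder roles can be pictured as a mapping from a DAG file's folder to a role, combined with a default role that grants no DAG access. The sketch below is only a model of that idea under stated assumptions: it assumes the role name equals the first-level subfolder under the DAGs folder, and `folder_role`, `visible_dags` and the file names are hypothetical. How Cloud Composer matches folders internally is not shown on the slide.

```python
from pathlib import PurePosixPath


def folder_role(dag_file, dags_root="dags"):
    """Return the first-level subfolder under `dags_root`, or None for files in the root."""
    parts = PurePosixPath(dag_file).relative_to(dags_root).parts
    return parts[0] if len(parts) > 1 else None


def visible_dags(user_roles, dag_files):
    """Keep only DAG files whose folder role is among the user's roles."""
    return [path for path in dag_files if folder_role(path) in user_roles]


# Hypothetical DAG files for the two tenants named on the slide.
dag_files = [
    "dags/finance/invoices.py",
    "dags/da/report.py",
    "dags/shared.py",
]

# A new user starts with UserNoDags only; access appears once a folder role is granted.
new_user = visible_dags({"UserNoDags"}, dag_files)
finance_user = visible_dags({"UserNoDags", "finance"}, dag_files)
print(new_user, finance_user)
```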

Slide 45

Slide 45 text

LIMITING ACCESS TO RESOURCES

Slide 46

Slide 46 text

DYNAMIC DAG GENERATION
• Dynamic DAGs and Dynamic Tasks
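
Dynamic generation means that DAGs and their tasks are produced from configuration in a loop instead of being written out by hand. The sketch below shows the pattern with a stand-in `DagSpec` dataclass rather than the real Airflow `DAG` class; the tenant configuration, the DAG naming scheme and `build_dags` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DagSpec:
    """Stand-in for a DAG object (hypothetical, not the Airflow class)."""

    dag_id: str
    schedule: str
    tasks: list = field(default_factory=list)


# Hypothetical per-tenant configuration; it could equally come from a file or a database.
TENANTS = {
    "finance": {"schedule": "0 6 * * *", "tables": ["invoices", "payments"]},
    "da": {"schedule": "0 7 * * *", "tables": ["events"]},
}


def build_dags(tenants):
    """Generate one DAG spec per tenant and one load task per configured table."""
    dags = {}
    for tenant, config in tenants.items():
        dag_id = f"{tenant}_daily_load"
        tasks = [f"load_{table}" for table in config["tables"]]
        dags[dag_id] = DagSpec(dag_id=dag_id, schedule=config["schedule"], tasks=tasks)
    return dags


generated = build_dags(TENANTS)
for spec in generated.values():
    print(spec.dag_id, spec.schedule, spec.tasks)
```

Adding a tenant or a table then becomes a configuration change rather than a new hand-written DAG file.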

Slide 47

Slide 47 text

DYNAMIC DAG GENERATION: CI/CD

Slide 48

Slide 48 text

ASTRONOMER: DAG-FACTORY

Slide 49

Slide 49 text

YAML? AGAIN?!

Slide 50

Slide 50 text

YAML? AGAIN?! +

Slide 51

Slide 51 text

SECURE? YES! BUT AT WHAT COST?

Slide 52

Slide 52 text

6. CLUSTER POLICIES

Slide 53

Slide 53 text

AIRFLOW CLUSTER POLICIES
• DAG/Task Validation
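
A validating policy is a function that inspects each DAG at load time and rejects it if it breaks a platform rule. In Airflow such a function is named `dag_policy`, lives in `airflow_local_settings.py`, and signals rejection by raising `AirflowClusterPolicyViolation`. The sketch below mimics that shape without importing Airflow: `PolicyViolation` and `FakeDag` are stand-ins, and the two rules (tags required, tenant prefix in the DAG id) are hypothetical examples.

```python
class PolicyViolation(Exception):
    """Stand-in for airflow.exceptions.AirflowClusterPolicyViolation."""


class FakeDag:
    """Minimal stand-in exposing only the attributes the policy reads."""

    def __init__(self, dag_id, tags=()):
        self.dag_id = dag_id
        self.tags = list(tags)


ALLOWED_TENANTS = ("finance", "da")  # hypothetical tenant names


def find_violations(dag):
    """Return a list of human-readable rule violations for one DAG."""
    problems = []
    if not dag.tags:
        problems.append(f"{dag.dag_id}: at least one tag is required")
    prefixes = tuple(f"{tenant}_" for tenant in ALLOWED_TENANTS)
    if not dag.dag_id.startswith(prefixes):
        problems.append(f"{dag.dag_id}: dag_id must start with a tenant prefix")
    return problems


def dag_policy(dag):
    """Reject the DAG by raising if any rule is violated."""
    problems = find_violations(dag)
    if problems:
        raise PolicyViolation("; ".join(problems))


dag_policy(FakeDag("finance_daily_load", tags=["finance"]))  # passes silently
print(find_violations(FakeDag("rogue_pipeline")))
```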

Slide 54

Slide 54 text

AIRFLOW CLUSTER POLICIES
• Mutation
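
A mutating policy does not reject anything; it rewrites task attributes so that platform defaults always apply. In Airflow the hook for this is `task_policy`, which receives each task and may modify it in place. The sketch below uses a stand-in `FakeTask` instead of a real operator, and the limits are hypothetical values chosen for illustration.

```python
from datetime import timedelta

MAX_TIMEOUT = timedelta(hours=2)  # hypothetical platform-wide limit
MAX_RETRIES = 3  # hypothetical platform-wide limit


class FakeTask:
    """Minimal stand-in for an operator with the attributes the policy touches."""

    def __init__(self, task_id, execution_timeout=None, retries=0):
        self.task_id = task_id
        self.execution_timeout = execution_timeout
        self.retries = retries


def task_policy(task):
    """Mutate the task in place: enforce a timeout ceiling and cap the retries."""
    if task.execution_timeout is None or task.execution_timeout > MAX_TIMEOUT:
        task.execution_timeout = MAX_TIMEOUT
    task.retries = min(task.retries, MAX_RETRIES)


greedy = FakeTask("load_invoices", execution_timeout=timedelta(hours=12), retries=10)
task_policy(greedy)
print(greedy.execution_timeout, greedy.retries)
```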

Slide 55

Slide 55 text

LIMITING ACCESS
1. Ensure the DAG lands in the correct folder
2. (Optionally) simulate DAG parsing & loading
3. Cluster policies ~> validate and/or mutate

Slide 56

Slide 56 text

WRAPPING UP!
• If you find yourself hand-crafting orchestration logic, consider frameworks (Airflow, Dagster, Prefect)
• To reduce OpEx, consider managed solutions (Cloud Composer)
• Look into the KubernetesPodOperator
• Avoid YAML-hell
• Use hooks and cherish the joy of Python

Slide 57

Slide 57 text

21.11.2024
