Slide 1

Slide 1 text

Introduction to Airflow - A dataflow engine

Slide 2

Slide 2 text

Search for the perfect… no, an acceptable ETL tool
- Criteria
  - Configuration as code
  - Easily customizable
  - Modular components
  - HA
  - UI
  - Easy to manage
  - ...

Slide 3

Slide 3 text

Search for the perfect tool… Informatica

Slide 4

Slide 4 text

Search for the perfect… no, an acceptable ETL tool
- Criteria
  - Configuration as code
  - Easily customizable
  - Modular components
  - HA
  - UI
  - Easy to manage
  - ...

Slide 5

Slide 5 text

Search for the perfect tool… Matillion

Slide 6

Slide 6 text

Search for the perfect tool… Matillion
- Criteria
  - Configuration as code (an unreadable 2,000 lines of JSON nested 5 levels deep for a simple job)
  - Easily customizable
  - Modular components
  - HA
  - UI
  - Easy to manage

Slide 7

Slide 7 text

Search for the perfect tool…

Slide 8

Slide 8 text

Search for the perfect tool… Conclusion
- ETL tools are targeted towards non-developers
- Less coding, more operating
- Less automation, more manual work
- More fuss
- Less fun
- (Other harsh comments you can imagine)

Slide 9

Slide 9 text

Why just settle for an ETL tool? Let’s code it up!
- Workflow/Pipeline?
  - If I find a suitable workflow management platform and/or pipeline framework to work with, I can create my own data pipeline!
  - Turns out several companies have already built those for us (Pinterest, Spotify, Airbnb, etc.).

Slide 10

Slide 10 text

Why just settle for an ETL tool? Let’s code it up!

Slide 11

Slide 11 text

Why just settle for an ETL tool? Let’s code it up!
- Workflow/Pipeline tools?
  - Pinball (Pinterest)
    - Python-based
    - Subpar UI
    - Dead community
    - Not containerized properly

Slide 12

Slide 12 text

Why just settle for an ETL tool? Let’s code it up!
- Workflow/Pipeline tools?
  - Luigi (Spotify)

Slide 13

Slide 13 text

Why just settle for an ETL tool? Let’s code it up!
- Workflow/Pipeline tools?
  - Luigi (Spotify)
    - Minimal UI
    - Does not have a scheduler (relies on cron jobs)
    - Still under somewhat active development
    - Each task is a class
      - Do I have to build 100+ classes separately? lolz

Slide 15

Slide 15 text

Why just settle for an ETL tool? Let’s code it up!
- Workflow/Pipeline tools?
  - Airflow (Airbnb)

Slide 16

Slide 16 text

Why just settle for an ETL tool? Let’s code it up!
- Workflow/Pipeline tools?
  - Airflow (Airbnb -> Apache)
    - Pretty good UI
    - Dynamic task and DAG creation! (Will show you this)
    - Active community
    - Modular architecture
    - Dockerized and Helm chartified!

Slide 17

Slide 17 text

Why just settle for an ETL tool? Let’s code it up!
- Workflow/Pipeline tools?
  - AWS Data Pipeline (No, you do not use this)
    - Debugging hell
  - Others (Oozie, Azkaban, etc.)

Slide 18

Slide 18 text

Airflow - In a nutshell
- Platform to programmatically author, schedule and monitor workflows in Python
- http://airflow.apache.org/
- Apache incubating
- Dockerized
- A stable Kubernetes Helm chart (Dec 2018)

Slide 19

Slide 19 text

Airflow - How Airbnb uses it
- Data warehousing:
  - Cleanse, organize, data quality check, and publish data into the data warehouse
- Growth analytics:
  - Compute metrics around guest and host engagement as well as growth accounting
- Experimentation:
  - Compute A/B testing experimentation framework logic and aggregates

Slide 20

Slide 20 text

Airflow - Agenda
- Motivation
- Architecture
- Key Themes/Principles
- Sample project structure
- Basic concepts

Slide 21

Slide 21 text

Airflow - Motivations
- Configuration as code
  - Was already built into our codebase.
  - But in an unfavorable way.
- Prepare the architecture for more complex, robust pipelines
- Promotes cleaner code

Slide 22

Slide 22 text

Airflow - Architecture
- WebUI
- Worker
- Scheduler
- Job Queue
- Metadata Store
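How the components on this slide relate can be sketched as a toy, single-process model. The slide only lists the components, so the interactions below are an assumption about the usual arrangement: the scheduler queues work, a worker runs it, and the web UI only reads state from the metadata store. None of this is Airflow code, and the task names are made up.

```python
from collections import deque

metadata_store = {}   # task state, shared by every component
job_queue = deque()   # tasks the scheduler has decided to run


def scheduler(ready_tasks):
    """Record each ready task as queued and hand it to the job queue."""
    for task in ready_tasks:
        metadata_store[task] = "queued"
        job_queue.append(task)


def worker():
    """Drain the job queue, updating the metadata store as tasks progress."""
    while job_queue:
        task = job_queue.popleft()
        metadata_store[task] = "running"
        # ... the real work of the task would happen here ...
        metadata_store[task] = "success"


def web_ui():
    """Read-only view: the UI never talks to workers, only to the store."""
    return [f"{task}: {state}" for task, state in sorted(metadata_store.items())]


scheduler(["crawl", "clean"])
before = web_ui()
worker()
after = web_ui()
```

In this model, scaling out would mean running more `worker()` loops against the same queue and store.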

Slide 23

Slide 23 text

Airflow - Key Themes/Principles
- Extensible
  - Hook
    - Connect to any data source using the prebuilt hooks, or create one if you need to
  - Operator
    - Simple, functional modules that can operate on any data you can access using hooks
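The hook/operator split can be sketched in plain Python. This is a toy model that does not import Airflow; `InMemoryHook` and `CopyOperator` are hypothetical names invented for illustration, not classes from the Airflow codebase.

```python
class InMemoryHook:
    """Hypothetical hook: owns the connection details for one data source."""

    def __init__(self, store):
        self.store = store  # a dict stands in for a real connection

    def read(self, key):
        return self.store[key]

    def write(self, key, value):
        self.store[key] = value


class CopyOperator:
    """Hypothetical operator: one small job, done through hooks only."""

    def __init__(self, source_hook, dest_hook, key):
        self.source_hook = source_hook
        self.dest_hook = dest_hook
        self.key = key

    def execute(self):
        # The operator never touches storage directly; it only uses hooks.
        self.dest_hook.write(self.key, self.source_hook.read(self.key))


source = InMemoryHook({"report.csv": "a,b\n1,2\n"})
dest = InMemoryHook({})
CopyOperator(source, dest, "report.csv").execute()
```

Because the operator depends only on the hook interface, swapping the source for a different data source would mean writing a new hook, not a new operator.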

Slide 24

Slide 24 text

Airflow - Key Themes/Principles
- https://github.com/apache/airflow/blob/master/airflow/contrib/operators/s3_copy_object_operator.py
- https://github.com/apache/airflow/blob/master/airflow/contrib/hooks/aws_dynamodb_hook.py

Slide 25

Slide 25 text

Airflow - Key Themes/Principles
- Dynamic (Configuration as code)
  - Airflow workflows/pipelines are built with Python. This allows for writing code that instantiates pipelines dynamically.
  - Combine Python with YAML to author pipelines
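A minimal sketch of the "code that instantiates pipelines" idea, without importing Airflow. The config entries (`news`, `videos`) and the `build_pipeline` helper are hypothetical, and JSON stands in for YAML only to stay within the standard library.

```python
import json

# Stand-in for a YAML file checked into the repository.
CONFIG = json.loads("""
[
  {"name": "news",   "steps": ["crawl", "clean", "deploy"]},
  {"name": "videos", "steps": ["crawl", "clean", "group", "deploy"]}
]
""")


def build_pipeline(entry):
    """Return (pipeline_id, edges), where each edge links consecutive steps."""
    task_ids = [f"{entry['name']}.{step}" for step in entry["steps"]]
    edges = list(zip(task_ids, task_ids[1:]))
    return entry["name"], edges


# One pipeline per config entry: adding an entry adds a pipeline, no new code.
pipelines = dict(build_pipeline(entry) for entry in CONFIG)
```

In a real Airflow project the loop body would presumably create DAG and task objects instead of edge lists, but the shape of the loop stays the same.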

Slide 26

Slide 26 text

Airflow - Key Themes/Principles

Slide 27

Slide 27 text

Airflow - Key Themes/Principles
- Scalable
  - Runs on Kubernetes
  - Scale the number of worker nodes to match the workload

Slide 28

Slide 28 text

Airflow - Basic Concepts
- Scenario: Build a pipeline to crawl, process (clean), group, and deploy contents
- Different content types might need:
  - Different configurations (API token, crawling rates, etc.)
  - Different connections

Slide 29

Slide 29 text

Airflow - Basic Concepts

Slide 30

Slide 30 text

Airflow - Basic Concepts
- Operator (How)
  - Single responsibility
  - Implements one of these methods
    - poke()
    - execute()
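The difference between the two methods can be sketched with toy classes. The names `ToyOperator`, `ToySensor` and `FileArrivedSensor` are hypothetical and nothing here imports Airflow; the assumption is that `execute()` does the work once, while a sensor's `poke()` is called repeatedly until it returns True.

```python
import time


class ToyOperator:
    """Does its work once in execute()."""

    def execute(self, context):
        raise NotImplementedError


class ToySensor(ToyOperator):
    """Waits: execute() keeps calling poke() until it returns True."""

    def __init__(self, poke_interval=0.0, max_pokes=10):
        self.poke_interval = poke_interval
        self.max_pokes = max_pokes

    def poke(self, context):
        raise NotImplementedError

    def execute(self, context):
        for attempt in range(1, self.max_pokes + 1):
            if self.poke(context):
                return attempt  # number of checks it took
            time.sleep(self.poke_interval)
        raise TimeoutError("condition never became true")


class FileArrivedSensor(ToySensor):
    def poke(self, context):
        context["checks"] += 1
        # Pretend the file shows up on the third check.
        return context["checks"] >= 3


context = {"checks": 0}
attempts = FileArrivedSensor().execute(context)
```

A subclass therefore only has to answer one question, "do the work" or "is the condition met yet", which is what keeps each operator to a single responsibility.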

Slide 31

Slide 31 text

Airflow - Basic Concepts (Operator)
- Sample Operators
  - https://github.com/apache/airflow/blob/master/airflow/contrib/operators/s3_copy_object_operator.py

Slide 32

Slide 32 text

Airflow - Basic Concepts (Task)
- Task
  - An instantiated operator
  - Parametrized with configs
  - Best if idempotent
- Task Instance
  - Part of a DAG
  - Has a state (‘running’, ‘failed’, ‘success’, etc.)
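Why idempotent tasks are preferable can be shown with a small contrast. This is an illustrative sketch only; `append_rows`, `load_partition` and the dict-based tables are hypothetical stand-ins for a real load step and warehouse.

```python
def append_rows(table, day, rows):
    """Not idempotent: every re-run adds the same rows again."""
    table.setdefault(day, []).extend(rows)


def load_partition(table, day, rows):
    """Idempotent: replaces the whole partition for `day`."""
    table[day] = list(rows)


appended = {}
replaced = {}
for _ in range(2):  # the second pass simulates a retry or a manual re-run
    append_rows(appended, "2018-12-01", [1, 2, 3])
    load_partition(replaced, "2018-12-01", [1, 2, 3])
```

After the re-run, `appended` holds duplicated rows while `replaced` is unchanged, so only the second task is safe to retry.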

Slide 33

Slide 33 text

Airflow - Basic Concepts
- DAG
  - Basically a workflow definition
  - By default, a downstream task does not execute if an upstream task fails.
  - More complex logic can be applied (e.g., execute the downstream task when 60% of upstream tasks succeed)
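The default rule and the 60% variant can be sketched with a toy runner built on the standard library's `graphlib` (Python 3.9+). This is not how Airflow implements it; `run_dag`, `min_success_ratio` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(upstreams, actions, min_success_ratio=None):
    """Run tasks in dependency order and return {task: state}.

    upstreams: {task: set of upstream tasks}
    actions: {task: zero-argument callable; raising means failure}
    min_success_ratio: {task: fraction of upstreams that must succeed},
        defaulting to 1.0 (all upstreams must succeed).
    """
    min_success_ratio = min_success_ratio or {}
    states = {}
    for task in TopologicalSorter(upstreams).static_order():
        parents = upstreams.get(task, set())
        succeeded = sum(states[p] == "success" for p in parents)
        required = min_success_ratio.get(task, 1.0)
        if parents and succeeded / len(parents) < required:
            states[task] = "upstream_failed"  # skipped, never executed
            continue
        try:
            actions[task]()
            states[task] = "success"
        except Exception:
            states[task] = "failed"
    return states


def ok():
    pass


def boom():
    raise RuntimeError("crawl failed")


upstreams = {
    "crawl_a": set(), "crawl_b": set(), "crawl_c": set(),
    "group": {"crawl_a", "crawl_b", "crawl_c"},
    "deploy": {"group"},
}
actions = {"crawl_a": ok, "crawl_b": ok, "crawl_c": boom, "group": ok, "deploy": ok}

strict = run_dag(upstreams, actions)
relaxed = run_dag(upstreams, actions, min_success_ratio={"group": 0.6})
```

With one of three crawls failing, `strict` stops at `group`, while `relaxed` lets `group` and `deploy` run because two out of three upstream tasks clear the 60% threshold.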

Slide 34

Slide 34 text

Airflow - Basic Concepts

Slide 35

Slide 35 text

Airflow - Advanced Concepts
- Dynamic DAG generation
  - sample?
- Pools
  - Resource pooling to prevent resource contention
- Backfill
  - Retroactively apply/aggregate/calculate over past data
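Pools and backfill can be sketched together: find the days that have no completed run, then recompute them while a pool caps how many run at once. Everything here is a toy model under stated assumptions; `Pool`, `missing_days`, `recompute` and the dates are hypothetical, and this is not Airflow's own pool or backfill machinery.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


class Pool:
    """Limits how many tasks use a shared resource at the same time."""

    def __init__(self, slots):
        self._slots = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest concurrency observed

    def run(self, fn, *args):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self.running -= 1


def missing_days(start, end, already_done):
    """Days in [start, end] that have no completed run yet."""
    days = []
    day = start
    while day <= end:
        if day not in already_done:
            days.append(day)
        day += timedelta(days=1)
    return days


def recompute(day):
    return f"recomputed {day.isoformat()}"


done = {date(2018, 12, 2)}
to_backfill = missing_days(date(2018, 12, 1), date(2018, 12, 4), done)

pool = Pool(slots=2)
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(lambda d: pool.run(recompute, d), to_backfill))
```

Even with four worker threads available, `pool.peak` never exceeds 2, which is the point of a pool: the limit belongs to the contended resource, not to the workers.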

Slide 36

Slide 36 text

Airflow - Sample Buzzvil DAG (1)
- Sync_mysql_redshift
  - Creates DAGs by looping through each sync configuration
  - Simple 5-node DAG

Slide 37

Slide 37 text

Airflow - Sample Buzzvil DAG (2)
- Mysql_redshift_migration
  - For when MySQL schema changes should be propagated to Redshift
  - Just add a migration config

Slide 38

Slide 38 text

References
- https://airflow.apache.org/
- https://www.slideshare.net/dbenz/flow-is-in-the-air-best-practices-of-building-analytical-data-pipelines-with-apache-airflow-pyconde-2017
- https://www.astronomer.io/guides/airflow-vs-luigi/
- https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8
- https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8