
Introduction to Airflow - A dataflow engine

Buzzvil
January 16, 2019

Transcript

  1. Search for the perfect… no, an acceptable ETL tool
     - Criteria:
       - Configuration as code
       - Easily customizable
       - Modular components
       - HA (high availability)
       - UI
       - Easy to manage
       - ...
  2. Search for the perfect tool… Matillion
     - Criteria:
       - Configuration as code (an unreadable 2,000-line JSON, nested five levels deep, for a simple job)
       - Easily customizable
       - Modular components
       - HA
       - UI
       - Easy to manage
  3. Search for the perfect tool… Conclusion
     - ETL tools are targeted towards non-developers
     - Less coding, more operating
     - Less automation, more manual work
     - More fuss, less fun
     - (Other harsh comments you can imagine)
  4. Why just settle for an ETL tool? Let’s code it up!
     - Workflow/Pipeline?
     - If I find a suitable workflow management platform and/or pipeline framework to work with, I can create my own data pipeline!
     - It turns out several companies have built those for us (Pinterest, Spotify, Airbnb, etc.).
  5. Why just settle for an ETL tool? Let’s code it up!
     - Workflow/Pipeline tools?
     - Pinball (Pinterest)
       - Python based
       - Subpar UI
       - Dead community
       - Not containerized properly
  6. Why just settle for an ETL tool? Let’s code it up!
     - Workflow/Pipeline tools?
     - Luigi (Spotify)
       - Minimal UI
       - Does not have a scheduler (relies on cron jobs)
       - Still some active development going on
       - Each task is a class. Do I have to build 100+ classes separately? lolz
  7. Why just settle for an ETL tool? Let’s code it up!
     - Workflow/Pipeline tools?
     - Airflow (Airbnb -> Apache)
       - Pretty good UI
       - Dynamic task and DAG creation! (Will show you this)
       - Active community
       - Modular architecture
       - Dockerized and Helm-chartified!
  8. Why just settle for an ETL tool? Let’s code it up!
     - Workflow/Pipeline tools?
     - AWS Data Pipeline (no, you do not use this)
       - Debugging hell
     - Others (Oozie, Azkaban, etc.)
  9. Airflow - In a nutshell
     - A platform to programmatically author, schedule, and monitor workflows in Python
     - http://airflow.apache.org/
     - Apache incubating
     - Dockerized
     - A stable Kubernetes Helm chart (Dec 2018)
  10. Airflow - How Airbnb uses it
      - Data warehousing: cleanse, organize, quality-check, and publish data into the data warehouse
      - Growth analytics: compute metrics around guest and host engagement as well as growth accounting
      - Experimentation: compute A/B testing experimentation framework logic and aggregates
  11. Airflow - Agenda
      - Motivation
      - Architecture
      - Key themes/principles
      - Sample project structure
      - Basic concepts
  12. Airflow - Motivations
      - Configuration as code was already built into our codebase, but in an unfavorable way.
      - Prepare the architecture for more complex, robust pipelines
      - Promotes cleaner code
  13. Airflow - Architecture
      - Web UI
      - Worker
      - Scheduler
      - Job queue
      - Metadata store
  14. Airflow - Key Themes/Principles
      - Extensible
        - Hook: connect to any data source using the prebuilt hooks, or create your own if you need to
        - Operator: simple, functional modules that operate on any data you can reach through a hook (see the sketch below)
  15. Airflow - Key Themes/Principles
      - Dynamic (configuration as code)
        - Airflow workflows/pipelines are built with Python. This allows for writing code that instantiates pipelines dynamically.
        - Combine Python with YAML to author pipelines (see the sketch below)
  16. Airflow - Key Themes/Principles
      - Scalable
        - Runs on Kubernetes
        - Scale the number of worker nodes to match the workload
  17. Airflow - Basic Concepts
      - Scenario: build a pipeline to crawl, process (clean), group, and deploy contents
      - Different content types might need:
        - Different configurations (API token, crawling rates, etc.)
        - Different connections (see the sketch below)
  18. Airflow - Basic Concepts
      - Operator (the "how")
        - Single responsibility
        - Implements one of these methods: poke() or execute()
  19. Airflow - Basic Concepts (Task)
      - Task
        - An instantiated operator
        - Parametrized with configs
        - Best if idempotent (see the sketch below)
      - Task Instance
        - A run of a task within a DAG run
        - Has a state (‘running’, ‘failed’, ‘success’, etc.)
  20. Airflow - Basic Concepts
      - DAG
        - Basically a workflow definition
        - By default, a downstream task does not execute if an upstream task fails.
        - More complex logic can be applied (e.g., execute the downstream task once 60% of upstream tasks succeed; see the sketch below)
  21. Airflow - Advanced Concepts
      - Dynamic DAG generation (samples on the next two slides)
      - Pools: resource pooling to prevent resource contention (see the sketch below)
      - Backfill: retroactively apply/aggregate/calculate over past data
  22. Airflow - Sample Buzzvil DAG (1)
      - Sync_mysql_redshift
        - Creates DAGs by looping through each sync configuration (sketched below)
        - A simple 5-node DAG
  23. Airflow - Sample Buzzvil DAG (2)
      - Mysql_redshift_migration
        - When the MySQL schema changes, the change should be propagated to Redshift
        - Just add a migration config (a hypothetical example below)