
Apache Airflow DataEngConf SF 2017 Workshop

These slides cover some of the materials explored at the Apache Airflow DataEngConf. The companion repository is at https://github.com/artwr/airflow-workshop-dataengconf-sf-2017

Arthur Wiedmer

April 26, 2017


Transcript

  1. About Me • Data Engineer on the Data Platform Team at Airbnb. • Working on Airflow since 2014. • Apache Airflow committer. • Most of my free time is spent with my wife and our 1-year-old son :)
  2. Why does Airflow exist? • Companies grow to have a complex network of processes with intricate dependencies. • Analytics & batch processing are mission critical. They serve decision makers and power machine learning models that can feed into production. • A lot of time is invested in writing and monitoring jobs and troubleshooting issues.
  3. What is Airflow? An open source platform to author, orchestrate and monitor batch processes • It's the glue that binds your data ecosystem together • It orchestrates tasks in a complex network of job dependencies • It's Python all the way down • It's popular and has a thriving open source community • It's expressive and dynamic; workflows are defined in code
  4. Concepts • Tasks: workflows are composed of tasks, implemented as Operators. • Operators can do pretty much anything that can be run on the Airflow machine. • We tend to classify operators into 3 categories: Sensors, Operators, Transfers.
  5. Airflow @ Airbnb: scale • We currently run 800+ DAGs and about 80k tasks a day. • We have DAGs running at daily, hourly and 10-minute granularities. We also have ad hoc DAGs. • About 100 people @ Airbnb have authored or contributed to a DAG directly, and 500 have contributed to or modified a configuration for one of our frameworks. • We use the Celery executor with Redis as a backend.
  6. Airflow @ Airbnb • Data Warehousing • Experimentation • Growth Analytics • Email Targeting • Sessionization • Search Ranking • Infrastructure Monitoring • Engagement Analytics • Anomaly Detection • Operational Work • Data Exports from/to production
  7. Common Pattern (diagram): an abstracted static config (python / yaml / hocon) and a web app on the input side, an Airflow script defining the data processing workflow, and derived data or alerts & notifications as output.
  12. CumSum: efficient cumulative metrics computation • Live-to-date metrics per subject (user, listing, advertiser, …) are a common pattern • Computing the SUM since the beginning of time is inefficient; it's preferable to add today's metrics to yesterday's total. Outputs • An efficient pipeline • Easy / efficient backfilling capabilities • A centralized table, partitioned by metric and date, documented by code • Allows for efficient time range deltas by scanning 2 partitions
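A minimal sketch of the incremental pattern, assuming a Hive-style metrics_cumulative table partitioned by ds and a daily metrics_daily table (all table and column names here are hypothetical); {{ ds }} and {{ yesterday_ds }} are standard Airflow template macros:

    # Hypothetical HiveQL for the incremental cumulative-sum pattern:
    # add today's metrics to yesterday's running total instead of
    # rescanning all of history.
    cumsum_hql = """
    INSERT OVERWRITE TABLE metrics_cumulative PARTITION (ds = '{{ ds }}')
    SELECT
        COALESCE(t.subject_id, y.subject_id)        AS subject_id,
        COALESCE(t.metric, y.metric)                AS metric,
        COALESCE(y.value, 0) + COALESCE(t.value, 0) AS value
    FROM (SELECT * FROM metrics_daily      WHERE ds = '{{ ds }}')           t
    FULL OUTER JOIN
         (SELECT * FROM metrics_cumulative WHERE ds = '{{ yesterday_ds }}') y
      ON t.subject_id = y.subject_id AND t.metric = y.metric
    """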
  13. Git repository for the workshop • The git repository with course materials is here: https://github.com/artwr/airflow-workshop-dataengconf-sf-2017 • Some of the materials have been ported to Sphinx documentation: https://artwr.github.io/airflow-workshop-dataengconf-sf-2017/
  14. Anatomy of a DAG: Setup • In Python, you must import a few things explicitly; in particular, the DAG class and the operators you want to use.
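For reference, a minimal sketch of those imports (module paths as they were in Airflow 1.x, the version used in this workshop; later versions moved operators around):

    # Minimal imports for a DAG file.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.sensors import TimeDeltaSensor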
  15. Anatomy of a DAG: Default Arguments • Default args contain parameters that apply to all tasks, like the owner, the start_date, and the number of retries. Most of these arguments can be overridden at the task level.
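A hedged sketch of a default_args dict and the DAG that uses it (the owner, email address and dag_id are placeholders):

    # Arguments shared by every task in the DAG; individual operators
    # can override any of them.
    default_args = {
        'owner': 'data-eng',
        'start_date': datetime(2017, 4, 1),
        'retries': 2,
        'retry_delay': timedelta(minutes=5),
        'email': ['data-alerts@example.com'],
        'email_on_failure': True,
    }

    dag = DAG(
        dag_id='workshop_example',
        default_args=default_args,
        schedule_interval='@daily',
    )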
  16. Anatomy of a DAG: Adding operators • We have imported operators. Sensors usually require a time or a particular resource to check. Operators require either a simple command or a path to a script (in the BashOperator below, bash_command could be “path/to/script.sh”).
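Continuing the sketch above, one sensor and one operator (the delay and command are illustrative):

    # A sensor waits on a condition; here, simply an hour past the schedule.
    # An operator does the actual work.
    wait_an_hour = TimeDeltaSensor(
        task_id='wait_an_hour',
        delta=timedelta(hours=1),
        dag=dag,
    )

    run_script = BashOperator(
        task_id='run_script',
        bash_command='echo "processing {{ ds }}"',  # could also point at a script
        dag=dag,
    )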
  17. Anatomy of a DAG: Setting dependencies • Finally, we want to define how operators relate to each other in the DAG. You can store the task objects and use set_upstream or set_downstream. A common pattern for us is to store the dependencies as a dictionary, iterate over the items and use set_dependency.
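Both styles, sketched on the tasks defined above (the dependency dict is illustrative):

    # Explicit methods on the task objects...
    run_script.set_upstream(wait_an_hour)   # or wait_an_hour.set_downstream(run_script)

    # ...or a dictionary of downstream task_id -> upstream task_ids,
    # wired up with DAG.set_dependency.
    dependencies = {'run_script': ['wait_an_hour']}
    for downstream_id, upstream_ids in dependencies.items():
        for upstream_id in upstream_ids:
            dag.set_dependency(upstream_id, downstream_id)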
  18. Airflow Scheduler/Executor • You have a choice of executors, which enable different ways to distribute tasks (see https://airflow.incubator.apache.org/configuration.html): SequentialExecutor, LocalExecutor, CeleryExecutor, MesosExecutor (community contributed). • The SequentialExecutor will only execute one task at a time, in process. • The LocalExecutor uses local processes; the number of processes can be scaled with the machine. • Celery and Mesos are ways to handle multiple worker machines to scale out.
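The executor is chosen in airflow.cfg; a hedged excerpt (values illustrative):

    [core]
    executor = LocalExecutor
    # executor = CeleryExecutor   # needs a broker (e.g. Redis) and worker processes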
  19. Challenges: DAGs are file based • Dynamic DAGs to the rescue. • A common pattern that we use is DAG factories to create DAGs based on configurations (see the sketch below). • The configuration can live in static config files or a database. • One thing to remember is that Airflow is geared towards slowly changing DAGs.
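A hedged sketch of such a factory; the config dict, names and commands are made up, and it assumes the imports and default_args from the anatomy sketch above. The important detail is registering each generated DAG in the module's globals() so the scheduler can discover it:

    # One DAG per config entry.
    CONFIGS = {
        'daily_metrics':  {'schedule': '@daily',  'command': 'echo daily metrics'},
        'hourly_exports': {'schedule': '@hourly', 'command': 'echo hourly exports'},
    }

    def create_dag(name, conf):
        dag = DAG(
            dag_id=name,
            default_args=default_args,
            schedule_interval=conf['schedule'],
        )
        BashOperator(task_id='run', bash_command=conf['command'], dag=dag)
        return dag

    for name, conf in CONFIGS.items():
        globals()[name] = create_dag(name, conf)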
  20. Challenges: State • Propagating state in a distributed system is hard. • Multiple states handle helpful things like automated retries, skipping tasks, and detecting scheduling locks (https://github.com/apache/incubator-airflow/blob/master/airflow/utils/state.py#L26-L57). • We have addressed a decent amount of those issues but are still discovering edge cases.
  21. Challenges: Security • Authentication: there is currently support for LDAP, and pluggable auth is possible. • Authorization: right now mostly based on Flask, usually with 3 levels: not logged in, logged in, superuser. It is possible to hide some pages/views based on this. • Access control: pretty wide right now.
  22. Getting started with deploying Airflow • Usually people start their proof of concept by running the LocalExecutor. • In this case you need a production-ready metadata db like MySQL or Postgres. • The scheduler is still the weakest link; enable service monitoring using something like runit, monit, etc.
  23. Metadata Database • As the number of jobs you run on Airflow increases, so does the load on the Airflow database. It is not uncommon for the Airflow database to require a decent amount of CPU if you execute a large number of concurrent tasks. (We are working on reducing the db load.) • SQLite is used for tutorials but cannot handle concurrent connections. We highly recommend switching to MySQL/MariaDB or Postgres. • Some people have tried other databases, but we cannot currently test against them, so they might break in the future.
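A hedged airflow.cfg excerpt for pointing the metadata database at MySQL or Postgres instead of the default SQLite file (credentials and hosts are placeholders):

    [core]
    sql_alchemy_conn = mysql://airflow:airflow@db-host:3306/airflow
    # sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@db-host:5432/airflow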
  24. Deploying DAGs • Put your DAGs in source control. There are several methods to get them to the worker machines: pulling from an SCM repository with cron, or using a deploy system to unzip an archive of the DAGs. • The main thing to remember is that Python processes will keep the version they have in memory unless specifically refreshed. This can be a problem for a long running web server, where you can see a lag between the web server and what is deployed. A refresh can be triggered via the UI or API.
  25. Monitoring and alerting on your DAGs • Enable the email feature and the EmailOperator/SlackOperator for monitoring completion and failure. • Ease of monitoring will help you keep track of your jobs as their number grows. • Check out the SLA feature to know when your jobs are not completing on time. • If you have more custom needs, Airflow supports arbitrary callbacks in Python on success, failure and retry (see the sketch below).
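A hedged sketch of a failure callback and an SLA on a task, continuing the DAG defined earlier; the alerting body is a placeholder to be wired to Slack, email or paging as needed:

    # Airflow passes the task context to the callback on failure.
    def notify_failure(context):
        ti = context['task_instance']
        print('Task %s failed for %s' % (ti.task_id, context['ds']))
        # e.g. post to Slack or page the on-call rotation here

    monitored = BashOperator(
        task_id='monitored_task',
        bash_command='echo "processing {{ ds }}"',
        on_failure_callback=notify_failure,
        sla=timedelta(hours=2),   # surfaces an SLA miss if the task is late
        dag=dag,
    )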
  26. Best practices for DAG building: Architecture • Try to make your tasks idempotent (drop partition / insert overwrite / delete output files before writing them). Airflow will then be able to handle retrying for you in case of failure. • Common patterns are: Sensor -> Transfer (Extract) -> Transform -> Store results (Load); and Stage transformed data -> run data quality checks -> move to final location.
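A hedged sketch of an idempotent daily load using that pattern; the Hive table names are hypothetical and the dag object is the one defined earlier. Because the task overwrites its own ds partition, retries and backfills are safe:

    from airflow.operators.hive_operator import HiveOperator  # Airflow 1.x path

    load_daily = HiveOperator(
        task_id='load_daily_partition',
        hql="""
            INSERT OVERWRITE TABLE example_db.daily_table PARTITION (ds = '{{ ds }}')
            SELECT * FROM example_db.staging_table WHERE ds = '{{ ds }}'
        """,
        dag=dag,
    )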
  27. Best practices for DAG building: Managing resources • You can set up pools for resource management. Pools are a way to limit the concurrency of expensive tasks across DAGs (for instance running Spark jobs, or accessing an RDBMS). They can be set up via the UI. • If you need specialized workers, the CeleryExecutor allows you to set up different queues and workers consuming different types of tasks. The LocalExecutor does not have this concept, but a similar result can be obtained by sharding DAGs onto separate boxes. • If you use the cgroup task runner, you have the opportunity to limit resource usage (CPU, memory) on a per-task basis.
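A hedged sketch of both knobs on a single task, continuing the DAG above; 'spark_jobs' would be a pool created in the UI and 'highmem' a Celery queue that only specialized workers listen on (both names are illustrative):

    heavy_task = BashOperator(
        task_id='heavy_spark_job',
        bash_command='spark-submit job.py',
        pool='spark_jobs',    # caps concurrency of these tasks across DAGs
        queue='highmem',      # only meaningful with the CeleryExecutor
        dag=dag,
    )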
  28. Configuration as Code! As an alternative to static YAML, JSON or, worse, drag and drop tools • Code is more expressive, powerful & compact • Reusable components (functions, classes, object factories) come naturally in code • An API has a clear specification with defaults, input validation and useful methods • Nothing gets lost in translation: Python is the language of Airflow • The API can be derived/extended as part of the workflow code. Build your own Operators, Hooks, etc. • In its minimal form, it's as simple as static configuration
  29. Quick history of Airflow @ Airbnb • Back in 2014 we were using Chronos, a framework for long-running jobs on top of Mesos. • Defining data dependencies was near impossible. Debugging why data was not landing on time was really difficult. • Max Beauchemin joined Airbnb and was interested in open sourcing an entirely rewritten version of Data Swarm, the job authoring platform at Facebook. • Introduced in Jan 2015 for our main warehouse pipeline. • Open sourced in early 2015, donated to the Apache Foundation for incubation in March 2016.
  30. Apache Airflow • The community is currently working on version 1.8.1rc2, to be released soon. • The focus has been on stability and performance enhancements. • We hope to graduate to Top Level Project this year. • We are looking for contributors. Check out the project and come hack with us.
  31. Airflow Resources • Gitter is fairly active at https://gitter.im/apache/incubator-airflow and has a lot of user-to-user help. • If you have more advanced questions, the dev mailing list at http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/ has the core developers on it. • The documentation is available at https://airflow.incubator.apache.org/ • The project also has a wiki: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home
  32. Airflow Talks • The Bay Area Airflow meetup: https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/ • Matt Davis at PyBay 2016: https://speakerdeck.com/pybay2016/matt-davis-a-practical-introduction-to-airflow • Laura Lorenz at PyData DC 2016, How I learned to time travel, or, data pipelining and scheduling with Airflow: https://www.youtube.com/watch?v=60FUHEkcPyY
  33. Airflow • Gerard Toonstra, a contributor on the mailing lists, has written some thoughts about ETL with Airflow: https://gtoonstra.github.io/etl-with-airflow/
  34. Airflow? An open source platform to author, orchestrate and monitor batch processes • It's the glue that binds your data ecosystem together • It orchestrates tasks in a complex network of job dependencies • It's Python all the way down • It's popular and has a thriving open source community • It's expressive and dynamic; workflows are defined in code
  38. AutoDAG: anyone can schedule a simple query. Behind the scenes, it • Validates your SQL, making sure it parses • Advises against bad SQL patterns • Introspects your code and infers your dependencies on other tables / partitions • Schedules your workflow; Airflow emails you on failure
  41. Engagement & Growth metrics: DAU, WAU, MAU / new, churn, resurrected, stale and active users • COUNT DISTINCT metrics are complex to compute efficiently • Web companies are obsessed with these metrics! • They typically need to be computed for many sub-products and many core dimensions. Behind the scenes • A complex workflow is issued for each entry • “Cubes” the data by running multiple groupings • Joins to the user dimension to gather specified demographics • “Backfills” the data since the activation date • Leaves a useful computational trail for deeper analysis • Runs optimized logic • Cuts the long tail of high cardinality dimensions as specified • Delivers summarized data to use in reports and dashboards
  44. Experimentation: a small portion of the whole experimentation workflow; tasks backing an individual experiment. Conceptually: wait for source partitions, load into the metrics repository, compute atomic data for the experiment, aggregate metric events and compute stats, export the summary to MySQL.
  45. Experimentation: metrics_repo data structures overview (simplified) • metric events: ds (partition), metric_source (partition), userid BIGINT, dimension_map MAP, event_name STRING, value NUMBER • experiment_assignments: ds (partition), experiment STRING, treatment STRING, userid BIGINT, first_exposure_ts STRING • experiment_stats: ds (partition), experiment STRING, treatment_name STRING, control_name STRING, delta DOUBLE, pvalue DOUBLE • …
  46. Experimentation: overlooked complexity in previous slides • users take days or weeks to go through our main flows • cookie -> userid mapping • event-level attributes, dimensional breakdowns • different types of subjects (host, guest, listing, cookie, …) • different types of experimentation (web, mobile, emails, tickets, …) • “themes” are defined as sets of metrics • statistics beyond pvalue and confidence intervals: preventing bias, global impact, time-boxing
  47. Stats Daemon: build database statistics on Hive using Presto • Monitor the Hive metastore's partition table for the last updated timestamp • For each recently modified partition, generate a single scan query that computes loads of metrics: for numeric values, compute MIN, MAX, AVG, SUM, NULL_COUNT, COUNT DISTINCT, …; for strings, count the number of characters, COUNT_DISTINCT, NULL_COUNT, …; based on naming conventions, add more specific rules; whitelist / blacklist namespaces, regexes, … • Load statistics into MySQL • Used for capacity planning, data quality monitoring, debugging, anomaly detection, alerting, … • partition_stats table: cluster STRING, database STRING, table BIGINT, partition STRING, stat_expr STRING, value NUMBER
  48. Other Airflow Frameworks • Anomaly detection • Production MySQL exports • AirOLAP: loads data into druid.io • Email targeting rule engine • Cohort Analysis & user segmentation (prototype) • …