Slide 1

Slide 1 text

Apache Airflow Workshop ARTHUR WIEDMER / APRIL 2017

Slide 2

Slide 2 text

About Me
• Data Engineer on the Data Platform Team at Airbnb.
• Working on Airflow since 2014.
• Apache Airflow committer.
• Most of my free time is spent with my wife and our 1 year-old son :)

Slide 3

Slide 3 text

Introductions

Slide 4

Slide 4 text

A quick intro to Airflow

Slide 5

Slide 5 text

What is Airflow? What can Airflow do for you?

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Airflow?

Slide 8

Slide 8 text

Airflow?

Slide 9

Slide 9 text

Why does Airflow exist?
• Companies grow to have a complex network of processes that have intricate dependencies.
• Analytics & batch processing are mission critical. They serve decision makers and power machine learning models that can feed into production.
• There is a lot of time invested in writing and monitoring jobs and troubleshooting issues.

Slide 10

Slide 10 text

What is Airflow?
An open source platform to author, orchestrate and monitor batch processes
• It’s the glue that binds your data ecosystem together
• It orchestrates tasks in a complex network of job dependencies
• It’s Python all the way down
• It’s popular and has a thriving open source community
• It’s expressive and dynamic; workflows are defined in code

Slide 11

Slide 11 text

Concepts
• Workflows are called DAGs, for Directed Acyclic Graphs.

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Concepts
• Tasks: workflows are composed of tasks called Operators.
• Operators can do pretty much anything that can be run on the Airflow machine.
• We tend to classify operators into 3 categories: Sensors, Operators, Transfers.

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Setting dependencies t2.set_upstream(t1)
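The same dependency can be written from either end; a minimal sketch, assuming t1 and t2 are already-instantiated tasks (newer releases, 1.8 and up, also accept the bitshift form t1 >> t2):

t2.set_upstream(t1)    # t1 must complete before t2 starts
t1.set_downstream(t2)  # equivalent, expressed from t1's side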

Slide 18

Slide 18 text

Architecture

Slide 19

Slide 19 text

Metadata DB Architecture

Slide 20

Slide 20 text

Scheduler Metadata DB Architecture

Slide 21

Slide 21 text

Scheduler Metadata DB Worker Architecture

Slide 22

Slide 22 text

Scheduler Metadata DB Webserver Worker Architecture

Slide 23

Slide 23 text

Scheduler Metadata DB Webserver Worker Code repository Architecture

Slide 24

Slide 24 text

Scheduler Metadata DB Webserver Worker Code repository Message Queue (Celery) Architecture

Slide 25

Slide 25 text

Scheduler Metadata DB Webserver Worker Code repository Message Queue (Celery) Architecture Worker

Slide 26

Slide 26 text

Scheduler Metadata DB Webserver Worker Code repository Message Queue (Celery) Architecture Worker Worker

Slide 27

Slide 27 text

What can Airflow do for you?

Slide 28

Slide 28 text

Monitoring

Slide 29

Slide 29 text

Monitoring DAG Status

Slide 30

Slide 30 text

Monitoring DAG Status

Slide 31

Slide 31 text

Monitoring Gantt Chart style

Slide 32

Slide 32 text

Monitoring Analytics

Slide 33

Slide 33 text

Scale

Slide 34

Slide 34 text

Airflow @ Airbnb: scale
• We currently run 800+ DAGs and about 80k tasks a day.
• We have DAGs running at daily, hourly and 10-minute granularities. We also have ad hoc DAGs.
• About 100 people at Airbnb have authored or contributed to a DAG directly, and 500 have contributed to or modified a configuration for one of our frameworks.
• We use the Celery executor with Redis as a backend.

Slide 35

Slide 35 text

Flexibility and Extensibility

Slide 36

Slide 36 text

Airflow @ Airbnb
• Data Warehousing
• Experimentation
• Growth Analytics
• Email Targeting
• Sessionization
• Search Ranking
• Infrastructure Monitoring
• Engagement Analytics
• Anomaly Detection
• Operational Work
• Data Exports from/to production

Slide 37

Slide 37 text

Common Pattern (diagram): Input (web app or abstracted static config in Python / YAML / HOCON) → Data Processing Workflow (Airflow script) → Output (derived data, or alerts & notifications)

Slide 38

Slide 38 text

Common Pattern (diagram): Input (web app or abstracted static config in Python / YAML / HOCON) → Data Processing Workflow (Airflow script) → Output (derived data, or alerts & notifications)

Slide 39

Slide 39 text

Common Pattern (diagram): Input (web app or abstracted static config in Python / YAML / HOCON) → Data Processing Workflow (Airflow script) → Output (derived data, or alerts & notifications)

Slide 40

Slide 40 text

Common Pattern (diagram): Input (web app or abstracted static config in Python / YAML / HOCON) → Data Processing Workflow (Airflow script) → Output (derived data, or alerts & notifications)

Slide 41

Slide 41 text

CumSum: efficient cumulative metrics computation
• Live-to-date metrics per subject (user, listing, advertiser, …) are a common pattern
• Computing the SUM since the beginning of time is inefficient; it’s preferable to add today’s metrics to yesterday’s total

Slide 42

Slide 42 text

CumSum: efficient cumulative metrics computation
• Live-to-date metrics per subject (user, listing, advertiser, …) are a common pattern
• Computing the SUM since the beginning of time is inefficient; it’s preferable to add today’s metrics to yesterday’s total (see the sketch below)
Outputs
• An efficient pipeline
• Easy / efficient backfilling capabilities
• A centralized table, partitioned by metric and date, documented by code
• Allows for efficient time-range deltas by scanning 2 partitions
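A minimal sketch of the incremental idea, not the actual CumSum framework: table and column names are hypothetical, and it assumes a dag object plus the HiveOperator (from airflow.operators.hive_operator). The templated {{ ds }} and {{ yesterday_ds }} macros let each run add one day's delta to the previous running total instead of rescanning all history:

cumsum_bookings = HiveOperator(
    task_id='cumsum_bookings',
    hql="""
        INSERT OVERWRITE TABLE agg.bookings_to_date PARTITION (ds='{{ ds }}')
        SELECT COALESCE(prev.userid, today.userid)              AS userid,
               COALESCE(prev.total, 0) + COALESCE(today.cnt, 0) AS total
        FROM (SELECT userid, total FROM agg.bookings_to_date
              WHERE ds = '{{ yesterday_ds }}') prev
        FULL OUTER JOIN
             (SELECT userid, COUNT(1) AS cnt FROM core.bookings
              WHERE ds = '{{ ds }}' GROUP BY userid) today
        ON prev.userid = today.userid
    """,
    dag=dag,
)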

Slide 43

Slide 43 text

Anatomy of a DAG

Slide 44

Slide 44 text

Git Repository for the workshop
• The git repository with course materials is here: https://github.com/artwr/airflow-workshop-dataengconf-sf-2017
• Some of the materials have been ported to Sphinx documentation: https://artwr.github.io/airflow-workshop-dataengconf-sf-2017/

Slide 45

Slide 45 text

Anatomy of a DAG: Setup • In Python, you must import a few things explicitly. In particular, the DAG class and the operators you want to use.
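A minimal setup sketch, assuming the module layout of Airflow 1.x as it existed at the time of the workshop (operator import paths moved in later releases):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import TimeDeltaSensor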

Slide 46

Slide 46 text

Anatomy of a DAG: Default Arguments • Default args contain some parameters that you can use for all tasks, like the owner, the start_date, the number of retries. Most of these arguments can be overridden at the task level.
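Continuing the sketch above, a hypothetical default_args dictionary; every key is a standard task argument, and any of them can be overridden on an individual operator:

default_args = {
    'owner': 'data-eng',                    # hypothetical owning team
    'depends_on_past': False,
    'start_date': datetime(2017, 4, 1),
    'email': ['data-alerts@example.com'],   # hypothetical alert address
    'email_on_failure': True,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}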

Slide 47

Slide 47 text

Anatomy of a DAG: DAG definition • Create a DAG
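Continuing the sketch, the DAG object ties the default arguments to a schedule; the dag_id and interval here are illustrative:

dag = DAG(
    dag_id='workshop_example',
    default_args=default_args,
    schedule_interval='@daily',   # cron expressions such as '0 3 * * *' also work
)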

Slide 48

Slide 48 text

Anatomy of a DAG: Adding operators • We have imported operators. Sensors usually require a time or a particular resource to check. Operators will require either a simple command or a path to a script (in the BashOperator below, bash_command could be “path/to/script.sh”).
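Continuing the sketch, one sensor and one operator attached to the DAG; the commands are placeholders:

wait_an_hour = TimeDeltaSensor(
    task_id='wait_an_hour',
    delta=timedelta(hours=1),          # wait one hour past the scheduled period
    dag=dag,
)

process_data = BashOperator(
    task_id='process_data',
    bash_command='echo "processing"',  # could be "path/to/script.sh " (the trailing
                                       # space keeps Jinja from treating it as a template)
    dag=dag,
)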

Slide 49

Slide 49 text

Anatomy of a DAG: Setting dependencies • Finally we want to define how operators relate to each other in the DAG. You can choose to store the task objects and use set_upstream or set_downstream. A common pattern for us is to store the dependencies as a dictionary, iterate over the items and use set_dependency.
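Continuing the sketch, both wiring styles; the dependency dictionary mirrors the pattern described above:

# Direct style: keep the task objects around and wire them explicitly.
process_data.set_upstream(wait_an_hour)

# Dictionary style: map each task id to the task ids it depends on,
# then iterate and let the DAG resolve the ids.
dependencies = {
    'process_data': ['wait_an_hour'],
}
for downstream_id, upstream_ids in dependencies.items():
    for upstream_id in upstream_ids:
        dag.set_dependency(upstream_id, downstream_id)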

Slide 50

Slide 50 text

Architecture

Slide 51

Slide 51 text

Architecture

Slide 52

Slide 52 text

Metadata DB Architecture

Slide 53

Slide 53 text

Scheduler Metadata DB Architecture

Slide 54

Slide 54 text

Scheduler Metadata DB Worker Architecture

Slide 55

Slide 55 text

Scheduler Metadata DB Webserver Worker Architecture

Slide 56

Slide 56 text

Scheduler Metadata DB Webserver Worker Code repository Architecture

Slide 57

Slide 57 text

Scheduler Metadata DB Webserver Worker Code repository Message Queue (Celery) Architecture

Slide 58

Slide 58 text

Scheduler Metadata DB Webserver Worker Code repository Message Queue (Celery) Architecture Worker

Slide 59

Slide 59 text

Scheduler Metadata DB Webserver Worker Code repository Message Queue (Celery) Architecture Worker Worker

Slide 60

Slide 60 text

Scheduler/Executor
• You have a choice of Executors, which enable different ways to distribute tasks (see https://airflow.incubator.apache.org/configuration.html):
  - SequentialExecutor
  - LocalExecutor
  - CeleryExecutor
  - MesosExecutor (community contributed)
• The SequentialExecutor will only execute one task at a time, in process.
• The LocalExecutor uses local processes. The number of processes can be scaled with the machine.
• Celery and Mesos are ways to handle multiple worker machines to scale out.

Slide 61

Slide 61 text

Challenges: DAGs are file based
• Dynamic DAGs to the rescue.
• A common pattern we use is DAG factories that create DAGs based on configurations (see the sketch below).
• The configuration can live in static config files or a database.
• One thing to remember is that Airflow is geared towards slowly changing DAGs.
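A minimal sketch of the DAG-factory idea under these assumptions: the config list is hard-coded here but could come from YAML or a database, and the generated DAGs are registered in module globals so the scheduler can discover them:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative configuration entries.
CONFIGS = [
    {'table': 'bookings', 'schedule': '@daily'},
    {'table': 'listings', 'schedule': '@hourly'},
]

def create_dag(conf):
    dag = DAG(
        dag_id='load_{}'.format(conf['table']),
        start_date=datetime(2017, 4, 1),
        schedule_interval=conf['schedule'],
    )
    BashOperator(
        task_id='load',
        bash_command='echo "loading {}"'.format(conf['table']),
        dag=dag,
    )
    return dag

# The scheduler only sees DAGs bound to module-level names,
# so each generated DAG is injected into globals().
for conf in CONFIGS:
    generated = create_dag(conf)
    globals()[generated.dag_id] = generated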

Slide 62

Slide 62 text

Challenges: State
• Propagating state in a distributed system is hard.
• There are multiple states to handle helpful things like automated retries, skipping tasks, and detecting scheduling locks (https://github.com/apache/incubator-airflow/blob/master/airflow/utils/state.py#L26-L57).
• We have addressed a decent amount of those issues but are still discovering edge cases.

Slide 63

Slide 63 text

Challenges: Security
• Authentication: currently support for LDAP; pluggable auth is possible.
• Authorization: right now mostly based on Flask.
  - Usually 3 levels: not logged in, logged in, superuser.
  - It is possible to hide some pages/views based on this.
• Access control: pretty wide right now.

Slide 64

Slide 64 text

What should you know to get started?

Slide 65

Slide 65 text

Best Practices for deployment

Slide 66

Slide 66 text

Getting Started with deploying Airflow
• Usually people start their proof of concept by running the LocalExecutor.
• In this case you need a production-ready metadata DB like MySQL or Postgres.
• The scheduler is still the weakest link; enable service monitoring using something like runit, monit, etc.

Slide 67

Slide 67 text

Metadata Database
• As the number of jobs you run on Airflow increases, so does the load on the Airflow database. It is not uncommon for the Airflow database to require a decent amount of CPU if you execute a large number of concurrent tasks. (We are working on reducing the db load.)
• SQLite is used for tutorials but cannot handle concurrent connections. We highly recommend switching to MySQL/MariaDB or Postgres.
• Some people have tried other databases, but we cannot currently test against them, so they might break in the future.

Slide 68

Slide 68 text

Deploying DAGs
• Put your DAGs in source control. There are several methods to get them to the worker machines:
  - Pulling from an SCM repository with cron.
  - Using a deploy system to unzip an archive of the DAGs.
• The main thing to remember is that Python processes keep the version they have in memory unless specifically refreshed. This can be a problem for a long-running web server, where you can see a lag between the web server and what is deployed. A refresh can be triggered via the UI or API.

Slide 69

Slide 69 text

Best Practices for Pipelines

Slide 70

Slide 70 text

Monitoring and Alerting on your DAGs
• Enable the email feature and the EmailOperator/SlackOperator for monitoring completion and failure.
• Ease of monitoring will help you keep track of your jobs as their number grows.
• Check out the SLA feature to know when your jobs are not completing on time.
• If you have more custom needs, Airflow supports arbitrary callbacks in Python on success, failure and retry.
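A minimal sketch of these alerting hooks: the email address is hypothetical and the callback is a hand-rolled placeholder where a page or chat message could be sent; all keys are standard task arguments passed through default_args:

from datetime import datetime, timedelta

def notify_on_failure(context):
    # The context dict carries the task instance, execution date, etc.
    print('Task {} failed for {}'.format(
        context['task_instance'].task_id, context['execution_date']))

default_args = {
    'owner': 'data-eng',
    'start_date': datetime(2017, 4, 1),
    'email': ['data-alerts@example.com'],  # hypothetical address
    'email_on_failure': True,
    'sla': timedelta(hours=6),             # flag tasks that have not finished on time
    'on_failure_callback': notify_on_failure,
}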

Slide 71

Slide 71 text

Best Practices about DAG building: Architecture
• Try to make your tasks idempotent (drop partition / insert overwrite / delete output files before writing them); see the sketch after this list. Airflow will then be able to handle retrying for you in case of failure.
• Common patterns are:
  - Sensor -> Transfer (Extract) -> Transform -> Store results (Load)
  - Stage transformed data -> run data quality checks -> move to final location.
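A minimal sketch of the idempotent pattern, assuming the BashOperator import and dag object from the Anatomy sketch; the job name and output path are hypothetical. Because the dated output is removed before being rewritten, a retry or backfill of the same {{ ds }} produces the same result as a first run:

cleanup_and_write = BashOperator(
    task_id='cleanup_and_write',
    bash_command=(
        'rm -rf /data/output/{{ ds }} && '                     # drop partial output from a failed run
        'my_job --date {{ ds }} --out /data/output/{{ ds }}'   # hypothetical processing job
    ),
    dag=dag,
)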

Slide 72

Slide 72 text

Best Practices about DAG building: Managing resources
• You can set up pools for resource management. Pools are a way to limit the concurrency of expensive tasks across DAGs (for instance running Spark jobs, or accessing an RDBMS). They can be set up via the UI; see the sketch after this list.
• If you need specialized workers, the CeleryExecutor allows you to set up different queues, with workers consuming different types of tasks. The LocalExecutor does not have this concept, but a similar result can be obtained by sharding DAGs across separate boxes.
• If you use the cgroup task runner, you have the opportunity to limit resource usage (CPU, memory) on a per-task basis.
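A minimal sketch, again assuming the dag object from earlier; the 'sparkcluster' pool and 'spark_workers' queue are hypothetical and would need to be created first (pools via the UI, queues on the Celery workers):

run_spark_job = BashOperator(
    task_id='run_spark_job',
    bash_command='spark-submit job.py ',  # illustrative command
    pool='sparkcluster',                  # caps concurrent Spark tasks across DAGs
    queue='spark_workers',                # routes the task to dedicated Celery workers
    dag=dag,
)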

Slide 73

Slide 73 text

Configuration as Code!
As an alternative to static YAML, JSON or, worse, drag-and-drop tools
• Code is more expressive, powerful & compact
• Reusable components (functions, classes, object factories) come naturally in code
• An API has a clear specification with defaults, input validation and useful methods
• Nothing gets lost in translation: Python is the language of Airflow
• The API can be derived/extended as part of the workflow code. Build your own Operators, Hooks, etc.
• In its minimal form, it’s as simple as static configuration

Slide 74

Slide 74 text

The future of Airflow

Slide 75

Slide 75 text

Quick history of Airflow @ Airbnb
• Back in 2014 we were using Chronos, a framework for long-running jobs on top of Mesos.
• Defining data dependencies was near impossible. Debugging why data was not landing on time was really difficult.
• Max Beauchemin joined Airbnb and was interested in open sourcing an entirely rewritten version of Data Swarm, the job authoring platform at Facebook.
• Introduced in Jan 2015 for our main warehouse pipeline.
• Open sourced in early 2015, donated to the Apache Foundation for incubation in March 2016.

Slide 76

Slide 76 text

Apache Airflow
• The community is currently working on version 1.8.1rc2, to be released soon.
• The focus has been on stability and performance enhancements.
• We hope to graduate to a Top Level Project this year.
• We are looking for contributors. Check out the project and come hack with us.

Slide 77

Slide 77 text

Resources

Slide 78

Slide 78 text

Airflow Resources
• Gitter is fairly active at https://gitter.im/apache/incubator-airflow and has a lot of user-to-user help.
• If you have more advanced questions, the dev mailing list at http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/ has the core developers on it.
• The documentation is available at https://airflow.incubator.apache.org/
• The project also has a wiki: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home

Slide 79

Slide 79 text

Airflow Talks
• The Bay Area Airflow meetup: https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/
• Matt Davis at PyBay 2016: https://speakerdeck.com/pybay2016/matt-davis-a-practical-introduction-to-airflow
• Laura Lorenz at PyData DC 2016, “How I learned to time travel, or, data pipelining and scheduling with Airflow”: https://www.youtube.com/watch?v=60FUHEkcPyY

Slide 80

Slide 80 text

Airflow
• Gerard Toonstra, a contributor on the mailing lists, has written some thoughts about ETL with Airflow: https://gtoonstra.github.io/etl-with-airflow/
• Laura Lorenz at PyData DC 2016, “How I learned to time travel, or, data pipelining and scheduling with Airflow”: https://www.youtube.com/watch?v=60FUHEkcPyY

Slide 81

Slide 81 text

Questions?

Slide 82

Slide 82 text

Other Frameworks built on the Airflow Platform

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

Airflow?

Slide 85

Slide 85 text

Airflow?

Slide 86

Slide 86 text

Airflow?
An open source platform to author, orchestrate and monitor batch processes
• It’s the glue that binds your data ecosystem together
• It orchestrates tasks in a complex network of job dependencies
• It’s Python all the way down
• It’s popular and has a thriving open source community
• It’s expressive and dynamic; workflows are defined in code

Slide 87

Slide 87 text

Airflow?
An open source platform to author, orchestrate and monitor batch processes
• It’s the glue that binds your data ecosystem together
• It orchestrates tasks in a complex network of job dependencies
• It’s Python all the way down
• It’s popular and has a thriving open source community
• It’s expressive and dynamic; workflows are defined in code

Slide 88

Slide 88 text

Airflow?
An open source platform to author, orchestrate and monitor batch processes
• It’s the glue that binds your data ecosystem together
• It orchestrates tasks in a complex network of job dependencies
• It’s Python all the way down
• It’s popular and has a thriving open source community
• It’s expressive and dynamic; workflows are defined in code

Slide 89

Slide 89 text

Airflow?
An open source platform to author, orchestrate and monitor batch processes
• It’s the glue that binds your data ecosystem together
• It orchestrates tasks in a complex network of job dependencies
• It’s Python all the way down
• It’s popular and has a thriving open source community
• It’s expressive and dynamic; workflows are defined in code

Slide 90

Slide 90 text

AutoDAG: anyone can schedule a simple query

Slide 91

Slide 91 text

AutoDAG: anyone can schedule a simple query
Behind the scenes
• Validates your SQL, makes sure it parses
• Advises against bad SQL patterns
• Introspects your code and infers your dependencies on other tables / partitions
• Schedules your workflow; Airflow emails you on failure

Slide 92

Slide 92 text

AutoDAG: anyone can schedule a simple query
Behind the scenes
• Validates your SQL, makes sure it parses
• Advises against bad SQL patterns
• Introspects your code and infers your dependencies on other tables / partitions
• Schedules your workflow; Airflow emails you on failure

Slide 93

Slide 93 text

Engagement & Growth metrics
DAU, WAU, MAU / new, churn, resurrected, stale and active users
• COUNT DISTINCT metrics are complex to compute efficiently
• Web companies are obsessed with these metrics!
• Typically needs to be computed for many sub-products and many core dimensions

Slide 94

Slide 94 text

Engagement & Growth metrics
DAU, WAU, MAU / new, churn, resurrected, stale and active users
• COUNT DISTINCT metrics are complex to compute efficiently
• Web companies are obsessed with these metrics!
• Typically needs to be computed for many sub-products and many core dimensions
Behind the scenes
• Each config entry translates into a complex workflow
• “Cubes” the data by running multiple groupings
• Joins to the user dimension to gather specified demographics
• “Backfills” the data since the activation date
• Leaves a useful computational trail for deeper analysis
• Runs optimized logic
• Cuts the long tail of high-cardinality dimensions as specified
• Delivers summarized data to use in reports and dashboards

Slide 95

Slide 95 text

CumSum: efficient cumulative metrics computation
• Live-to-date metrics per subject (user, listing, advertiser, …) are a common pattern
• Computing the SUM since the beginning of time is inefficient; it’s preferable to add today’s metrics to yesterday’s total

Slide 96

Slide 96 text

CumSum: efficient cumulative metrics computation
• Live-to-date metrics per subject (user, listing, advertiser, …) are a common pattern
• Computing the SUM since the beginning of time is inefficient; it’s preferable to add today’s metrics to yesterday’s total
Outputs
• An efficient pipeline
• Easy / efficient backfilling capabilities
• A centralized table, partitioned by metric and date, documented by code
• Allows for efficient time-range deltas by scanning 2 partitions

Slide 97

Slide 97 text

Experimentation: A/B testing at scale (simplified)
• Define user metrics as SQL
• Configure your experiments

Slide 98

Slide 98 text

Experimentation: A/B testing at scale (simplified)
• Define user metrics as SQL
• Configure your experiments

Slide 99

Slide 99 text

Experimentation: a small portion of the whole experimentation workflow; tasks backing an individual experiment. Conceptually: wait for source partitions, load into the metrics repository, compute atomic data for the experiment, aggregate metric events and compute stats, export a summary to MySQL.

Slide 100

Slide 100 text

Experimentation: data structures overview (simplified)
• metrics_repo: ds (partition), metric_source (partition), userid BIGINT, dimension_map MAP, event_name STRING, value NUMBER
• experiment_assignments: ds (partition), experiment STRING, treatment STRING, userid BIGINT, first_exposure_ts STRING
• experiment_stats: ds (partition), experiment STRING, treatment_name STRING, control_name STRING, delta DOUBLE, pvalue DOUBLE
• …

Slide 101

Slide 101 text

Experimentation: overlooked complexity in previous slides
• users take days or weeks to go through our main flows
• cookie -> userid mapping
• event-level attributes, dimensional breakdowns
• different types of subjects (host, guests, listing, cookie, …)
• different types of experimentation (web, mobile, emails, tickets, …)
• “themes” are defined as sets of metrics
• statistics beyond pvalue and confidence intervals: preventing bias, global impact, time-boxing

Slide 102

Slide 102 text

Stats Daemon: build database statistics on Hive using Presto
• Monitor the Hive metastore’s partition table for the last updated timestamp
• For each recently modified partition, generate a single scan query that computes loads of metrics:
  * for numeric values, compute MIN, MAX, AVG, SUM, NULL_COUNT, COUNT DISTINCT, …
  * for strings, count the number of characters, COUNT_DISTINCT, NULL_COUNT, …
  * based on naming conventions, add more specific rules
  * whitelist / blacklist namespaces, regexes, …
• Load statistics into MySQL
• Used for capacity planning, data quality monitoring, debugging, anomaly detection, alerting, …
partition_stats: cluster STRING, database STRING, table BIGINT, partition STRING, stat_expr STRING, value NUMBER

Slide 103

Slide 103 text

Other Airflow Frameworks
• Anomaly detection
• Production MySQL exports
• AirOLAP: loads data into druid.io
• Email targeting rule engine
• Cohort Analysis & user segmentation (prototype)
• …

Slide 104

Slide 104 text

No content