that have intricate dependencies. • Analytics & batch processing are mission-critical. They serve decision makers and power machine learning models that can feed into production. • A lot of time is invested in writing and monitoring jobs and troubleshooting issues. Why does Airflow exist?
processes • It’s the glue that binds your data ecosystem together • It orchestrates tasks in a complex network of job dependencies • It’s Python all the way down • It’s popular and has a thriving open source community • It’s expressive and dynamic: workflows are defined in code What is Airflow?
• Operators can do pretty much anything that can be run on the Airflow machine. • We tend to classify operators into 3 categories: Sensors, Operators, Transfers (see the sketch below).
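A minimal sketch with one example per category; the import paths assume the Airflow 1.x-era module layout this workshop targets.

```python
# One illustrative operator per category (module paths follow the Airflow 1.x layout).
from airflow.operators.sensors import HivePartitionSensor        # Sensor: waits for a condition to be true
from airflow.operators.bash_operator import BashOperator         # Operator: performs an action
from airflow.operators.mysql_to_hive import MySqlToHiveTransfer  # Transfer: moves data from one system to another
```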
DAGs and about 80k tasks a day. • We have DAGs running at daily, hourly and 10-minute granularities. We also have ad hoc DAGs. • About 100 people @ Airbnb have authored or contributed to a DAG directly, and 500 have contributed to or modified a configuration for one of our frameworks. • We use the Celery executor with Redis as a backend.
per subject (user, listings, advertiser, …) are a common pattern • Computing the SUM since the beginning of time is inefficient; it’s preferable to add today’s metrics to yesterday’s total (see the sketch below) Outputs • An efficient pipeline • Easy / efficient backfilling capabilities • A centralized table, partitioned by metric and date, documented by code • Allows for efficient time-range deltas by scanning 2 partitions
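A hedged sketch of the incremental pattern: the table names cumsum_metrics and metrics_delta are made up, and {{ ds }} / {{ yesterday_ds }} are standard Airflow template macros for the execution date and the day before.

```python
# Hypothetical HQL: today's cumulative partition = yesterday's cumulative partition + today's delta.
incremental_hql = """
INSERT OVERWRITE TABLE cumsum_metrics PARTITION (ds='{{ ds }}')
SELECT subject_id, metric, SUM(value) AS value
FROM (
    SELECT subject_id, metric, value FROM cumsum_metrics WHERE ds = '{{ yesterday_ds }}'
    UNION ALL
    SELECT subject_id, metric, value FROM metrics_delta WHERE ds = '{{ ds }}'
) unioned
GROUP BY subject_id, metric
"""
```

The delta between any two dates then only requires scanning the two cumulative partitions at the boundaries.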
course materials are here: • https://github.com/artwr/airflow-workshop-dataengconf-sf-2017 • Some of the materials have been ported to Sphinx documentation: • https://artwr.github.io/airflow-workshop-dataengconf-sf-2017/
some parameters that you can use for all tasks, like the owner, the start_date, or the number of retries. Most of these arguments can be overridden at the task level (see the sketch below).
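A minimal sketch, with placeholder owner, dates and intervals:

```python
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "data-eng",                 # placeholder owner
    "start_date": datetime(2017, 4, 1),  # placeholder start date
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# Every task attached to this DAG inherits default_args unless it overrides them.
dag = DAG("example_dag", default_args=default_args, schedule_interval="@daily")
```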
operators. Sensors usually require a time or a particular resource to check. Operators will require either a simple command or a path to a script (in the BashOperator below, bash_command could be “path/to/script.sh”).
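A hedged sketch of both kinds; the table, partition and script path are placeholders, and the import paths again assume the Airflow 1.x layout.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.sensors import HivePartitionSensor
from airflow.operators.bash_operator import BashOperator

dag = DAG("sensing_example", start_date=datetime(2017, 4, 1), schedule_interval="@daily")

# Sensor: poke the Hive metastore until the partition for the execution date exists.
wait_for_events = HivePartitionSensor(
    task_id="wait_for_events",
    table="events",
    partition="ds='{{ ds }}'",
    dag=dag,
)

# Operator: run a shell script once the data has landed.
process_events = BashOperator(
    task_id="process_events",
    bash_command="path/to/script.sh ",  # trailing space keeps Jinja from treating the .sh path as a template file
    dag=dag,
)

wait_for_events.set_downstream(process_events)
```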
to define how operators relate to each other in the DAG. You can choose to store the task objects and use set_upstream or set_downstream. A common pattern for us is to store the dependencies as a dictionary, iterate over the items and use set_dependency (see the sketch below).
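A hedged sketch of both styles; the task names are placeholders and DummyOperator simply stands in for real work.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("deps_example", start_date=datetime(2017, 4, 1), schedule_interval="@daily")

tasks = {
    name: DummyOperator(task_id=name, dag=dag)
    for name in ("extract", "transform", "load", "quality_check")
}

# Style 1: wire the task objects directly.
tasks["extract"].set_downstream(tasks["transform"])

# Style 2: keep the edges in a dictionary and let the DAG wire them up by task_id.
dependencies = {
    "transform": ["load"],
    "load": ["quality_check"],
}
for upstream, downstream_list in dependencies.items():
    for downstream in downstream_list:
        dag.set_dependency(upstream, downstream)
```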
different ways to distribute tasks (see https://airflow.incubator.apache.org/configuration.html#): • SequentialExecutor • LocalExecutor • CeleryExecutor • MesosExecutor (community contributed) •The SequentialExecutor will only execute one task at a time, in process. •The LocalExecutor uses local processes. The number of processes can be scaled with the machine. •Celery and Mesos are ways to handle multiple worker machines to scale out.
•A common pattern we use is DAG factories that create DAGs based on configurations (see the sketch below). •The configuration can live in static config files or a database. •One thing to remember is that Airflow is geared towards slowly changing DAGs.
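A minimal sketch of the factory pattern; the config shape and names are made up, and a real configuration would typically live outside the DAG file.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Static configuration; this could also be read from YAML files or a database table.
CONFIGS = {
    "reporting_daily": {"schedule": "@daily", "command": "echo build daily reports"},
    "reporting_hourly": {"schedule": "@hourly", "command": "echo build hourly reports"},
}

def create_dag(dag_id, conf):
    dag = DAG(dag_id, start_date=datetime(2017, 4, 1), schedule_interval=conf["schedule"])
    BashOperator(task_id="run", bash_command=conf["command"], dag=dag)
    return dag

# The scheduler discovers DAG objects at module level, so register each generated DAG in globals().
for dag_id, conf in CONFIGS.items():
    globals()[dag_id] = create_dag(dag_id, conf)
```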
•Multiple states handle helpful things like automated retries, skipping tasks, and detecting scheduling locks. (https://github.com/apache/incubator-airflow/blob/master/airflow/utils/state.py#L26-L57) •We have addressed a decent number of those issues but are still discovering edge cases.
pluggable auth possible. •Authorization: right now mostly based on Flask. - Usually 3 levels: Not logged in, Logged in, Superuser. - It is possible to hide some pages/views based on this. • Access control: quite permissive right now.
of concept by running the LocalExecutor. •In this case you need a production-ready metadata db like MySQL or Postgres. •The scheduler is still the weakest link; enable service monitoring using something like runit, monit, etc.
Airflow increases, so does the load on the Airflow database. It is not uncommon for the Airflow database to require a decent amount of CPU if you execute a large number of concurrent tasks. (We are working on reducing the db load.) •SQLite is used for tutorials but cannot handle concurrent connections. We highly recommend switching to MySQL/MariaDB or Postgres. •Some people have tried other databases, but we cannot currently test against them, so they might break in the future.
several methods to get them to the worker machines: - Pulling from an SCM repository with cron. - Using a deploy system to unzip an archive of the DAGs. •The main thing to remember is that Python processes will keep the version they have in memory unless specifically refreshed. This can be a problem for a long-running web server, where you can see a lag between the web server and what is deployed. A refresh can be triggered via the UI or API.
and EmailOperator/SlackOperator for monitoring completion and failure. •Ease of monitoring will help you keep track of your jobs as their number grows. •Check out the SLA feature to know when your jobs are not completing on time. •If you have more custom needs, Airflow supports arbitrary Python callbacks on success, failure and retry (see the sketch below).
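A hedged sketch of a failure callback plus an SLA; notify_on_failure is a hypothetical helper, not an Airflow API, and a real one would post to Slack or a pager rather than print.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("callbacks_example", start_date=datetime(2017, 4, 1), schedule_interval="@daily")

def notify_on_failure(context):
    # The context dict carries the task instance, execution date, exception, etc.
    ti = context["task_instance"]
    print("Task %s.%s failed for %s" % (ti.dag_id, ti.task_id, context["execution_date"]))

load = BashOperator(
    task_id="load_events",
    bash_command="echo loading",
    retries=2,
    sla=timedelta(hours=2),                 # shows up in the SLA misses view and SLA emails
    on_failure_callback=notify_on_failure,  # on_success_callback and on_retry_callback also exist
    dag=dag,
)
```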
tasks idempotent (drop partition / insert overwrite / delete output files before writing them). Airflow will then be able to handle retrying for you in case of failure (see the sketch below). •Common patterns are: - Sensor -> Transfer (Extract) -> Transform -> Store results (Load) - Stage transformed data -> run data quality checks -> move to final location.
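A minimal sketch of the insert-overwrite flavor of idempotency, assuming a Hive setup; the table and column names are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

dag = DAG("idempotent_example", start_date=datetime(2017, 4, 1), schedule_interval="@daily")

# Rerunning this task for a given day rewrites the same partition instead of appending,
# so retries and backfills are safe.
load_daily_agg = HiveOperator(
    task_id="load_daily_agg",
    hql="""
        INSERT OVERWRITE TABLE agg_events PARTITION (ds='{{ ds }}')
        SELECT user_id, COUNT(1) AS n_events
        FROM events
        WHERE ds = '{{ ds }}'
        GROUP BY user_id
    """,
    dag=dag,
)
```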
pools for resource management. Pools are a way to limit the concurrency of expensive tasks across DAGs (for instance running Spark jobs, or accessing an RDBMS). They can be set up via the UI (see the sketch below). •If you need specialized workers, the CeleryExecutor allows you to set up different queues with workers consuming different types of tasks. The LocalExecutor does not have this concept, but a similar result can be obtained by sharding DAGs across separate boxes. •If you use the cgroup task runner, you have the opportunity to limit resource usage (CPU, memory) on a per-task basis.
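A minimal sketch, assuming a pool named mysql_pool has been created in the UI and that dedicated Celery workers listen on a db_workers queue; both names are made up.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("pools_example", start_date=datetime(2017, 4, 1), schedule_interval="@daily")

extract = BashOperator(
    task_id="extract_from_mysql",
    bash_command="echo extracting",
    pool="mysql_pool",    # caps how many such tasks run at once, across all DAGs
    queue="db_workers",   # only Celery workers started with `airflow worker -q db_workers` pick this up
    dag=dag,
)
```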
or worse: drag-and-drop tools • Code is more expressive, powerful & compact • Reusable components (functions, classes, object factories) come naturally in code • An API has a clear specification with defaults, input validation and useful methods • Nothing gets lost in translation: Python is the language of Airflow • The API can be derived/extended as part of the workflow code. Build your own Operators, Hooks, etc. • In its minimal form, it’s as simple as static configuration
for long-running jobs on top of Mesos. • Defining data dependencies was nearly impossible. Debugging why data was not landing on time was really difficult. • Max Beauchemin joined Airbnb and was interested in open sourcing an entirely rewritten version of Data Swarm, the job authoring platform at Facebook. • Introduced Jan 2015 for our main warehouse pipeline. • Open sourced in early 2015, donated to the Apache Foundation for incubation in March 2016. Quick history of Airflow @ Airbnb
be released soon. • The focus has been on stability and performance enhancements. • We hope to graduate to a Top-Level Project this year. • We are looking for contributors. Check out the project and come hack with us. Apache Airflow
has a lot of user-to-user help. • If you have more advanced questions, the dev mailing list at http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/ has the core developers on it. • The documentation is available at https://airflow.incubator.apache.org/ • The project also has a wiki: https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home
https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/ • Matt Davis at PyBay 2016, “A Practical Introduction to Airflow”: https://speakerdeck.com/pybay2016/matt-davis-a-practical-introduction-to-airflow • Laura Lorenz at PyData DC 2016, “How I learned to time travel, or, data pipelining and scheduling with Airflow”: https://www.youtube.com/watch?v=60FUHEkcPyY
has written some thoughts about ETL with Airflow: https://gtoonstra.github.io/etl-with-airflow/
batch processes • It’s the glue that binds your data ecosystem together • It orchestrates tasks in a complex network of job dependencies • It’s Python all the way down • It’s popular and has a thriving open source community • It’s expressive and dynamic: workflows are defined in code
• Validates your SQL, makes sure it parses • Advises against bad SQL patterns • Introspects your code and infers your dependencies on other tables / partitions • Schedules your workflow; Airflow emails you on failure
resurrected, stale and active users • COUNT DISTINCT metrics are complex to compute efficiently • Web companies are obsessed with these metrics! • These typically need to be computed for many sub-products and many core dimensions Behind the scenes • Each entry translates into a complex workflow • “Cubes” the data by running multiple groupings (see the sketch below) • Joins to the user dimension to gather specified demographics • “Backfills” the data since the activation date • Leaves a useful computational trail for deeper analysis • Runs optimized logic • Cuts the long tail of high-cardinality dimensions as specified • Delivers summarized data to use in reports and dashboards
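The internal framework isn’t shown here, but the “cubing” step can be pictured as a single scan over several grouping sets; the table and dimensions below are made-up examples.

```python
# Illustrative HQL: one pass over the data produces rollups for several dimension combinations.
cube_hql = """
SELECT platform, country, ds, COUNT(DISTINCT user_id) AS active_users
FROM user_events
WHERE ds = '{{ ds }}'
GROUP BY platform, country, ds
GROUPING SETS ((ds), (platform, ds), (country, ds), (platform, country, ds))
"""
```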
backing an individual experiment, conceptually: wait for source partitions -> compute atomic data for the experiment -> aggregate metric events and compute stats -> load into metrics repository -> export summary to MySQL
or weeks to go through our main flows • cookie -> userid mapping • event-level attributes, dimensional breakdowns • different types of subjects (host, guests, listing, cookie, …) • different types of experimentation (web, mobile, emails, tickets, …) • “themes” are defined as sets of metrics • Statistics beyond p-value and confidence intervals: preventing bias, global impact, time-boxing
Monitor the Hive metastore’s partition table for the last-updated timestamp • For each recently modified partition, generate a single scan query that computes lots of statistics (see the sketch below): * for numeric values, compute MIN, MAX, AVG, SUM, NULL_COUNT, COUNT DISTINCT, … * for strings, count the number of characters, COUNT_DISTINCT, NULL_COUNT, … * based on naming conventions, add more specific rules * whitelist / blacklist namespaces, regexes, … • Load statistics into MySQL • Used for capacity planning, data quality monitoring, debugging, anomaly detection, alerting, … The partition_stats table schema: cluster STRING, database STRING, table BIGINT, partition STRING, stat_expr STRING, value NUMBER
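A hedged sketch of the query-generation idea (not the actual internal code); here the column names and types are supplied by hand, whereas the real pipeline reads them from the metastore.

```python
# Build one scan query that computes a batch of per-column statistics for a partition.
def build_stats_query(table, partition_clause, columns):
    """columns: list of (name, type) pairs, e.g. [("price", "numeric"), ("city", "string")]."""
    exprs = ["COUNT(1) AS row_count"]
    for name, col_type in columns:
        exprs.append("SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}__null_count".format(c=name))
        if col_type == "numeric":
            exprs += [
                "MIN({c}) AS {c}__min".format(c=name),
                "MAX({c}) AS {c}__max".format(c=name),
                "AVG({c}) AS {c}__avg".format(c=name),
                "SUM({c}) AS {c}__sum".format(c=name),
            ]
        else:  # strings: character counts and distinct values
            exprs += [
                "AVG(LENGTH({c})) AS {c}__avg_length".format(c=name),
                "COUNT(DISTINCT {c}) AS {c}__count_distinct".format(c=name),
            ]
    return "SELECT\n  {cols}\nFROM {table}\nWHERE {part}".format(
        cols=",\n  ".join(exprs), table=table, part=partition_clause
    )

# One such query is generated per recently modified partition; the results get pivoted
# into rows of the partition_stats table described above.
print(build_stats_query("events", "ds = '2017-04-01'", [("price", "numeric"), ("city", "string")]))
```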