Apache Airflow DataEngConf SF 2017 Workshop

These slides cover some of the material explored in the Apache Airflow workshop at DataEngConf SF 2017. The companion repository is at https://github.com/artwr/airflow-workshop-dataengconf-sf-2017

Arthur Wiedmer

April 26, 2017

Transcript

  1. Apache Airflow
    Workshop
    ARTHUR WIEDMER / APRIL 2017

  2. • Data Engineer on the Data Platform Team at Airbnb.
    • Working on Airflow since 2014.
    • Apache Airflow committer.
    • Most of my free time is spent with my wife and our
    1-year-old son :)
    About Me

  3. Introductions

  4. A quick intro to Airflow

  5. What is Airflow?
    What can Airflow do for you?

  6. (image slide)

  7. Airflow?

  8. Airflow?

  9. • Companies grow to have a complex network of
    processes that have intricate dependencies.
    • Analytics & batch processing are mission critical. They
    serve decision makers and power machine learning
    models that can feed into production.
    • There is a lot of time invested in writing and
    monitoring jobs and troubleshooting issues.
    Why does
    Airflow exist?

  10. An open source platform to author, orchestrate and
    monitor batch processes
    • It’s the glue that binds your data ecosystem together
    • It orchestrates tasks in a complex network of job
    dependencies
    • It’s Python all the way down
    • It’s popular and has a thriving open source community
    • It’s expressive and dynamic, workflows are defined in
    code
    What is Airflow?

  11. Concepts
    •Workflows are called DAGs, for Directed Acyclic Graphs.

  12. (image slide)

  13. (image slide)

  14. (image slide)

  15. Concepts
    • Tasks: Workflows are composed of tasks called Operators.
    • Operators can do pretty much anything that can be run on the Airflow
    machine.
    • We tend to classify operators into 3 categories: Sensors, Operators,
    Transfers.

  16. (image slide)

  17. Setting dependencies
    t2.set_upstream(t1)
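    As a minimal, self-contained sketch of what t1 and t2 might look like (the dag_id, task ids and commands are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG('dependency_example', start_date=datetime(2017, 4, 1), schedule_interval='@daily')

    t1 = BashOperator(task_id='t1', bash_command='echo "extract"', dag=dag)
    t2 = BashOperator(task_id='t2', bash_command='echo "transform"', dag=dag)

    # t2 will only run once t1 has succeeded
    t2.set_upstream(t1)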

  18. Architecture

  19. Metadata DB
    Architecture

  20. Scheduler
    Metadata DB
    Architecture

  21. Scheduler
    Metadata DB
    Worker
    Architecture

  22. Scheduler
    Metadata DB
    Webserver
    Worker
    Architecture

  23. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Architecture

  24. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Message Queue
    (Celery)
    Architecture

  25. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Message Queue
    (Celery)
    Architecture
    Worker

  26. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Message Queue
    (Celery)
    Architecture
    Worker
    Worker

  27. What can Airflow do for you?

  28. Monitoring

  29. Monitoring DAG Status

  30. Monitoring DAG Status

  31. Monitoring Gantt Chart style

  32. Monitoring Analytics

  33. Scale

  34. Airflow @ Airbnb : scale
    • We currently run 800+ DAGs and about 80k tasks a day.
    • We have DAGs running at daily, hourly and 10-minute granularities. We also have ad
    hoc DAGs.
    • About 100 people @ Airbnb have authored or contributed to a DAG directly and
    500 have contributed to or modified a configuration for one of our frameworks.
    • We use the Celery executor with Redis as a backend.

  35. Flexibility and Extensibility

  36. Airflow @ Airbnb
    Data Warehousing
    Experimentation Growth Analytics
    Email Targeting
    Sessionization
    Search Ranking
    Infrastructure Monitoring
    Engagement Analytics
    Anomaly Detection
    Operational Work
    Data Exports
    from/to production

  37. Common Pattern
    Input: Web app / Abstracted static config (python / yaml / hocon)
    Airflow script -> Data Processing Workflow
    Output: Derived data - or - Alerts & Notifications

  38. Common Pattern
    Input: Web app / Abstracted static config (python / yaml / hocon)
    Airflow script -> Data Processing Workflow
    Output: Derived data - or - Alerts & Notifications

  39. Common Pattern
    Input: Web app / Abstracted static config (python / yaml / hocon)
    Airflow script -> Data Processing Workflow
    Output: Derived data - or - Alerts & Notifications

  40. Common Pattern
    Input: Web app / Abstracted static config (python / yaml / hocon)
    Airflow script -> Data Processing Workflow
    Output: Derived data - or - Alerts & Notifications

  41. CumSum
    Efficient cumulative metrics computation
    • Live to date metrics per subject (user, listings, advertiser, …) are a common pattern
    • Computing the SUM since beginning of time is inefficient; it’s preferable to add
    today’s metrics to yesterday’s total

  42. CumSum
    Efficient cumulative metrics computation
    • Live to date metrics per subject (user, listings, advertiser, …) are a common pattern
    • Computing the SUM since beginning of time is inefficient; it’s preferable to add
    today’s metrics to yesterday’s total
    Outputs
    • An efficient pipeline
    • Easy / efficient backfilling capabilities
    • A centralized table, partitioned by metric and date,
    documented by code
    • Allows for efficient time range deltas by scanning 2 partitions
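    A hedged sketch of the underlying query pattern (all table and column names are made up; in a DAG this string would typically be the hql of a Hive task, templated with Airflow's ds and yesterday_ds macros):

    # Cumulative total for ds = yesterday's cumulative partition + today's daily metrics,
    # so each run scans two partitions instead of the whole history.
    CUMSUM_HQL = """
    INSERT OVERWRITE TABLE metrics_cumulative PARTITION (ds='{{ ds }}')
    SELECT
        COALESCE(c.subject_id, d.subject_id) AS subject_id,
        COALESCE(c.total, 0) + COALESCE(d.value, 0) AS total
    FROM
        (SELECT subject_id, total FROM metrics_cumulative WHERE ds='{{ yesterday_ds }}') c
        FULL OUTER JOIN
        (SELECT subject_id, value FROM metrics_daily WHERE ds='{{ ds }}') d
        ON c.subject_id = d.subject_id
    """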

  43. Anatomy of a DAG

  44. Git Repository for the workshop
    • The git repository with course materials is here:
    • https://github.com/artwr/airflow-workshop-dataengconf-sf-2017
    • Some of the materials have been ported as Sphinx documentation:
    • https://artwr.github.io/airflow-workshop-dataengconf-sf-2017/

  45. Anatomy of a DAG: Setup
    • In Python, you must import a few things explicitly. In particular, the DAG class
    and the operators you want to use.
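    For example, a minimal set of imports for the DAG file used in the rest of this section (module paths as of Airflow 1.8; which operators you import depends on your workflow):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator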

  46. Anatomy of a DAG: Default Arguments
    • Default args contain some parameters that you can use for all tasks, like the
    owner, the start_date, the number of retries. Most of these arguments can be
    overridden at the task level.
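    A typical default_args dictionary might look like the sketch below (the values are illustrative; the keys are standard BaseOperator arguments):

    default_args = {
        'owner': 'arthur',
        'depends_on_past': False,
        'start_date': datetime(2017, 4, 1),
        'email': ['data-alerts@example.com'],
        'email_on_failure': True,
        'retries': 2,
        'retry_delay': timedelta(minutes=5),
    }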

  47. Anatomy of a DAG: DAG definition
    • Create a DAG
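    Continuing the same example file, the DAG object itself (the dag_id and schedule are illustrative):

    dag = DAG(
        dag_id='workshop_example',
        default_args=default_args,
        schedule_interval='@daily',
    )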

  48. Anatomy of a DAG: Adding operators
    • We have imported operators. Sensors usually require a time or a particular
    resource to check. Operators will require either a simple command or a path to a
    script (In the BashOperator below, bash_command could be “path/to/script.sh”).
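    A sketch of the two kinds of tasks, continuing the same example (the sensor class, table, partition spec and commands are illustrative):

    from airflow.operators.sensors import HivePartitionSensor

    # Sensor: wait for a particular resource (here a Hive partition) to exist
    wait_for_events = HivePartitionSensor(
        task_id='wait_for_events',
        table='events',
        partition="ds='{{ ds }}'",
        dag=dag,
    )

    # Operator: run a command; bash_command could also be "path/to/script.sh"
    process_events = BashOperator(
        task_id='process_events',
        bash_command='echo "processing {{ ds }}"',
        dag=dag,
    )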

  49. Anatomy of a DAG: Setting dependencies
    • Finally we want to define how operators relate to each other in the DAG. You can
    choose to store the task objects and use set_upstream or set_downstream. A
    common pattern for us is to store the dependencies as a dictionary, iterate over
    the items and use set_dependency.
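    A sketch of the dictionary pattern, continuing the same example (task ids are illustrative; DAG.set_dependency takes the upstream and downstream task ids):

    # Map each downstream task id to the task ids it depends on
    dependencies = {
        'process_events': ['wait_for_events'],
    }

    for downstream_id, upstream_ids in dependencies.items():
        for upstream_id in upstream_ids:
            dag.set_dependency(upstream_id, downstream_id)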

  50. Architecture

  51. Architecture

  52. Metadata DB
    Architecture

  53. Scheduler
    Metadata DB
    Architecture

  54. Scheduler
    Metadata DB
    Worker
    Architecture

  55. Scheduler
    Metadata DB
    Webserver
    Worker
    Architecture

  56. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Architecture

  57. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Message Queue
    (Celery)
    Architecture

  58. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Message Queue
    (Celery)
    Architecture
    Worker

  59. Scheduler
    Metadata DB
    Webserver
    Worker
    Code repository
    Message Queue
    (Celery)
    Architecture
    Worker
    Worker

  60. Airflow Scheduler/Executor
    •You have a choice of Executors which enable different ways to distribute
    tasks (See: https://airflow.incubator.apache.org/configuration.html#):
    • SequentialExecutor
    • LocalExecutor
    • CeleryExecutor
    • MesosExecutor (community contributed)
    •The SequentialExecutor will only execute one task at a time, in process.
    •The LocalExecutor uses local processes. The number of processes can be
    scaled with the machine.
    •Celery and Mesos provide a way to scale out over multiple worker machines.

  61. Challenges: DAGs are file based
    •Dynamic DAGs to the rescue.
    •A common pattern that we use is DAG factories that create DAGs
    based on configurations (see the sketch below).
    •The configuration can live in static config files or a Database.
    •One thing to remember is that Airflow is geared towards slowly
    changing DAGs.
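    A minimal sketch of the factory pattern, assuming the configuration is an in-file dict (in practice it could be loaded from YAML files or a database; all names are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    def create_dag(dag_id, schedule, command):
        """Build one small DAG from a configuration entry."""
        dag = DAG(dag_id, start_date=datetime(2017, 4, 1), schedule_interval=schedule)
        BashOperator(task_id='run', bash_command=command, dag=dag)
        return dag

    CONFIGS = {
        'report_daily': {'schedule': '@daily', 'command': 'run_report daily'},
        'report_hourly': {'schedule': '@hourly', 'command': 'run_report hourly'},
    }

    # Airflow discovers DAGs in module-level globals, so register each generated DAG there
    for dag_id, conf in CONFIGS.items():
        globals()[dag_id] = create_dag(dag_id, conf['schedule'], conf['command'])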

  62. Challenges: State
    •Propagating State in a distributed system is hard.
    •Airflow tracks multiple states to handle helpful things like automated retries,
    skipping tasks, and detecting scheduling locks.
    (https://github.com/apache/incubator-airflow/blob/master/airflow/utils/state.py#L26-L57)
    •We have addressed a decent amount of those issues but are still
    discovering edge cases.

  63. Challenges: Security
    •Authentication: LDAP is currently supported, and pluggable auth backends are
    possible.
    •Authorization: right now mostly based on Flask.
    - Usually 3 levels: not logged in, logged in, superuser.
    - It is possible to hide some pages/views based on this.
    • Access control: pretty wide right now.

  64. What should you know to get
    started?

  65. Best Practices for deployment

  66. Getting Started with deploying Airflow
    •Usually people start their proof of concept by running the LocalExecutor.
    •In this case you need a production-ready metadata DB like MySQL or
    Postgres.
    •The scheduler is still the weakest link. Enable service monitoring using
    something like runit, monit, etc.

  67. Metadata Database
    •As the number of jobs you run on Airflow increases, so does the load on
    the Airflow database. It is not uncommon for the Airflow database to
    require a decent amount of CPU if you execute a large number of
    concurrent tasks. (We are working on reducing the db load)
    •SQLite is used for tutorials but cannot handle concurrent connections.
    We highly recommend switching to MySQL/MariaDB or Postgres.
    •Some people have tried other databases, but we cannot currently test
    against them, so they might break in the future.

  68. Deploying DAGs
    •Put your DAGs in source control. There are several methods to get them
    to the worker machines:
    - Pulling from a SCM repository with cron.
    - Using a deploy system to unzip an archive of the DAGs.
    •The main thing to remember is that Python processes will keep the
    version they have in memory unless specifically refreshed. This can be a
    problem for a long running web server, where you can see a lag between
    the web server and what is deployed. A refresh can be triggered via the UI
    or API.

  69. Best Practices for Pipelines

  70. Monitoring and Alerting on your DAGs
    •Enable the email feature and EmailOperator/SlackOperator for monitoring
    completion and failure.
    •Ease of monitoring will help you keep track of your jobs as their number
    grows.
    •Check out the SLA feature to know when your jobs are not completing on
    time.
    •If you have more custom needs, Airflow supports arbitrary callbacks in Python
    on success, failure and retry.
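    A hedged sketch of wiring these hooks into a task (the alerting logic is a placeholder; email, sla and on_failure_callback are standard task arguments, and the dag object is assumed to be defined as in the anatomy section):

    from datetime import timedelta

    from airflow.operators.bash_operator import BashOperator

    def notify_failure(context):
        """Called by Airflow with the task context when the task fails."""
        task_id = context['task_instance'].task_id
        # Replace the print with a Slack/pager call for real alerting
        print('Task %s failed for %s' % (task_id, context['ds']))

    critical_task = BashOperator(
        task_id='critical_task',
        bash_command='run_critical_job.sh ',  # trailing space keeps Jinja from treating the .sh path as a template file
        email=['data-alerts@example.com'],
        email_on_failure=True,
        sla=timedelta(hours=2),               # emit an SLA miss if the task has not finished 2 hours into the period
        on_failure_callback=notify_failure,
        retries=1,
        dag=dag,                              # assumed to exist, as in the earlier examples
    )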

  71. Best Practices about DAG building: Architecture
    •Try to make your tasks idempotent (drop partition/insert overwrite/delete
    output files before writing them; see the sketch below). Airflow will then be able to
    handle retrying for you in case of failure.
    •Common patterns are:
    - Sensor -> Transfer (Extract) -> Transform -> Store results (Load)
    - Stage transformed data -> run data quality checks -> move to final
    location.
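    As an illustration of the idempotency point, a task that overwrites its own output partition can be retried or backfilled safely (table and column names are made up; assumes the HiveOperator and the dag object from the earlier examples):

    from airflow.operators.hive_operator import HiveOperator

    # INSERT OVERWRITE on the ds partition makes the task idempotent: re-running it
    # for the same ds replaces the partition instead of appending duplicate rows.
    build_summary = HiveOperator(
        task_id='build_summary',
        hql="""
            INSERT OVERWRITE TABLE user_events_summary PARTITION (ds='{{ ds }}')
            SELECT user_id, COUNT(*) AS events
            FROM events
            WHERE ds='{{ ds }}'
            GROUP BY user_id
        """,
        dag=dag,
    )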

  72. Best Practices about DAG building: Managing resources
    •You can set up pools for resource management. Pools are a way to limit
    the concurrency of expensive tasks across DAGs (for instance running
    Spark jobs, or accessing an RDBMS). They can be set up via the UI; see the sketch below.
    •If you need specialized workers, the CeleryExecutor allows you to set up
    different queues and workers consuming different types of tasks. The
    LocalExecutor does not have this concept, but a similar result can be
    obtained by sharding DAGs onto separate boxes.
    •If you use the cgroup task runner, you have the opportunity to limit
    resource usage (CPU, memory) on a per task basis.
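    For example, pool and queue are set per task (the pool and queue names are illustrative and must already exist in your Airflow setup; HiveOperator and dag as in the previous sketch):

    heavy_query = HiveOperator(
        task_id='heavy_query',
        hql="SELECT COUNT(*) FROM events WHERE ds='{{ ds }}'",
        pool='hive_default',   # caps how many 'hive_default' tasks run at once, across all DAGs
        queue='hive_workers',  # with the CeleryExecutor, only workers listening on this queue pick it up
        dag=dag,
    )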

  73. Configuration as Code!
    As an alternative to static YAML, JSON or, worse, drag-and-drop tools
    • Code is more expressive, powerful & compact
    • Reusable components (functions, classes, object factories) come naturally in code
    • An API has a clear specification with defaults, input validation and useful methods
    • Nothing gets lost in translation: Python is the language of Airflow.
    • The API can be derived/extended as part of the workflow code. Build your own
    Operators, Hooks etc…
    • In its minimal form, it’s as simple as static configuration

  74. The future of Airflow

  75. • Back in 2014 we were using Chronos, a framework for
    long-running jobs on top of Mesos.
    • Defining data dependencies was near impossible.
    Debugging why data was not landing on time was really
    difficult.
    • Max Beauchemin joined Airbnb and was interested in
    open sourcing an entirely rewritten version of Data
    Swarm, the job authoring platform at Facebook.
    • Introduced Jan 2015 for our main warehouse pipeline.
    • Open sourced in early 2015, donated to the Apache
    Software Foundation for incubation in March 2016.
    Quick history
    of Airflow
    @ Airbnb

  76. • The community is currently working on version
    1.8.1 RC2, to be released soon.
    • The focus has been on stability and performance
    enhancements.
    • We hope to graduate to Top Level Project this year.
    • We are looking for contributors. Check out the project
    and come hack with us.
    Apache Airflow

  77. Resources

  78. Airflow Resources
    • Gitter is fairly active at https://gitter.im/apache/incubator-airflow and has a lot of
    user-to-user help.
    • If you have more advanced questions, the dev mailing list at
    http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/ has the core developers on it.
    • The documentation is available at https://airflow.incubator.apache.org/
    • The project also has a wiki:
    https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home

  79. Airflow Talks
    • The Bay Area Airflow meet up :
    https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/
    • Matt Davis at PyBay 2016:
    https://speakerdeck.com/pybay2016/matt-davis-a-practical-introduction-to-airflow
    • Laura Lorenz at PyData DC 2016 How I learned to time travel, or, data pipelining and
    scheduling with Airflow :
    https://www.youtube.com/watch?v=60FUHEkcPyY

  80. Airflow
    • Gerard Toonstra, a contributor on the mailing lists has written some thoughts about
    ETL with Airflow
    https://gtoonstra.github.io/etl-with-airflow/
    • Laura Lorenz at PyData DC 2016 How I learned to time travel, or, data pipelining and
    scheduling with Airflow :
    https://www.youtube.com/watch?v=60FUHEkcPyY

  81. Questions?

  82. Other Frameworks built on the
    Airflow Platform

  83. (image slide)

  84. Airflow?

  85. Airflow?

  86. Airflow?
    An open source platform to author, orchestrate and monitor
    batch processes
    • It’s the glue that binds your data ecosystem together
    • It orchestrates tasks in a complex network of job
    dependencies
    • It’s Python all the way down
    • It’s popular and has a thriving open source community
    • It’s expressive and dynamic, workflows are defined in code

  87. Airflow?
    An open source platform to author, orchestrate and monitor
    batch processes
    • It’s the glue that binds your data ecosystem together
    • It orchestrates tasks in a complex network of job
    dependencies
    • It’s Python all the way down
    • It’s popular and has a thriving open source community
    • It’s expressive and dynamic, workflows are defined in code

  88. Airflow?
    An open source platform to author, orchestrate and monitor
    batch processes
    • It’s the glue that binds your data ecosystem together
    • It orchestrates tasks in a complex network of job
    dependencies
    • It’s Python all the way down
    • It’s popular and has a thriving open source community
    • It’s expressive and dynamic, workflows are defined in code

  89. Airflow?
    An open source platform to author, orchestrate and monitor
    batch processes
    • It’s the glue that binds your data ecosystem together
    • It orchestrates tasks in a complex network of job
    dependencies
    • It’s Python all the way down
    • It’s popular and has a thriving open source community
    • It’s expressive and dynamic, workflows are defined in code

  90. AutoDAG
    anyone can schedule a simple query

  91. AutoDAG
    anyone can schedule a simple query
    Behind the scenes
    • Validates your SQL, makes sure it parses
    • Advises against bad SQL patterns
    • Introspects your code and infers your dependencies on other tables / partitions
    • Schedules your workflow, Airflow emails you on failure

  92. AutoDAG
    anyone can schedule a simple query
    Behind the scenes
    • Validates your SQL, makes sure it parses
    • Advises against bad SQL patterns
    • Introspects your code and infers your dependencies on other tables / partitions
    • Schedules your workflow, Airflow emails you on failure

  93. Engagement & Growth metrics
    DAU, WAU, MAU / new, churn, resurrected, stale and active users
    • COUNT DISTINCT metrics are complex to compute efficiently
    • Web companies are obsessed with these metrics!
    • Typically needs to be computed for many sub-products and many core dimensions

  94. Engagement & Growth metrics
    DAU, WAU, MAU / new, churn, resurrected, stale and active users
    • COUNT DISTINCT metrics are complex to compute efficiently
    • Web companies are obsessed with these metrics!
    • Typically needs to be computed for many sub-products and many core dimensions
    Behind the scene
    • Translates each entry into a complex workflow
    • “Cubes” the data by running multiple groupings
    • Joins to the user dimension to gather specified demographics
    • “Backfills” the data since the activation date
    • Leaves a useful computational trail for deeper analysis
    • Runs optimized logic
    • Cuts the long tail of high cardinality dimensions as specified
    • Delivers summarized data to use in reports and dashboards

  95. CumSum
    Efficient cumulative metrics computation
    • Live to date metrics per subject (user, listings, advertiser, …) are a common pattern
    • Computing the SUM since beginning of time is inefficient; it’s preferable to add
    today’s metrics to yesterday’s total

  96. CumSum
    Efficient cumulative metrics computation
    • Live to date metrics per subject (user, listings, advertiser, …) are a common pattern
    • Computing the SUM since beginning of time is inefficient; it’s preferable to add
    today’s metrics to yesterday’s total
    Outputs
    • An efficient pipeline
    • Easy / efficient backfilling capabilities
    • A centralized table, partitioned by metric and date,
    documented by code
    • Allows for efficient time range deltas by scanning 2 partitions

  97. Experimentation
    A/B testing at scale (simplified)
    Define user metrics as SQL / Configure your experiments

  98. Experimentation
    A/B testing at scale (simplified)
    Define user metrics as SQL / Configure your experiments

  99. Experimentation
    a small portion of the whole experimentation workflow
    Conceptually, the tasks backing an individual experiment:
    • Wait for source partitions
    • Load into metrics repository
    • Compute atomic data for the experiment
    • Aggregate metric events and compute stats
    • Export summary to MySQL

  100. Experimentation
    data structures overview (simplified)
    metrics_repo: ds (partition), metric_source (partition), userid BIGINT,
    dimension_map MAP, event_name STRING, value NUMBER
    experiment_assignments: ds (partition), experiment STRING, treatment STRING,
    userid BIGINT, first_exposure_ts STRING
    experiment_stats: ds (partition), experiment STRING, treatment_name STRING,
    control_name STRING, delta DOUBLE, pvalue DOUBLE

  101. Experimentation
    overlooked complexity in previous slides
    • users take days or weeks to go through our main flows
    • cookie -> userid mapping
    • event level attributes, dimensional breakdowns
    • different types of subjects (host, guests, listing, cookie, …)
    • different types of experimentation (web, mobile, emails, tickets…)
    • “themes” are defined as sets of metrics
    • Statistics beyond pvalue and confidence intervals: preventing bias, global impact, time-boxing

  102. Stats Daemon
    Build database statistics on Hive using Presto
    • Monitor the Hive metastore’s partition table for the last-updated timestamp
    • For each recently modified partition, generate a single scan query that computes loads
    of metrics
    * for numeric values, compute MIN, MAX, AVG, SUM, NULL_COUNT, COUNT
    DISTINCT, …
    * for strings, count the number of characters, COUNT_DISTINCT, NULL_COUNT, …
    * based on naming conventions, add more specific rules
    * whitelist / blacklist namespaces, regexes, …
    • Load statistics into MySQL
    • Used for capacity planning, data quality monitoring, debugging,
    anomaly detection, alerting, …
    partition_stats: cluster STRING, database STRING, table BIGINT,
    partition STRING, stat_expr STRING, value NUMBER

  103. Other Airflow Frameworks
    • Anomaly detection
    • Production MySQL exports
    • AirOLAP: loads data into druid.io
    • Email targeting rule engine
    • Cohort Analysis & user segmentation (prototype)
    • …

  104. (image slide)