
Operating Data Pipeline with Airflow @ Slack


This talk covers the incremental steps we took to solve on-call nightmares and Airflow scalability issues, making our data pipeline more reliable and simpler to operate.


Ananth Packkildurai

April 12, 2018


  1. Operating Data Pipeline with Airflow. Ananth Packkildurai, April 11, 2018

  2. About Slack
     • Public launch: 2014
     • 1000+ employees across 7 countries worldwide, HQ in San Francisco
     • Diverse set of industries including software/technology, retail, media, telecom and professional services
  3. March 2016: 5 data engineers, 350+ Slack employees, 2M active users

  4. April 2018: 10 data engineers, 1000+ Slack employees, 6M active users

  5. Data usage
     • 1 in 2 employees access the data warehouse per week
     • 500+ tables
     • 600k events per second at peak
  6. Airflow stats
     • 240+ active DAGs
     • 5400+ tasks per day
     • 68

  7. Agenda
     1. Airflow Infrastructure
     2. Scale Airflow Executor
     3. Pipeline Operations
     4. Alerting and Monitoring
  8. Airflow infrastructure

  9. Airflow infrastructure
     • Local Executor
     • Tarball code deployment
     • Continuous deployment with Jenkins
     • Flake8, yapf & pytest
     • `airflow.sh` shell utility to ensure a consistent development environment for all users
  10. Scale Airflow Executor

  11. It’s just Airflow being Airflow
     • Why is my task not running?
     • Airflow deadlocked again.
     • Airflow is not scheduling any tasks.
  12. Airflow CPU usage

  13. Airflow Multi Retryable Sensors

  14. [Charts: retryable-sensor CPU usage; load under non-retryable vs. retryable sensors]
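The retryable-sensor idea can be sketched outside Airflow (all names here are hypothetical; the real implementation subclasses Airflow's sensor classes): instead of one long-lived sensor process poking for hours while holding a worker slot and burning CPU, each attempt pokes only a few times, then fails fast so the slot is freed, and the scheduler's normal retry machinery re-runs it later.

```python
import time

class SensorRetryTimeout(Exception):
    """Raised when an attempt's poke budget is spent; the scheduler retries later."""

class RetryableSensor:
    """Minimal sketch of a retryable sensor (illustrative, not Slack's code).

    poke_fn returns True once the awaited condition (partition, file, ...)
    exists. Each attempt pokes at most `pokes_per_attempt` times, then raises
    so the task fails fast and releases its worker slot between attempts.
    """

    def __init__(self, poke_fn, poke_interval=60, pokes_per_attempt=3, retries=48):
        self.poke_fn = poke_fn
        self.poke_interval = poke_interval
        self.pokes_per_attempt = pokes_per_attempt
        self.retries = retries

    def run_attempt(self):
        for poke in range(self.pokes_per_attempt):
            if self.poke_fn():
                return True
            if poke < self.pokes_per_attempt - 1:
                time.sleep(self.poke_interval)
        raise SensorRetryTimeout("condition not met in this attempt")

def run_with_retries(sensor):
    """Simulates the scheduler's retry loop; the slot is free between attempts."""
    for _attempt in range(sensor.retries + 1):
        try:
            return sensor.run_attempt()
        except SensorRetryTimeout:
            continue
    return False
```

The trade-off is latency (the condition is noticed at most one retry delay late) for throughput: worker slots and CPU are no longer pinned by idle sensors.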

  15. Pipeline Operations

  16. Airflow fallacies
     • Upstream task success is reliable.
     • Tasks remain static after reaching the success state.
     • The DAG structure is static.
     • Data quality is not part of the task life cycle.
  17. Mario: Global DAG operator

  18. Airflow operations
     • Hive Partition Sensor Operator:
       1. Check the task success state
       2. Check the Hive metastore for the partition
       3. Check the S3 path for the `_SUCCESS` file
     • DQ check
     • DAG cleanup: `delete_dag <dag name>`
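The three partition checks above can be sketched as one readiness function. This is a hedged reconstruction: the `metastore` and `s3` client interfaces are hypothetical stand-ins (injected so the logic is testable), not real library APIs.

```python
def partition_ready(task_state, metastore, s3, table, partition, s3_prefix):
    """Sketch of the slide's three checks, in order:
    1. the upstream task reached the `success` state,
    2. the Hive metastore has the partition registered,
    3. the S3 path contains the `_SUCCESS` marker file.
    `metastore` and `s3` are assumed client objects, not a real API.
    """
    if task_state != "success":
        return False
    if not metastore.has_partition(table, partition):
        return False
    return s3.exists("%s/%s/_SUCCESS" % (s3_prefix, partition))
```

Checking all three layers guards against the fallacy above that upstream task success alone is reliable: the task state, the metastore, and the files on S3 can each disagree.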
  19. DAG Policy Validator
     • test_external_tasks: check that external tasks point to valid DAGs and tasks.
     • test_circular_dependencies: check for circular dependencies in tasks *across* DAGs.
     • test_priority_weight: check that production tasks do not depend on a lower-priority task.
     • test_on_failure: require that high-priority DAGs have an on-failure alert.
  20. DAG Policy Validator
     • test_sla: require that high-priority DAGs have an SLA.
     • test_sla_timing: SLA timing should make sense; no job should depend on a task with an equal or longer SLA than its own.
     • test_has_retry_and_success_callbacks: require an on_success_callback for tasks with an on_retry_callback.
     • test_require_dq_for_prod: require a DQ check for all high-priority tasks.
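A check like test_circular_dependencies can be sketched as cycle detection over cross-DAG edges. This is a hypothetical reconstruction: in practice the dependency map would be collected from the deployed DAGs' external-task references, not written by hand.

```python
def find_cross_dag_cycle(deps):
    """Detect a cycle in cross-DAG dependencies.

    deps maps a "dag.task" id to the list of upstream "dag.task" ids it
    waits on. Returns one cycle as a list (first id repeated at the end),
    or None if the graph is acyclic. Plain DFS with three node colors.
    """
    nodes = set(deps)
    for ups in deps.values():
        nodes.update(ups)
    color = dict.fromkeys(nodes, 0)  # 0 = unvisited, 1 = in progress, 2 = done
    stack = []

    def visit(node):
        color[node] = 1
        stack.append(node)
        for up in deps.get(node, ()):
            if color[up] == 1:                      # back edge: cycle found
                return stack[stack.index(up):] + [up]
            if color[up] == 0:
                cycle = visit(up)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = 2
        return None

    for n in nodes:
        if color[n] == 0:
            cycle = visit(n)
            if cycle:
                return cycle
    return None
```

Running such checks in CI (alongside flake8, yapf and pytest, per the infrastructure slide) rejects a bad DAG at deploy time instead of letting the scheduler deadlock on it at runtime.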
  21. Alerting and Monitoring

  22. Alerting and Monitoring
     • Alerting should be reliable.
     • Alerts should be actionable.
     • Alert when it really matters.
     • Suppress repeatable alerts.
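The "suppress repeatable alerts" point can be sketched as a small dedup layer in front of the alert callback. The class and its fields are illustrative assumptions, not the actual Slack callback code: drop an alert if the same (dag, task, error) key already fired within a suppression window.

```python
import time

class AlertSuppressor:
    """Sketch of repeat-alert suppression (hypothetical, not Slack's code).

    should_send returns True only if no alert with the same key was sent
    within the last `window_seconds`. The clock is injectable for testing.
    """

    def __init__(self, window_seconds=3600, clock=time.time):
        self.window = window_seconds
        self.clock = clock
        self._last_sent = {}

    def should_send(self, dag_id, task_id, error_kind):
        key = (dag_id, task_id, error_kind)
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # same alert fired recently; suppress the repeat
        self._last_sent[key] = now
        return True
```

Suppressing repeats keeps the channel quiet enough that the alerts which do fire stay actionable, which is exactly the failure mode the "a little too quiet" slide warns about in the other direction.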
  23. OnCall Alert callback

  24. OnCall Alert callback

  25. Sample Alerts

  26. Sample Alerts

  27. Sample Alerts

  28. A little too quiet

  29. What is next?

  30. What is next?
     • Increase pipeline visibility (user actions, task life cycle, etc.)
     • Data lineage
     • Airflow Kubernetes Executor
  31. Thank You!