Operating Data Pipeline with Airflow @ Slack

This talk covers the incremental steps we took to fix on-call nightmares and Airflow scalability issues, making our data pipeline more reliable and simpler to operate.

Ananth Packkildurai

April 12, 2018

Transcript

  1. Operating Data Pipeline with Airflow. Ananth Packkildurai, April 11, 2018
  2. About Slack
    • Public launch: 2014
    • 1000+ employees across 7 countries worldwide
    • HQ in San Francisco
    • Diverse set of industries including software/technology, retail, media, telecom and professional services.
  3. March 2016: 5 data engineers, 350+ Slack employees, 2M active users
  4. April 2018: 10 data engineers, 1000+ Slack employees, 6M active users
  5. Data usage: 1 in 2 access the data warehouse per week, 500+ tables, 600k events per sec at peak
  6. Airflow stats: 240+ active DAGs, 5400+ tasks per day, 68 contributors
  7. Agenda
    1. Airflow Infrastructure
    2. Scale Airflow Executor
    3. Pipeline Operations
    4. Alerting and Monitoring
  8. Airflow infrastructure

  9. Airflow infrastructure
    • Local Executor
    • Tarball code deployment
    • Continuous deployment with Jenkins
    • Flake8, yapf & pytest
    • `airflow.sh` shell utility to ensure a consistent development environment for all users.
  10. Scale Airflow Executor

  11. Scale Airflow Executor: "It's just Airflow being Airflow"
    • Why is my task not running?
    • Airflow deadlock again
    • Airflow not scheduling any tasks
  12. Airflow CPU usage

  13. Airflow Multi Retryable Sensors

  14. Retryable Sensors CPU usage (chart panels: Non-Retryable Sensors Load, Retryable Sensors Load)

  15. Pipeline Operations

  16. Airflow fallacies
    • The upstream task success is reliable.
    • The task remains static after the success state.
    • The DAG structure is static.
    • Data quality is not part of a task life cycle.
  17. Mario: Global DAG operator

  18. Airflow operations
    • Hive Partition Sensor Operator: 1. Check task success state 2. Check Hive metastore for partition 3. Check S3 path for the `_SUCCESS` file
    • DQ Check
    • DAG cleanup: delete_dag <dag name>
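
The three checks listed for the Hive Partition Sensor Operator can be sketched in plain Python. This is only an illustration of the ordering, not Slack's operator: `partition_ready`, `is_ready`, and the three injected callables are hypothetical names, and a real sensor would back them with the Airflow metadata database, the Hive metastore, and S3.

```python
from typing import Callable, List, NamedTuple


class PartitionCheck(NamedTuple):
    name: str
    passed: bool


def partition_ready(
    task_succeeded: Callable[[], bool],
    metastore_has_partition: Callable[[], bool],
    success_file_exists: Callable[[], bool],
) -> List[PartitionCheck]:
    """Run the three checks in order, stopping at the first one that fails.

    The callables are placeholders (assumptions) for the real lookups.
    """
    checks = [
        ("task_success_state", task_succeeded),
        ("hive_metastore_partition", metastore_has_partition),
        ("s3_success_file", success_file_exists),
    ]
    results: List[PartitionCheck] = []
    for name, check in checks:
        passed = bool(check())
        results.append(PartitionCheck(name, passed))
        if not passed:
            break  # later checks are skipped once one fails
    return results


def is_ready(results: List[PartitionCheck]) -> bool:
    """The partition counts as ready only if all three checks ran and passed."""
    return len(results) == 3 and all(r.passed for r in results)
```

Returning the per-check results rather than a single boolean makes it easier to report which layer disagreed, for example a task marked successful whose `_SUCCESS` file never landed.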
  19. DAG Policy Validator
    • test_external_tasks: Check if external tasks point to valid DAGs and tasks.
    • test_circular_dependencies: Check if tasks have circular dependencies *across* DAGs.
    • test_priority_weight: Check that production tasks do not depend on a lower priority task.
    • test_on_failure: Require that high-priority DAGs have an on-failure alert.
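
As a rough illustration of what a check like test_circular_dependencies might do, the sketch below runs a depth-first search over a dependency map keyed by `(dag_id, task_id)`. The function name `find_cycle` and the input shape are assumptions made for this example; the talk does not show how the validator collects cross-DAG dependencies.

```python
from typing import Dict, List, Optional, Tuple

Task = Tuple[str, str]  # (dag_id, task_id)


def find_cycle(deps: Dict[Task, List[Task]]) -> Optional[List[Task]]:
    """Return one dependency cycle as a list of tasks, or None if there is none.

    `deps` maps each task to the tasks it depends on, possibly in other DAGs.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour: Dict[Task, int] = {}
    path: List[Task] = []

    def visit(node: Task) -> Optional[List[Task]]:
        colour[node] = GREY
        path.append(node)
        for upstream in deps.get(node, []):
            state = colour.get(upstream, WHITE)
            if state == GREY:
                # upstream is already on the current path, so we closed a loop
                return path[path.index(upstream):] + [upstream]
            if state == WHITE:
                cycle = visit(upstream)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(deps):
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

In a pytest-style validator this could be asserted as `find_cycle(deps) is None`, with the returned cycle printed in the failure message so the offending DAGs are easy to spot.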
  20. DAG Policy Validator
    • test_sla: Require that high-priority DAGs have an SLA.
    • test_sla_timing: SLA timing should make sense. No job should depend on a task that has an equal or longer SLA than it does.
    • test_has_retry_and_success_callbacks: Require an on_success_callback for tasks with an on_retry_callback.
    • test_require_dq_for_prod: Require a DQ check for all high-priority tasks.
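
The SLA timing rule (no job should depend on a task with an equal or longer SLA) lends itself to a small worked example. The sketch below assumes SLAs and upstream relationships have already been extracted into plain dictionaries; `sla_timing_violations` and both input shapes are hypothetical and not taken from the talk.

```python
from datetime import timedelta
from typing import Dict, List, Tuple


def sla_timing_violations(
    slas: Dict[str, timedelta],
    upstreams: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (task, upstream) pairs where the upstream SLA is not strictly shorter.

    Tasks without an SLA are ignored here; a separate check would require one.
    """
    violations: List[Tuple[str, str]] = []
    for task, parents in upstreams.items():
        task_sla = slas.get(task)
        if task_sla is None:
            continue
        for parent in parents:
            parent_sla = slas.get(parent)
            if parent_sla is not None and parent_sla >= task_sla:
                violations.append((task, parent))
    return violations


slas = {
    "extract": timedelta(hours=1),
    "transform": timedelta(hours=2),
    "report": timedelta(hours=2),  # same SLA as its upstream: a violation
}
upstreams = {"transform": ["extract"], "report": ["transform"]}
violations = sla_timing_violations(slas, upstreams)
```

Here `report` is flagged because it cannot reasonably meet a two-hour SLA when the task it waits on is itself allowed two hours.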
  21. Alerting and Monitoring

  22. Alerting and Monitoring
    • Alerting should be reliable.
    • Alerts should be actionable.
    • Alert when it really matters.
    • Suppress repeated alerts.
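
One simple way to suppress repeated alerts is a per-task quiet window. The sketch below is an assumption about how such suppression could work, not a description of Slack's alert callback; `AlertSuppressor` and `should_send` are hypothetical names, and state is kept in memory for brevity.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


class AlertSuppressor:
    """Allow at most one alert per (dag_id, task_id) within a time window."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._last_sent: Dict[Tuple[str, str], datetime] = {}

    def should_send(self, dag_id: str, task_id: str, now: datetime) -> bool:
        key = (dag_id, task_id)
        last: Optional[datetime] = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # an alert for this task went out recently
        self._last_sent[key] = now
        return True
```

A failure callback could consult `should_send` before paging, so a task that fails on every retry produces one page rather than several. A shared store would be needed if callbacks run in more than one process.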
  23. OnCall Alert callback

  24. OnCall Alert callback

  25. Sample Alerts

  26. Sample Alerts

  27. Sample Alerts

  28. A little too quiet

  29. What is next?

  30. What is next?
    • Increase pipeline visibility (user action, task life cycle, etc.)
    • Data Lineage
    • Airflow Kubernetes Executors
  31. Thank You!