Operating Data Pipeline with Airflow @ Slack

This talk covers the incremental steps we took to solve on-call nightmares and Airflow scalability issues, making our data pipeline more reliable and simpler to operate.

Ananth Packkildurai

April 12, 2018

Transcript

  1. Operating Data Pipeline with Airflow
    Ananth Packkildurai
    April 11, 2018

  2. About Slack
    Public launch: 2014
    1000+ employees across 7 countries worldwide
    HQ in San Francisco
    Diverse set of industries, including software/technology, retail, media,
    telecom and professional services.

  3. March 2016
    5 Data Engineers
    350+ Slack employees
    2M Active users

  4. April 2018
    10 Data Engineers
    1000+ Slack employees
    6M Active users

  5. Data usage
    1 in 2 access data warehouse per week
    500+ Tables
    600k Events per sec at peak

  6. Airflow stats
    240+ Active DAGs
    5400+ Tasks per day
    68 Contributors

  7. Agenda
    1. Airflow Infrastructure
    2. Scale Airflow Executor
    3. Pipeline Operations
    4. Alerting and monitoring

  8. Airflow infrastructure

  9. Airflow infrastructure
    ● Local Executor
    ● Tarball code deployment
    ● Continuous deployment with Jenkins
    ● Flake8, yapf & pytest
    ● `airflow.sh` shell utility to ensure consistent development environment
    for all the users.

  10. Scale Airflow Executor

  11. Scale Airflow Executor
    It’s just Airflow being Airflow
    ● "Why is my task not running?"
    ● "Airflow deadlocked again."
    ● "Airflow is not scheduling any tasks."

  12. Airflow CPU usage

  13. Airflow Multi Retryable Sensors

  14. Retryable Sensors CPU usage
    [Figure: two load panels, "Non-Retryable Sensors Load" and "Retryable Sensors Load"]

  15. Pipeline Operations

  16. Airflow fallacies
    ● Upstream task success is reliable.
    ● A task remains static after it reaches the success state.
    ● The DAG structure is static.
    ● Data quality is not part of a task life cycle.

  17. Mario: Global DAG operator

  18. Airflow operations
    Hive Partition Sensor Operator:
      1. Check task success state
      2. Check Hive metastore for partition
      3. Check S3 path for the `_SUCCESS` file
    DQ Check
    DAG cleanup: delete_dag
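
The three checks listed for the Hive partition sensor could be chained roughly as below. This is a minimal sketch, not the operator from the talk: `partition_is_ready` and its three callables are hypothetical names, and each callable is assumed to wrap one lookup (Airflow task state, Hive metastore, S3) that the slide mentions but does not show.

```python
from typing import Callable


def partition_is_ready(
    task_succeeded: Callable[[], bool],
    partition_exists: Callable[[], bool],
    success_file_exists: Callable[[], bool],
) -> bool:
    """Return True only when all three readiness checks pass, in slide order.

    Hypothetical helper: each argument is a zero-argument callable standing
    in for one lookup (task state, Hive metastore, S3 `_SUCCESS` file).
    """
    checks = (task_succeeded, partition_exists, success_file_exists)
    # all() short-circuits, so later checks are skipped as soon as an
    # earlier one returns False.
    return all(check() for check in checks)


# Usage with stand-in lambdas in place of real lookups.
ready = partition_is_ready(lambda: True, lambda: True, lambda: False)
```

Passing the lookups in as callables keeps the sketch independent of any particular Airflow, Hive, or S3 client.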

  19. DAG Policy Validator
    test_external_tasks: Check if external tasks point to valid DAGs and tasks.
    test_circular_dependencies: Check if tasks have circular dependencies *across* DAGs.
    test_priority_weight: Check that production tasks do not depend on a lower priority task.
    test_on_failure: Require that high-priority DAGs have an on-failure alert.
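
A check like test_circular_dependencies can be pictured as cycle detection over a graph whose nodes are (DAG, task) pairs. The sketch below assumes the cross-DAG dependencies have already been collected into a plain dictionary; `find_cycle` and the `upstream` mapping are hypothetical and do not reflect the validator's actual code.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[str, str]  # (dag_id, task_id)


def find_cycle(upstream: Dict[Node, List[Node]]) -> Optional[List[Node]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    `upstream` maps each task to the tasks it waits on, possibly in other
    DAGs. Depth-first search with three states: unvisited, in progress, done.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[Node, int] = {}
    path: List[Node] = []

    def visit(node: Node) -> Optional[List[Node]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in upstream.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(upstream):
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Two DAGs whose tasks wait on each other.
deps = {
    ("dag_a", "load"): [("dag_b", "export")],
    ("dag_b", "export"): [("dag_a", "load")],
}
cycle = find_cycle(deps)
```

The recursive search is fine for a sketch; a very deep dependency chain would call for an iterative version.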

  20. DAG Policy Validator
    test_sla: Require that high-priority DAGs have an SLA.
    test_sla_timing: SLA timing should make sense. No job should depend on a task
    that has an equal or longer SLA than it does.
    test_has_retry_and_success_callbacks: Require an on_success_callback for tasks with an
    on_retry_callback.
    test_require_dq_for_prod: Require a DQ check for all high-priority tasks.
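
The test_sla_timing rule can be illustrated with a small check over task SLAs. This is a sketch under assumptions: `sla_timing_violations` and both input dictionaries are hypothetical, and skipping tasks without an SLA is a choice made here, not something the slide specifies.

```python
from datetime import timedelta
from typing import Dict, List, Optional, Tuple


def sla_timing_violations(
    sla: Dict[str, Optional[timedelta]],
    upstream: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """List (task, dependency) pairs where the dependency's SLA is equal to
    or longer than the SLA of the task that waits on it."""
    violations: List[Tuple[str, str]] = []
    for task, deps in upstream.items():
        task_sla = sla.get(task)
        if task_sla is None:
            continue  # assumption: tasks without an SLA are not checked
        for dep in deps:
            dep_sla = sla.get(dep)
            if dep_sla is not None and dep_sla >= task_sla:
                violations.append((task, dep))
    return violations


slas = {
    "report": timedelta(hours=2),
    "extract": timedelta(hours=3),
    "clean": timedelta(hours=1),
}
bad = sla_timing_violations(slas, {"report": ["extract", "clean"]})
```

Here `report` promises results within two hours but waits on `extract`, which is only promised within three, so that pair is flagged.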

  21. Alerting and Monitoring

  22. Alerting and Monitoring
    ● Alerting should be reliable.
    ● Alerts should be actionable.
    ● Alert when it really matters.
    ● Suppress repeated alerts.
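
One way to picture the last point is a per-key cooldown: after an alert fires for a given task, identical alerts are dropped until a window has passed. The class below is an illustrative sketch only; `AlertSuppressor`, the key format, and the one-hour window are assumptions, not the callback used in the talk.

```python
from typing import Dict


class AlertSuppressor:
    """Drop repeat alerts for the same key inside a cooldown window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str, now: float) -> bool:
        """Return True if an alert for `key` should go out at time `now`
        (seconds on any monotonically increasing clock)."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same alert fired recently, stay quiet
        self._last_sent[key] = now
        return True


suppressor = AlertSuppressor(window_seconds=3600)
first = suppressor.should_send("etl_dag.load_events", now=0)
repeat = suppressor.should_send("etl_dag.load_events", now=1800)
```

Taking `now` as an argument instead of reading the clock inside the method keeps the behaviour easy to test.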

  23. OnCall Alert callback

  24. OnCall Alert callback

  25. Sample Alerts

  26. Sample Alerts

  27. Sample Alerts

  28. A little too quiet

  29. What is next?

  30. What is next?
    ● Increase pipeline visibility (user action, task life cycle etc)
    ● Data Lineage
    ● Airflow Kubernetes Executors

  31. Thank You!