
Operating Data Pipeline with Airflow @ Slack


This talk covers the incremental steps we took to solve on-call nightmares and Airflow scalability issues, making our data pipeline more reliable and simpler to operate.


Ananth Packkildurai

April 12, 2018


  1. Operating Data Pipeline with Airflow. Ananth Packkildurai, April 11, 2018

  2. About Slack
     • Public launch: 2014
     • 1000+ employees across 7 countries worldwide, HQ in San Francisco
     • Diverse set of industries including software/technology, retail, media, telecom and professional services
  3. March 2016: 5 data engineers, 350+ Slack employees, 2M active users

  4. April 2018: 10 data engineers, 1000+ Slack employees, 6M active users

  5. Data usage
     • 1 in 2 employees access the data warehouse per week
     • 500+ tables
     • 600k events per second at peak
  6. Airflow stats
     • 240+ active DAGs
     • 5400+ tasks per day
     • 68

  7. Agenda
     1. Airflow Infrastructure
     2. Scale Airflow Executor
     3. Pipeline Operations
     4. Alerting and Monitoring
  8. Airflow infrastructure

  9. Airflow infrastructure
     • Local Executor
     • Tarball code deployment
     • Continuous deployment with Jenkins
     • Flake8, yapf & pytest
     • `airflow.sh` shell utility to ensure a consistent development environment for all users
  10. Scale Airflow Executor

  11. It’s just Airflow being Airflow
     • Why is my task not running?
     • Airflow deadlocked again.
     • Airflow is not scheduling any tasks.
  12. Airflow CPU usage

  13. Airflow Multi Retryable Sensors

  14. [Charts: retryable-sensor CPU usage; load under non-retryable vs. retryable sensors]
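The retryable-sensor idea can be sketched outside Airflow (all names here are hypothetical; the real implementation subclasses Airflow's sensor classes): instead of one long-lived sensor process poking for hours while holding a worker slot and burning CPU, each attempt pokes only a few times, then fails fast so the slot is freed, and the scheduler's normal retry machinery re-runs it later.

```python
import time

class SensorRetryTimeout(Exception):
    """Raised when an attempt's poke budget is spent; the scheduler retries later."""

class RetryableSensor:
    """Minimal sketch of a retryable sensor (illustrative, not Slack's code).

    poke_fn returns True once the awaited condition (partition, file, ...)
    exists. Each attempt pokes at most `pokes_per_attempt` times, then raises
    so the task fails fast and releases its worker slot between attempts.
    """

    def __init__(self, poke_fn, poke_interval=60, pokes_per_attempt=3, retries=48):
        self.poke_fn = poke_fn
        self.poke_interval = poke_interval
        self.pokes_per_attempt = pokes_per_attempt
        self.retries = retries

    def run_attempt(self):
        for poke in range(self.pokes_per_attempt):
            if self.poke_fn():
                return True
            if poke < self.pokes_per_attempt - 1:
                time.sleep(self.poke_interval)
        raise SensorRetryTimeout("condition not met in this attempt")

def run_with_retries(sensor):
    """Simulates the scheduler's retry loop; the slot is free between attempts."""
    for _attempt in range(sensor.retries + 1):
        try:
            return sensor.run_attempt()
        except SensorRetryTimeout:
            continue
    return False
```

The trade-off is latency (the condition is noticed at most one retry delay late) for throughput: worker slots and CPU are no longer pinned by idle sensors.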

  15. Pipeline Operations

  16. Airflow fallacies
     • Upstream task success is reliable.
     • Tasks remain static after reaching the success state.
     • The DAG structure is static.
     • Data quality is not part of the task life cycle.
  17. Mario: Global DAG operator

  18. Airflow operations
     • Hive Partition Sensor Operator:
       1. Check the task success state
       2. Check the Hive metastore for the partition
       3. Check the S3 path for the `_SUCCESS` file
     • DQ check
     • DAG cleanup: `delete_dag <dag name>`
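The three partition checks above can be sketched as one readiness function. This is a hedged reconstruction: the `metastore` and `s3` client interfaces are hypothetical stand-ins (injected so the logic is testable), not real library APIs.

```python
def partition_ready(task_state, metastore, s3, table, partition, s3_prefix):
    """Sketch of the slide's three checks, in order:
    1. the upstream task reached the `success` state,
    2. the Hive metastore has the partition registered,
    3. the S3 path contains the `_SUCCESS` marker file.
    `metastore` and `s3` are assumed client objects, not a real API.
    """
    if task_state != "success":
        return False
    if not metastore.has_partition(table, partition):
        return False
    return s3.exists("%s/%s/_SUCCESS" % (s3_prefix, partition))
```

Checking all three layers guards against the fallacy above that upstream task success alone is reliable: the task state, the metastore, and the files on S3 can each disagree.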
  19. DAG Policy Validator
     • test_external_tasks: check that external tasks point to valid DAGs and tasks.
     • test_circular_dependencies: check for circular dependencies in tasks *across* DAGs.
     • test_priority_weight: check that production tasks do not depend on a lower-priority task.
     • test_on_failure: require that high-priority DAGs have an on-failure alert.
  20. DAG Policy Validator
     • test_sla: require that high-priority DAGs have an SLA.
     • test_sla_timing: SLA timing should make sense; no job should depend on a task with an equal or longer SLA than its own.
     • test_has_retry_and_success_callbacks: require an on_success_callback for tasks with an on_retry_callback.
     • test_require_dq_for_prod: require a DQ check for all high-priority tasks.
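A check like test_circular_dependencies can be sketched as cycle detection over cross-DAG edges. This is a hypothetical reconstruction: in practice the dependency map would be collected from the deployed DAGs' external-task references, not written by hand.

```python
def find_cross_dag_cycle(deps):
    """Detect a cycle in cross-DAG dependencies.

    deps maps a "dag.task" id to the list of upstream "dag.task" ids it
    waits on. Returns one cycle as a list (first id repeated at the end),
    or None if the graph is acyclic. Plain DFS with three node colors.
    """
    nodes = set(deps)
    for ups in deps.values():
        nodes.update(ups)
    color = dict.fromkeys(nodes, 0)  # 0 = unvisited, 1 = in progress, 2 = done
    stack = []

    def visit(node):
        color[node] = 1
        stack.append(node)
        for up in deps.get(node, ()):
            if color[up] == 1:                      # back edge: cycle found
                return stack[stack.index(up):] + [up]
            if color[up] == 0:
                cycle = visit(up)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = 2
        return None

    for n in nodes:
        if color[n] == 0:
            cycle = visit(n)
            if cycle:
                return cycle
    return None
```

Running such checks in CI (alongside flake8, yapf and pytest, per the infrastructure slide) rejects a bad DAG at deploy time instead of letting the scheduler deadlock on it at runtime.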
  21. Alerting and Monitoring

  22. Alerting and Monitoring
     • Alerting should be reliable.
     • Alerts should be actionable.
     • Alert when it really matters.
     • Suppress repeatable alerts.
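The "suppress repeatable alerts" point can be sketched as a small dedup layer in front of the alert callback. The class and its fields are illustrative assumptions, not the actual Slack callback code: drop an alert if the same (dag, task, error) key already fired within a suppression window.

```python
import time

class AlertSuppressor:
    """Sketch of repeat-alert suppression (hypothetical, not Slack's code).

    should_send returns True only if no alert with the same key was sent
    within the last `window_seconds`. The clock is injectable for testing.
    """

    def __init__(self, window_seconds=3600, clock=time.time):
        self.window = window_seconds
        self.clock = clock
        self._last_sent = {}

    def should_send(self, dag_id, task_id, error_kind):
        key = (dag_id, task_id, error_kind)
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # same alert fired recently; suppress the repeat
        self._last_sent[key] = now
        return True
```

Suppressing repeats keeps the channel quiet enough that the alerts which do fire stay actionable, which is exactly the failure mode the "a little too quiet" slide warns about in the other direction.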
  23. OnCall Alert callback

  24. OnCall Alert callback

  25. Sample Alerts

  26. Sample Alerts

  27. Sample Alerts

  28. A little too quiet

  29. What is next?

  30. What is next?
     • Increase pipeline visibility (user actions, task life cycle, etc.)
     • Data lineage
     • Airflow Kubernetes Executor
  31. Thank You!