Operating Data Pipeline with Airflow @ Slack

Slide 1

Slide 1 text

Operating Data Pipeline with Airflow
Ananth Packkildurai
April 11, 2018

Slide 2

Slide 2 text

About Slack
● Public launch: 2014
● 1000+ employees across 7 countries worldwide
● HQ in San Francisco
● Diverse set of industries including software/technology, retail, media, telecom and professional services

Slide 3

Slide 3 text

March 2016
● 5 Data Engineers
● 350+ Slack employees
● 2M Active users

Slide 4

Slide 4 text

April 2018
● 10 Data Engineers
● 1000+ Slack employees
● 6M Active users

Slide 5

Slide 5 text

Data usage
● 1 in 2 access the data warehouse per week
● 500+ Tables
● 600k Events per sec at peak

Slide 6

Slide 6 text

Airflow stats
● 240+ Active DAGs
● 5400+ Tasks per day
● 68 Contributors

Slide 7

Slide 7 text

Agenda
1. Airflow Infrastructure
2. Scale Airflow Executor
3. Pipeline Operations
4. Alerting and Monitoring

Slide 8

Slide 8 text

Airflow infrastructure

Slide 9

Slide 9 text

Airflow infrastructure
● Local Executor
● Tarball code deployment
● Continuous deployment with Jenkins
● Flake8, yapf & pytest
● `airflow.sh` shell utility to ensure a consistent development environment for all users

Slide 10

Slide 10 text

Scale Airflow Executor

Slide 11

Slide 11 text

Scale Airflow Executor
It’s just Airflow being Airflow
● Why is my task not running?
● Airflow deadlock again
● Airflow not scheduling any tasks

Slide 12

Slide 12 text

Airflow CPU usage

Slide 13

Slide 13 text

Airflow Multi Retryable Sensors

Slide 14

Slide 14 text

Retryable Sensors CPU usage (chart; panels: Non-Retryable Sensors Load, Retryable Sensors Load)

Slide 15

Slide 15 text

Pipeline Operations

Slide 16

Slide 16 text

Airflow fallacies
● The upstream task success is reliable.
● The task remains static after reaching the success state.
● The DAG structure is static.
● Data quality is not part of a task life cycle.

Slide 17

Slide 17 text

Mario: Global DAG operator

Slide 18

Slide 18 text

Airflow operations
Hive Partition Sensor Operator
1. Check task success state
2. Check Hive metastore for partition
3. Check S3 path for the `_SUCCESS` file
DQ Check
DAG cleanup
● delete_dag
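The three checks listed for the Hive partition sensor can be combined into a single readiness test. The sketch below is only an illustration in plain Python: `partition_ready`, the `Probe` type, and the stand-in probes are hypothetical names that do not come from the slides, and a real sensor would back the probes with the Airflow metadata database, the Hive metastore, and S3.

```python
from typing import Callable, List, Tuple

# Hypothetical probe type: a callable that returns True when its condition holds.
Probe = Callable[[], bool]


def partition_ready(checks: List[Tuple[str, Probe]]) -> Tuple[bool, str]:
    """Run the checks in order and stop at the first failure.

    Returns (True, "") when every check passes, otherwise
    (False, name_of_first_failed_check).
    """
    for name, probe in checks:
        if not probe():
            return False, name
    return True, ""


if __name__ == "__main__":
    # Stand-in probes; the return values are assumptions for demonstration only.
    checks = [
        ("task_success_state", lambda: True),
        ("hive_metastore_partition", lambda: True),
        ("s3_success_file", lambda: False),
    ]
    ready, failed_check = partition_ready(checks)
    print(ready, failed_check)
```

Returning the name of the failed check, rather than a bare boolean, makes it easier to see which of the three signals disagreed when a partition is not ready.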

Slide 19

Slide 19 text

DAG Policy Validator
● test_external_tasks: Check if external tasks point to valid DAGs and tasks.
● test_circular_dependencies: Check if tasks have circular dependencies *across* DAGs.
● test_priority_weight: Check that production tasks do not depend on a lower priority task.
● test_on_failure: Require that high-priority DAGs have an on-failure alert.
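As a rough illustration of what checks like test_external_tasks and test_circular_dependencies might look like, the sketch below works on a plain dictionary instead of real Airflow objects. The `Graph` representation, `find_invalid_external_tasks`, and `has_cycle` are hypothetical names chosen for this example, not the validator's actual code.

```python
from typing import Dict, List, Tuple

TaskKey = Tuple[str, str]             # (dag_id, task_id)
Graph = Dict[TaskKey, List[TaskKey]]  # task -> tasks it depends on, possibly in other DAGs


def find_invalid_external_tasks(graph: Graph) -> List[Tuple[TaskKey, TaskKey]]:
    """Return (task, dependency) pairs whose dependency is not a known task."""
    return [
        (task, dep)
        for task, deps in graph.items()
        for dep in deps
        if dep not in graph
    ]


def has_cycle(graph: Graph) -> bool:
    """Depth-first search with three colours; reaching a grey node means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node: TaskKey) -> bool:
        colour[node] = GREY
        for dep in graph[node]:
            # Unknown dependencies are treated as finished here; they are
            # reported separately by find_invalid_external_tasks.
            state = colour.get(dep, BLACK)
            if state == GREY:
                return True
            if state == WHITE and visit(dep):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


if __name__ == "__main__":
    # Two hypothetical DAGs that wait on each other.
    graph = {
        ("dag_a", "load"): [("dag_b", "export")],
        ("dag_b", "export"): [("dag_a", "load")],
    }
    print(has_cycle(graph), find_invalid_external_tasks(graph))
```

Keying every node by (dag_id, task_id) is what lets the same search catch cycles that span several DAGs, which a per-DAG check would miss.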

Slide 20

Slide 20 text

DAG Policy Validator
● test_sla: Require that high-priority DAGs have an SLA.
● test_sla_timing: SLA timing should make sense. No job should depend on a task that has an equal or longer SLA than it does.
● test_has_retry_and_success_callbacks: Require an on_success_callback for tasks with an on_retry_callback.
● test_require_dq_for_prod: Require a DQ check for all high-priority tasks.
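The test_sla_timing rule can be expressed as a comparison between each task's SLA and the SLAs of the tasks it depends on. The following is a minimal sketch under stated assumptions: the dictionaries, the function name `find_sla_timing_violations`, and the choice to skip tasks without an SLA are all hypothetical and not taken from the slides.

```python
from datetime import timedelta
from typing import Dict, List, Optional, Tuple


def find_sla_timing_violations(
    slas: Dict[str, Optional[timedelta]],
    upstream: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (task, dependency) pairs where the dependency's SLA is
    equal to or longer than the SLA of the task that depends on it.

    Assumption: tasks with no SLA are skipped rather than flagged.
    """
    violations = []
    for task, deps in upstream.items():
        task_sla = slas.get(task)
        if task_sla is None:
            continue
        for dep in deps:
            dep_sla = slas.get(dep)
            if dep_sla is not None and dep_sla >= task_sla:
                violations.append((task, dep))
    return violations


if __name__ == "__main__":
    # Hypothetical tasks: "report" depends on "load" and "late_feed".
    slas = {
        "load": timedelta(hours=1),
        "late_feed": timedelta(hours=2),
        "report": timedelta(hours=2),
    }
    upstream = {"report": ["load", "late_feed"], "load": []}
    print(find_sla_timing_violations(slas, upstream))
```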

Slide 21

Slide 21 text

Alerting and Monitoring

Slide 22

Slide 22 text

Alerting and Monitoring
● Alerting should be reliable.
● Alerts should be actionable.
● Alert when it really matters.
● Suppress repeated alerts.
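One simple way to keep the same alert from firing over and over is to remember when each alert key was last sent and drop repeats inside a time window. The sketch below is an assumption about how that could be done, not a description of the alerting system in the slides; `AlertSuppressor`, `should_send`, and the key format are hypothetical.

```python
import time
from typing import Callable, Dict


class AlertSuppressor:
    """Allow an alert key through at most once per time window."""

    def __init__(
        self,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_sent: Dict[str, float] = {}

    def should_send(self, alert_key: str) -> bool:
        """Return True if the alert should go out, False if it is a repeat."""
        now = self._clock()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.window_seconds:
            return False  # same alert was sent recently; suppress it
        self._last_sent[alert_key] = now
        return True


if __name__ == "__main__":
    suppressor = AlertSuppressor(window_seconds=3600)
    # Hypothetical key built from a DAG id and a task id.
    for _ in range(3):
        print(suppressor.should_send("example_dag.example_task"))
```

A suppressed repeat does not refresh the window in this sketch, so a task that keeps failing still produces one alert per window instead of going silent.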

Slide 23

Slide 23 text

OnCall Alert callback

Slide 24

Slide 24 text

OnCall Alert callback

Slide 25

Slide 25 text

Sample Alerts

Slide 26

Slide 26 text

Sample Alerts

Slide 27

Slide 27 text

Sample Alerts

Slide 28

Slide 28 text

A little too quiet

Slide 29

Slide 29 text

What is next?

Slide 30

Slide 30 text

What is next?
● Increase pipeline visibility (user actions, task life cycle, etc.)
● Data Lineage
● Airflow Kubernetes Executors

Slide 31

Slide 31 text

Thank You!