Ananth Packkildurai
April 11, 2018
Operating Data Pipeline with Airflow
Slide 2
Slide 2 text
About Slack
● Public launch: 2014
● 1000+ employees across 7 countries worldwide
● HQ in San Francisco
● Diverse set of industries, including software/technology, retail, media, telecom and professional services.
Slide 3
Slide 3 text
March 2016
● 5 Data Engineers
● 350+ Slack employees
● 2M Active users
Slide 4
Slide 4 text
April 2018
● 10 Data Engineers
● 1000+ Slack employees
● 6M Active users
Slide 5
Slide 5 text
Data usage
● 1 in 2 access data warehouse per week
● 500+ Tables
● 600k Events per sec at peak
Slide 6
Slide 6 text
Airflow stats
● 240+ Active DAGs
● 5400+ Tasks Per Day
● 68 Contributors
Airflow infrastructure
● Local Executor
● Tarball code deployment
● Continuous deployment with Jenkins
● Flake8, yapf & pytest
● `airflow.sh` shell utility to ensure a consistent development environment
for all users.
Slide 10
Slide 10 text
Scale Airflow Executor
Slide 11
Slide 11 text
Scale Airflow Executor
It’s just Airflow being Airflow
● Why is my task not running?
● Airflow deadlock again
● Airflow is not scheduling any tasks
Slide 12
Slide 12 text
Airflow CPU usage
Slide 13
Slide 13 text
Airflow Multi Retryable Sensors
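The slide gives only the title, so the following is a minimal sketch of one plausible reading of a "retryable" sensor: each attempt checks the condition once and fails fast, and the wait happens between retries rather than inside a running task. `RetryableSensor`, `SensorNotReady` and `run_with_retries` are hypothetical names written in plain Python; they are not Airflow classes, and the loop only stands in for the scheduler's retry handling.

```python
import time
from dataclasses import dataclass
from typing import Callable


class SensorNotReady(Exception):
    """Raised when the condition the sensor waits for is not met yet."""


@dataclass
class RetryableSensor:
    # Hypothetical model for illustration, not an Airflow class.
    check: Callable[[], bool]  # returns True once the upstream data is ready
    retries: int = 3           # extra attempts allowed after the first one

    def execute(self) -> None:
        # One short attempt: check once and fail fast instead of sleeping
        # inside the task, so the executor slot is released between tries.
        if not self.check():
            raise SensorNotReady("condition not met yet")


def run_with_retries(
    sensor: RetryableSensor,
    retry_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Stand-in for the scheduler's retry loop; returns the attempt that succeeded."""
    total_attempts = sensor.retries + 1
    for attempt in range(1, total_attempts + 1):
        try:
            sensor.execute()
            return attempt
        except SensorNotReady:
            if attempt == total_attempts:
                raise  # retries exhausted: surface the failure
            sleep(retry_delay)  # no task is running while we wait
    raise AssertionError("unreachable when retries >= 0")
```

With a condition that becomes true on the third check, `run_with_retries` returns 3; a condition that never becomes true raises `SensorNotReady` once the retries are used up.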
Slide 14
Slide 14 text
Retryable Sensors CPU usage
Non-Retryable Sensors Load vs. Retryable Sensors Load
Slide 15
Slide 15 text
Pipeline Operations
Slide 16
Slide 16 text
Airflow fallacies
● The upstream task success is reliable.
● The task remains static after the success state.
● The DAG structure is static.
● Data quality is not part of a task's life cycle.
Slide 17
Slide 17 text
Mario: Global DAG operator
Slide 18
Slide 18 text
Hive Partition Sensor Operator
Airflow operations
1. Check task success state
2. Check Hive metastore for partition
3. Check S3 path for the `_SUCCESS` file
DQ Check
DAG cleanup delete_dag
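The three numbered checks above can be sketched as a single readiness function. This is a plain-Python illustration, not the operator from the talk: `partition_ready` and its three callable parameters are hypothetical, and the real lookups (Airflow task state, Hive metastore, S3) are assumed to be supplied by the caller.

```python
from typing import Callable


def partition_ready(
    task_succeeded: Callable[[], bool],
    metastore_has_partition: Callable[[], bool],
    s3_has_success_file: Callable[[], bool],
) -> bool:
    """Return True only when all three checks pass, evaluated in the listed order."""
    checks = (task_succeeded, metastore_has_partition, s3_has_success_file)
    # all() short-circuits: a later check is skipped as soon as an earlier one fails.
    return all(check() for check in checks)
```

Passing the checks in as callables keeps the ordering explicit and lets each lookup be replaced by a stub in tests.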
Slide 19
Slide 19 text
DAG Policy Validator
● test_external_tasks: Check if external tasks point to valid DAGs and tasks.
● test_circular_dependencies: Check if tasks have circular dependencies *across* DAGs.
● test_priority_weight: Check that production tasks do not depend on a lower priority task.
● test_on_failure: Require that high-priority DAGs have an on-failure alert.
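As an illustration of the first two checks, here is a minimal sketch that treats every task as a `(dag_id, task_id)` pair and the dependencies as a plain dictionary. The graph format and the helper names `missing_external_tasks` and `find_cycle` are assumptions made for this example; the validator shown in the talk is not reproduced here.

```python
from typing import Dict, List, Set, Tuple

Task = Tuple[str, str]          # (dag_id, task_id)
Graph = Dict[Task, List[Task]]  # task -> tasks it depends on, possibly in other DAGs


def missing_external_tasks(graph: Graph) -> Set[Task]:
    """Dependencies that point at a (dag_id, task_id) that is not defined anywhere."""
    return {dep for deps in graph.values() for dep in deps if dep not in graph}


def find_cycle(graph: Graph) -> List[Task]:
    """Return one dependency cycle as a list of tasks, or [] if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {task: WHITE for task in graph}
    path: List[Task] = []

    def visit(task: Task) -> List[Task]:
        color[task] = GREY
        path.append(task)
        for dep in graph[task]:
            state = color.get(dep, BLACK)  # undefined tasks cannot be part of a cycle
            if state == GREY:
                # dep is already on the current path: close the loop and report it
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[task] = BLACK
        return []

    for task in graph:
        if color[task] == WHITE:
            cycle = visit(task)
            if cycle:
                return cycle
    return []
```

Because nodes are keyed by DAG and task together, a loop that passes through several DAGs is found the same way as one inside a single DAG.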
Slide 20
Slide 20 text
DAG Policy Validator
● test_sla: Require that high-priority DAGs have an SLA.
● test_sla_timing: SLA timing should make sense. No job should depend on a task that has an equal or longer SLA than it does.
● test_has_retry_and_success_callbacks: Require an on_success_callback for tasks with an on_retry_callback.
● test_require_dq_for_prod: Require a DQ check for all high-priority tasks.
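The SLA timing rule above can be expressed as a small check. This is a sketch under assumed inputs: `slas` maps a task name to its SLA and `upstream` maps a task name to the tasks it depends on; both structures and the function name `sla_timing_violations` are hypothetical.

```python
from datetime import timedelta
from typing import Dict, List, Tuple


def sla_timing_violations(
    slas: Dict[str, timedelta],
    upstream: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (task, upstream_task) pairs where the upstream SLA is equal or longer."""
    violations = []
    for task, deps in upstream.items():
        if task not in slas:
            continue  # tasks without an SLA are outside this rule
        for dep in deps:
            # The dependency must have a strictly shorter SLA than its dependent.
            if dep in slas and slas[dep] >= slas[task]:
                violations.append((task, dep))
    return violations
```

For example, a report with a 2-hour SLA that depends on a load with a 3-hour SLA is reported as `("report", "load")`.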
Slide 21
Slide 21 text
Alerting and Monitoring
Slide 22
Slide 22 text
Alerting and Monitoring
● Alerting should be reliable.
● Alerts should be actionable.
● Alert when it really matters.
● Suppress repeatable alerts.
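One way to suppress repeated alerts is to remember when each alert last fired and drop repeats inside a time window. The sketch below assumes alerts are keyed by DAG and task; the class name `AlertSuppressor` and the window-based policy are illustrative choices, not the mechanism described in the talk.

```python
from typing import Dict, Tuple


class AlertSuppressor:
    """Let an alert through only if the same key has not fired within the window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def should_send(self, dag_id: str, task_id: str, now: float) -> bool:
        key = (dag_id, task_id)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same alert fired recently: suppress the repeat
        self._last_sent[key] = now
        return True
```

The caller passes `now` explicitly (for example from `time.time()`), which keeps the suppression logic deterministic and easy to test.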
Slide 23
Slide 23 text
OnCall Alert callback
Slide 24
Slide 24 text
OnCall Alert callback
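The callback code from these slides did not survive extraction, so the following is a hedged sketch rather than the original. Airflow calls an `on_failure_callback` with a context dictionary that includes `task_instance` and `exception`; the message format, `build_oncall_message`, `make_oncall_callback` and the `page` function are assumptions for illustration, and the fake context at the end stands in for what Airflow would pass.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict


def build_oncall_message(context: Dict[str, Any]) -> str:
    """Format an actionable message from the context Airflow hands to callbacks."""
    ti = context["task_instance"]
    error = context.get("exception")
    return f"[airflow] {ti.dag_id}.{ti.task_id} failed (try {ti.try_number}): {error}"


def make_oncall_callback(page: Callable[[str], None]) -> Callable[[Dict[str, Any]], None]:
    """Build a callback; `page` is a hypothetical function that notifies on-call."""
    def on_failure(context: Dict[str, Any]) -> None:
        page(build_oncall_message(context))
    return on_failure


# Usage with a fake context instead of a real Airflow task instance.
sent = []
callback = make_oncall_callback(sent.append)
fake_context = {
    "task_instance": SimpleNamespace(dag_id="daily_rollup", task_id="load", try_number=2),
    "exception": ValueError("partition missing"),
}
callback(fake_context)
```

In a DAG definition, the returned function would be passed as `on_failure_callback`, with `page` wired to whatever paging service is in use.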
Slide 25
Slide 25 text
Sample Alerts
Slide 26
Slide 26 text
Sample Alerts
Slide 27
Slide 27 text
Sample Alerts
Slide 28
Slide 28 text
A little too quiet
Slide 29
Slide 29 text
What is next?
Slide 30
Slide 30 text
What is next?
● Increase pipeline visibility (user actions, task life cycle, etc.)
● Data Lineage
● Airflow Kubernetes Executors