Slide 1

Using Apache Airflow as a platform for data engineering frameworks
Arthur Wiedmer, Airbnb
Airflow Meetup, Jan 11th, 2017

Slide 2

Contents
- Airflow @ Airbnb
- Data Engineering: From Pipelines to Frameworks
- Incremental compute framework
- Backfill Framework
- Experimentation Framework
- And more

Slide 3

Quick Intro
I work on the Data Platform team, which is responsible both for maintaining data infrastructure (clusters, Hadoop, HBase, Hive, Spark and of course Airflow) and for the data pipelines powering the core business metrics in our data warehouse. I started working on Airflow when I joined Airbnb in 2014. I have since worked both on improving Airflow via open source PRs and on contributing to internal tools and frameworks.

Slide 4

Airflow @ Airbnb

Slide 5

Airflow @ Airbnb
- Airflow currently runs 800+ DAGs and between 40k and 80k+ tasks a day.
- We have monthly, daily, hourly and 10-minute granularity DAGs.
- About 100 people have authored or contributed to a DAG.
- Between 400 and 500 have contributed to or modified a framework config.
- We use the CeleryExecutor with Redis as a backend.

Slide 6

Data Engineering Frameworks

Slide 7

Airflow is dynamic
Airflow's DSL on top of Python makes it fairly flexible for dynamic DAGs: you can create tasks dynamically.

# TABLES, db_name, ds and dag are assumed to be defined earlier in this file,
# so that .format(**locals()) can resolve them.
for table_name in TABLES:
    HiveOperator(
        task_id=table_name + '_curr',
        hql="""
            DROP TABLE IF EXISTS {db_name}.{table_name}_curr;
            CREATE TABLE {db_name}.{table_name}_curr AS
            SELECT * FROM {db_name}.{table_name}
            WHERE ds = '{ds}';
        """.format(**locals()),
        dag=dag)

Slide 8

Airflow is dynamic
Airflow's DSL on top of Python makes it fairly flexible for dynamic DAGs: you can even create DAGs dynamically.

for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)

Or better, create a DAG factory:

from airflow import DAG
# ^ Important because the DagBag will ignore files that do not import DAG

for c in confs:  # confs: config dicts loaded by the framework
    dag_id = '{}_dag'.format(c["job_name"])
    backfill_dag = BackfillDAGFactory.build_dag(dag_id, conf=c)
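
What such a factory could look like inside: the sketch below is an illustration only. The BackfillDAGFactory name comes from the slide, but its build_dag body, the config keys and the operator import path are assumptions and vary across Airflow versions. Note that each DAG the factory returns still needs to end up at module level (for example via globals()) so the DagBag can discover it.

# Hypothetical sketch of a DAG factory; not the actual Airbnb implementation.
from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator  # path differs in older versions


class BackfillDAGFactory(object):

    @classmethod
    def build_dag(cls, dag_id, conf):
        """Turn one config dict into one DAG with a single Hive task."""
        dag = DAG(
            dag_id=dag_id,
            start_date=datetime.strptime(conf['start_date'], '%Y-%m-%d'),
            schedule_interval='@daily')
        HiveOperator(
            task_id='run_query',
            hql=conf['query'],  # may still contain {{ ds }} for Airflow to template
            dag=dag)
        return dag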

Slide 9

We can automate away (part of) our jobs and write frameworks to write pipelines!

Slide 10

The Incremental Computation Frameworks

Slide 11

Overcoming antipatterns in the warehouse
A data scientist wants to know the last time a user booked with us. He writes:

-- antipattern
SELECT
    user_id
  , MAX(booking_date) AS most_recent_booking_date
FROM db.bookings
WHERE
  -- all prior days = 1000s of partitions!
  ds <= '{{ ds }}'
GROUP BY user_id

Slide 12

But this is not the only example
Another common antipattern was

-- antipattern
SELECT
    user_id
  , SUM(bookings) OVER (
      PARTITION BY user_id
      ORDER BY ds
      ROWS UNBOUNDED PRECEDING) AS total_bookings_to_date
FROM db.bookings
WHERE ds <= '{{ ds }}'

for cumulative sums on the number of bookings, reviews, etc.

Slide 13

A more efficient pattern

-- efficient pattern: sum over today's data and cumsum through yesterday
SELECT
    u.user_id
  , SUM(u.bookings) AS c_bookings
FROM (
    SELECT
        user_id
      , bookings
    FROM db.bookings
    WHERE ds = '{{ ds }}'  -- today's data
    UNION ALL
    SELECT
        user_id
      , c_bookings AS bookings
    FROM cumsum.bookings_by_users
    WHERE ds = DATE_SUB('{{ ds }}', 1)  -- cumsum through yesterday
) u
GROUP BY u.user_id

Slide 14

Incremental computation as a service
We can use a config-driven framework to compute those values. In this case we use HOCON, a more human-friendly superset of JSON, to store the information we need.

// cumsum configuration file: bookings_by_users.conf
{
  query = """
    SELECT
        user_id
      , bookings AS c_bookings
    FROM db.bookings
    WHERE ds = '{{ ds }}'
  """
  dependencies = [
    { table: bookings }
  ]
  start_date = "2011-01-01"
}
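
A minimal sketch of how a framework like this could expand each config file into a DAG. Everything here is an assumption for illustration: the pyhocon parsing, the config directory, the hardcoded user_id/c_bookings columns and the generated HQL are stand-ins, not the actual framework.

import glob
from datetime import datetime

from pyhocon import ConfigFactory  # HOCON parser for Python
from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

# Wrap the config's daily query in the "efficient pattern" from the previous slide:
# today's increment UNION ALL yesterday's cumulative partition.
CUMSUM_HQL = """
INSERT OVERWRITE TABLE cumsum.{table} PARTITION (ds = '{{{{ ds }}}}')
SELECT user_id, SUM(c_bookings) AS c_bookings
FROM (
    {query}
    UNION ALL
    SELECT user_id, c_bookings
    FROM cumsum.{table}
    WHERE ds = DATE_SUB('{{{{ ds }}}}', 1)
) u
GROUP BY user_id
"""

for path in glob.glob('/path/to/cumsum/configs/*.conf'):  # hypothetical location
    conf = ConfigFactory.parse_file(path)
    table = path.split('/')[-1].replace('.conf', '')      # e.g. bookings_by_users
    dag_id = 'cumsum_{}'.format(table)
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime.strptime(conf['start_date'], '%Y-%m-%d'),
        schedule_interval='@daily')
    HiveOperator(
        task_id='compute_' + table,
        hql=CUMSUM_HQL.format(table=table, query=conf['query']),
        dag=dag)
    globals()[dag_id] = dag  # register so the DagBag picks the DAG up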

Slide 15

Backfill Framework

Slide 16

Backfilling
Oh no, the definition of a metric used in the warehouse needs to change! We would like to:
- Test the new logic on the source data.
- Check the resulting data for issues and compare it with the old data.
- Replace the old data with the new.
And, of course, we would love to delegate as much of this as possible to Airflow (a sketch of what such a DAG could look like follows).
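
A minimal sketch of the shape such a backfill DAG could take, assuming a stage-check-swap flow. The table names, columns, operator choices and the comparison step are illustrative assumptions, not the actual framework.

from datetime import datetime

from airflow import DAG
from airflow.hooks.hive_hooks import HiveServer2Hook
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG('backfill_bookings_v2',  # hypothetical DAG
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# 1. Test the new logic by writing into a staging table.
stage = HiveOperator(
    task_id='stage_new_logic',
    hql="""
        INSERT OVERWRITE TABLE staging.bookings_v2 PARTITION (ds = '{{ ds }}')
        SELECT user_id, booking_date  -- ... with the new metric definition applied
        FROM db.bookings
        WHERE ds = '{{ ds }}'
    """,
    dag=dag)


# 2. Check the staged partition and compare it with the old data.
def compare_old_vs_new(ds, **kwargs):
    """Naive check: compare row counts of the old and new partitions."""
    hook = HiveServer2Hook()
    old = hook.get_records(
        "SELECT COUNT(*) FROM warehouse.bookings WHERE ds = '{}'".format(ds))[0][0]
    new = hook.get_records(
        "SELECT COUNT(*) FROM staging.bookings_v2 WHERE ds = '{}'".format(ds))[0][0]
    if old and abs(new - old) / float(old) > 0.01:  # > 1% drift fails the backfill
        raise ValueError('Row counts diverge: old={}, new={}'.format(old, new))


check = PythonOperator(
    task_id='compare_old_vs_new',
    python_callable=compare_old_vs_new,
    provide_context=True,
    dag=dag)

# 3. Replace the old data with the new, only after the check passes.
swap = HiveOperator(
    task_id='swap_in_new_data',
    hql="""
        INSERT OVERWRITE TABLE warehouse.bookings PARTITION (ds = '{{ ds }}')
        SELECT user_id, booking_date
        FROM staging.bookings_v2
        WHERE ds = '{{ ds }}'
    """,
    dag=dag)

stage.set_downstream(check)
check.set_downstream(swap)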

Slide 17

Enter the Backfill Framework

Slide 18

Enter the Backfill Framework

Slide 19

Experimentation Reporting Framework

Slide 20

The Experimentation Reporting Framework
We use Airflow to power the experimentation reporting framework (ERF). In 2014, the ERF was a large Ruby script doing dynamic SQL generation from a single YAML file defining metrics, sources and experiments, rendering several thousand lines of SQL in a single script. The Hive job would finish, once in a while. A large refactoring effort led to a rewrite in Python, into several dynamic Airflow DAGs. It is our largest DAG by compute time and possibly by number of tasks (approximately 8,000 per DAG run). Airflow is worth having just for the ability to run our experiments reliably, and to debug when things go wrong. I would show it to you, but Airflow cannot currently render the full graph :)
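
To give a sense of where the task count comes from, dynamic generation is roughly a cross product over configs. The sketch below is purely illustrative: EXPERIMENTS, METRICS and render_aggregation_hql are hypothetical stand-ins for the ERF's config parsing and dynamic SQL generation.

from datetime import datetime

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

# Hypothetical stand-ins; the real lists come from the metric/experiment config files.
EXPERIMENTS = [{'name': 'new_search_ranking'}, {'name': 'new_signup_flow'}]
METRICS = [{'name': 'bookings'}, {'name': 'reviews'}]


def render_aggregation_hql(experiment, metric):
    # Stand-in for the framework's dynamic SQL generation step.
    return "SELECT '{}' AS experiment, '{}' AS metric".format(
        experiment['name'], metric['name'])


dag = DAG('experimentation_reporting',  # hypothetical DAG id
          start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# One aggregation task per (experiment, metric) pair: with hundreds of metrics,
# many live experiments and several dimension cuts, the cross product adds up fast.
for experiment in EXPERIMENTS:
    for metric in METRICS:
        HiveOperator(
            task_id='agg_{}_{}'.format(experiment['name'], metric['name']),
            hql=render_aggregation_hql(experiment, metric),
            dag=dag)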

Slide 21

The Experimentation Reporting Framework
[Architecture diagram. Components: Metric Config Files, Sources/Subjects, Aggregations, Experiments, Dimension Cuts, Results Pipeline, Assignments Pipeline, Experiment DB, Experiment Data.]

Slide 22

And More

Slide 23

And More
We use dynamic generation for more use cases:
- Operations (canary jobs, SLA monitoring, ...): see the canary sketch after this list.
- Sqoop exports of production databases
- Data Quality (automated stats collection on tables, anomaly detection)
- Scheduled report generation for analysts (AutoDAG)
- Dynamic SQL generation and PII stripping for anonymized tables
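
As one concrete example of the operations use case, a canary DAG can be as small as a single trivial task with an SLA attached. This is an assumption about what such a job might look like, not the actual operations DAG.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.hive_operator import HiveOperator

dag = DAG('hive_canary',  # hypothetical canary DAG
          start_date=datetime(2016, 1, 1),
          schedule_interval='@hourly')

# A trivial query: if this task misses its SLA, Hive or the scheduler is unhealthy.
# Airflow records the SLA miss and can send an alert.
HiveOperator(
    task_id='select_one',
    hql='SELECT 1',
    sla=timedelta(minutes=30),
    dag=dag)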

Slide 24

Questions?