
Using Apache Airflow as a platform for data engineering frameworks

Analysis automation and analytic services are the future of data engineering!
Apache Airflow's DSL makes it natural to build complex DAGs of tasks dynamically, and Airbnb has been leveraging this feature in intricate ways,
creating a wide array of frameworks as dynamic workflows. In this talk, we'll explain the mechanics of dynamic pipeline generation using Apache Airflow (incubating) and present advanced use cases that have been developed at Airbnb, going from simple frameworks to more complex ones.

Arthur Wiedmer

January 11, 2017

Transcript

  1. Using Apache Airflow as a platform for data engineering frameworks
     Arthur Wiedmer
     Airbnb
     Airflow Meetup, Jan 11th, 2017

  2. Contents
     Airflow @ Airbnb
     Data Engineering: From Pipelines to Frameworks
     Incremental compute framework
     Backfill Framework
     Experimentation Framework
     And more

  3. Quick Intro
     I work on the Data Platform team, which is responsible both for
     maintaining data infrastructure (clusters, Hadoop, HBase, Hive,
     Spark and, of course, Airflow) and for the data pipelines powering
     the core business metrics in our data warehouse.
     I started working on Airflow when I joined Airbnb in 2014. I have
     since worked both on improving Airflow via open-source PRs and on
     contributing to internal tools and frameworks.

  4. Airflow @ Airbnb

  5. Airflow @ Airbnb
     Airflow currently runs around 800 DAGs, and between 40k and 80k+
     tasks a day.
     We have monthly, daily, hourly and 10-minute granularity DAGs.
     About 100 people have authored or contributed to a DAG. Between 400
     and 500 have contributed to or modified a framework config.
     We use the CeleryExecutor with Redis as a backend.

  6. Data Engineering Frameworks

  7. Airflow is dynamic
     Airflow's DSL on top of Python makes it fairly flexible for
     dynamic DAGs:
     You can create tasks dynamically.

     # db_name and TABLES are assumed to be defined earlier in the file
     for table_name in TABLES:
         HiveOperator(
             task_id=table_name + '_curr',
             # {{{{ ds }}}} survives .format() as {{ ds }}, which
             # Airflow's Jinja templating then fills in at run time
             hql="""
                 DROP TABLE IF EXISTS {db_name}.{table_name}_curr;
                 CREATE TABLE {db_name}.{table_name}_curr AS
                 SELECT * FROM {db_name}.{table_name}
                 WHERE ds = '{{{{ ds }}}}';
             """.format(**locals()),
             dag=dag)

  8. Airflow is dynamic
     Airflow's DSL on top of Python makes it fairly flexible for
     dynamic DAGs:
     You can even create DAGs dynamically.

     for i in range(10):
         dag_id = 'foo_{}'.format(i)
         globals()[dag_id] = DAG(dag_id)

     Or, better, create a DAG factory:

     from airflow import DAG
     # ^ Important: the DagBag ignores files that do not import DAG

     for c in confs:
         dag_id = '{}_dag'.format(c["job_name"])
         backfill_dag = BackfillDAGFactory.build_dag(dag_id, conf=c)
         # register each DAG at module level so the DagBag finds all of
         # them, not just the last loop iteration's
         globals()[dag_id] = backfill_dag
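
     The factory itself is not shown on the slide. A minimal sketch of
     what a build_dag classmethod could look like, assuming a config
     dict like the ones shown later (the class name comes from the
     slide; everything inside it is an assumption, not Airbnb's
     implementation):

     from datetime import datetime

     from airflow import DAG
     from airflow.operators.dummy_operator import DummyOperator


     class BackfillDAGFactory(object):
         """Hypothetical sketch of a config-to-DAG factory."""

         @classmethod
         def build_dag(cls, dag_id, conf):
             dag = DAG(
                 dag_id,
                 start_date=datetime.strptime(conf['start_date'], '%Y-%m-%d'),
                 schedule_interval='@daily')
             # A real factory would translate the rest of the config into
             # a graph of operators; a placeholder task stands in here.
             DummyOperator(task_id='placeholder', dag=dag)
             return dag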

  9. We can automate away (part of) our
    jobs and write frameworks to write
    pipelines!

  10. The Incremental Computation Framework

  11. Overcoming antipatterns in the warehouse
      A data scientist wants to know the last time a user booked with
      us. He writes:

      -- antipattern
      SELECT
          user_id
        , MAX(booking_date) AS most_recent_booking_date
      FROM
          db.bookings
      WHERE
          -- all prior days = 1000s of partitions!
          ds <= '{{ ds }}'
      GROUP BY
          user_id

  12. But this is not the only example
      Another common antipattern was

      -- antipattern
      SELECT
          user_id
        , SUM(bookings) OVER (
              PARTITION BY user_id
              ORDER BY ds ROWS UNBOUNDED PRECEDING
          ) AS total_bookings_to_date
      FROM
          db.bookings
      WHERE
          ds <= '{{ ds }}'

      for cumulative sums of the number of bookings, reviews, etc.

  13. A more efficient pattern

      -- efficient pattern: sum today's data with the cumsum through yesterday
      SELECT
          u.user_id
        , SUM(u.bookings) AS c_bookings
      FROM (
          SELECT
              user_id
            , bookings
          FROM
              db.bookings
          WHERE
              ds = '{{ ds }}' -- today's data
          UNION ALL
          SELECT
              user_id
            , c_bookings AS bookings
          FROM
              cumsum.bookings_by_users
          WHERE
              ds = DATE_SUB('{{ ds }}', 1) -- cumsum through yesterday
      ) u
      GROUP BY
          u.user_id
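
      Written as a recurrence, this is the whole trick: each daily run
      reads only two partitions instead of the entire history.

      c_bookings(user, ds) = bookings(user, ds) + c_bookings(user, ds - 1)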

  14. Incremental computation as a service
      We can use a config-driven framework to compute those values. In
      this case we use HOCON, a human-friendly superset of JSON, to
      store the information we need.

      // cumsum configuration file: bookings_by_users.conf
      {
          query = """
              SELECT
                  user_id
                , bookings AS c_bookings
              FROM
                  db.bookings
              WHERE
                  ds = '{{ ds }}'
          """
          dependencies = [ { table: bookings } ]
          start_date = "2011-01-01"
      }
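
      A minimal sketch of how such a framework could consume these
      configs, assuming pyhocon as the parser; the MERGE_HQL template,
      file paths and table-naming convention are illustrative
      assumptions, not the actual framework:

      import glob
      import os
      from datetime import datetime

      from airflow import DAG
      from airflow.operators.hive_operator import HiveOperator
      from pyhocon import ConfigFactory

      # Assumed merge template: today's increment (the query from the
      # config) unioned with yesterday's cumulative partition, per slide 13.
      MERGE_HQL = """
      INSERT OVERWRITE TABLE cumsum.{name} PARTITION (ds = '{{{{ ds }}}}')
      SELECT
          user_id
        , SUM(c_bookings) AS c_bookings
      FROM (
          {query}
          UNION ALL
          SELECT user_id, c_bookings
          FROM cumsum.{name}
          WHERE ds = DATE_SUB('{{{{ ds }}}}', 1)
      ) u
      GROUP BY user_id
      """

      for path in glob.glob('configs/*.conf'):
          conf = ConfigFactory.parse_file(path)
          name = os.path.splitext(os.path.basename(path))[0]  # bookings_by_users
          dag_id = 'cumsum_{}'.format(name)
          dag = DAG(
              dag_id,
              start_date=datetime.strptime(conf['start_date'], '%Y-%m-%d'),
              schedule_interval='@daily')
          HiveOperator(
              task_id='merge',
              hql=MERGE_HQL.format(name=name, query=conf['query']),
              dag=dag)
          globals()[dag_id] = dag  # expose the DAG to the DagBag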

  15. Backfill Framework

  16. Backfilling
      Oh no, the definition of a metric used in the warehouse needs to
      change!
      We would like to:
      Test the new logic on the source data.
      Check the resulting data for issues and compare it with the old.
      Replace the old data with the new.
      And, of course, we would love to delegate as much of this as
      possible to Airflow; a sketch of that DAG shape follows.
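
      A minimal sketch of those three steps as an Airflow DAG; the
      table names, check logic and swap statement are placeholder
      assumptions, not the actual framework:

      from datetime import datetime

      from airflow import DAG
      from airflow.operators.hive_operator import HiveOperator
      from airflow.operators.python_operator import PythonOperator

      dag = DAG('backfill_bookings',
                start_date=datetime(2017, 1, 1),
                schedule_interval=None)  # triggered on demand

      # 1. Test the new logic on the source data in a staging table.
      build_staging = HiveOperator(
          task_id='build_staging',
          hql="""
          DROP TABLE IF EXISTS staging.bookings;
          CREATE TABLE staging.bookings AS
          SELECT user_id, MAX(booking_date) AS most_recent_booking_date
          FROM db.bookings
          GROUP BY user_id -- the new metric definition goes here
          """,
          dag=dag)

      # 2. Check the result and compare it with the old data.
      def compare_with_production(**context):
          """Placeholder: row counts, null rates, metric deltas vs. prod."""
          pass

      check = PythonOperator(
          task_id='check_against_prod',
          python_callable=compare_with_production,
          provide_context=True,
          dag=dag)

      # 3. Replace the old data with the new (illustrative swap).
      swap = HiveOperator(
          task_id='swap_tables',
          hql="ALTER TABLE staging.bookings RENAME TO db.bookings",
          dag=dag)

      build_staging.set_downstream(check)
      check.set_downstream(swap)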

  17. Enter the Backfill Framework

  18. Enter the Backfill Framework

  19. Experimentation Reporting Framework

  20. The Experimentation Reporting Framework
      We use Airflow to power the experimentation reporting framework
      (ERF).
      In 2014, the ERF was a large Ruby script doing dynamic SQL
      generation from a single YAML file defining metrics, sources and
      experiments, rendering several thousand lines of SQL in a single
      script. The Hive job would finish, once in a while.
      A large refactoring effort led to a rewrite in Python as several
      dynamic Airflow DAGs. It is our largest DAG by compute time and
      possibly by number of tasks (approximately 8,000 per DAG run).
      Airflow is worth having just for the ability to run our
      experiments reliably, and to debug when things go wrong.
      I would show it to you, but Airflow cannot currently render the
      full graph :)

  21. The Experimentation Reporting Framework
      [Architecture diagram: Metric Config Files, Source/Subjects
      Aggregations, Experiments Results Pipeline, Dimension Cuts
      Results Pipeline, Experiment DB, Assignments Pipeline, Exp Data]

  22. And More

  23. And More
      We use dynamic generation for more use cases:
      Operations (Canary jobs, SLA monitoring, ...; a small canary
      sketch follows this list)
      Sqoop exports of production databases
      Data Quality (automated stats collection on tables, anomaly
      detection)
      Scheduled report generation for analysts (AutoDAG)
      Dynamic SQL generation and PII stripping for anonymized tables.
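
      As a taste of the operations bucket, a canary can be as small as
      a scheduled no-op whose SLA alert fires when the scheduler falls
      behind (a generic sketch, not Airbnb's actual canary job):

      from datetime import datetime, timedelta

      from airflow import DAG
      from airflow.operators.dummy_operator import DummyOperator

      dag = DAG('canary',
                start_date=datetime(2017, 1, 1),
                schedule_interval='*/10 * * * *')  # every 10 minutes

      # The task does nothing; if it has not completed within its SLA,
      # Airflow records the miss and sends an alert.
      DummyOperator(task_id='heartbeat', sla=timedelta(minutes=15), dag=dag)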

  24. Questions?