Using Apache Airflow as a platform for data engineering frameworks

Analysis automation and analytic services are the future of data engineering!
Apache Airflow's DSL makes it natural to build complex DAGs of tasks dynamically, and Airbnb has been leveraging this feature in intricate ways,
creating a wide array of frameworks as dynamic workflows. In this talk, we'll explain the mechanics of dynamic pipeline generation using Apache Airflow (incubating) and present advanced use cases that have been developed at Airbnb, going from simple frameworks to more complex ones.

Arthur Wiedmer

January 11, 2017


  1. Using Apache Airflow as a platform for data engineering frameworks

    Arthur Wiedmer, Airbnb. Airflow Meetup, Jan 11th, 2017.
  2. Contents

    Airflow @ Airbnb; Data Engineering: From Pipelines to Frameworks; Incremental compute framework; Backfill Framework; Experimentation Framework; And more.
  3. Quick Intro

    I work on the Data Platform team, which is responsible for maintaining both the data infrastructure (clusters, Hadoop, HBase, Hive, Spark and of course Airflow) and the data pipelines powering the core business metrics in our data warehouse. I started working on Airflow when I joined Airbnb in 2014. I have since worked both on improving Airflow via open source PRs and on contributing to internal tools and frameworks.
  4. Airflow @ Airbnb

    Airflow currently runs 800+ DAGs, and between 40k and 80k+ tasks a day. We have monthly, daily, hourly and 10-minute granularity DAGs. About 100 people have authored or contributed to a DAG. Between 400 and 500 have contributed or modified a config for a framework. We use the CeleryExecutor with Redis as a backend.
  5. Airflow is dynamic

    Airflow's DSL on top of Python makes it fairly flexible for dynamic DAGs: you can create tasks dynamically.

        for table_name in TABLES:
            HiveOperator(
                task_id=table_name + '_curr',
                hql="""
                    DROP TABLE IF EXISTS {db_name}.{table_name}_curr;
                    CREATE TABLE {db_name}.{table_name}_curr AS
                    SELECT * FROM {db_name}.{table_name}
                    WHERE ds = '{ds}';
                """.format(**locals()),
                dag=dag)
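The loop on this slide relies on `format(**locals())`, so `db_name` and `ds` must already exist as local variables. A minimal sketch of the same templating step with explicit arguments is shown here; it deliberately leaves out `HiveOperator` so that it runs without Airflow installed. The table list, database name, and the `build_task_specs` helper are assumptions made for illustration, not part of the talk.

```python
# Hypothetical table list and database name, for illustration only.
TABLES = ["bookings", "reviews"]
DB_NAME = "db"

HQL_TEMPLATE = """
DROP TABLE IF EXISTS {db_name}.{table_name}_curr;
CREATE TABLE {db_name}.{table_name}_curr AS
SELECT * FROM {db_name}.{table_name}
WHERE ds = '{ds}';
"""


def build_task_specs(tables, db_name, ds):
    """Return one (task_id, hql) pair per table, naming every format field explicitly."""
    specs = []
    for table_name in tables:
        hql = HQL_TEMPLATE.format(db_name=db_name, table_name=table_name, ds=ds)
        specs.append((table_name + "_curr", hql))
    return specs


# Passing the literal string "{{ ds }}" leaves a Jinja placeholder in the HQL,
# because str.format does not re-parse the values it substitutes.
specs = build_task_specs(TABLES, DB_NAME, "{{ ds }}")
for task_id, hql in specs:
    print(task_id)
```

In a real DAG file, each `(task_id, hql)` pair would presumably be handed to an operator such as the `HiveOperator` shown on the slide.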
  6. Airflow is dynamic

    Airflow's DSL on top of Python makes it fairly flexible for dynamic DAGs: you can even create DAGs dynamically.

        for i in range(10):
            dag_id = 'foo_{}'.format(i)
            globals()[dag_id] = DAG(dag_id)

    Or better, create a DAG factory:

        from airflow import DAG
        # ^ Important because the dagbag will ignore files that do not import DAG

        for c in confs:
            dag_id = '{}_dag'.format(c["job_name"])
            backfill_dag = BackfillDAGFactory.build_dag(dag_id, conf=c)
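The factory idea can be sketched without Airflow by using a stand-in object. Everything below is an assumption for illustration: `StubDAG` replaces `airflow.DAG`, and `build_dag` and `register_dags` are hypothetical helpers, not the `BackfillDAGFactory` from the talk. The sketch binds each generated DAG to its own name, as the `globals()[dag_id]` loop on the slide does, on the understanding that Airflow discovers DAG objects through module-level names.

```python
from dataclasses import dataclass, field


@dataclass
class StubDAG:
    """Stand-in for airflow.DAG so the sketch runs without Airflow installed."""
    dag_id: str
    params: dict = field(default_factory=dict)


def build_dag(dag_id, conf):
    """Hypothetical factory: turn one config dict into one DAG-like object."""
    return StubDAG(dag_id=dag_id, params=dict(conf))


def register_dags(confs, namespace):
    """Bind each generated DAG to a distinct name derived from its config."""
    for conf in confs:
        dag_id = "{}_dag".format(conf["job_name"])
        namespace[dag_id] = build_dag(dag_id, conf)
    return namespace


# Hypothetical configs; in a DAG file the namespace would likely be globals().
confs = [{"job_name": "bookings"}, {"job_name": "reviews"}]
registered = register_dags(confs, {})
print(sorted(registered))
```

Keeping the factory free of side effects and doing the registration in one place makes it easy to check that no two configs collide on the same `dag_id`.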
  7. We can automate away (part of) our jobs and write frameworks to write pipelines!
  8. Overcoming antipatterns in the warehouse

    A data scientist wants to know when a user last booked with us. He writes:

        -- antipattern
        SELECT user_id
             , MAX(booking_date) AS most_recent_booking_date
        FROM db.bookings
        WHERE -- all prior days = 1000s of partitions!
            ds <= '{{ ds }}'
        GROUP BY user_id
  9. But this is not the only example

    Another common antipattern, used for cumulative sums on the number of bookings, reviews, etc., was:

        -- antipattern
        SELECT user_id
             , SUM(bookings) OVER (
                   PARTITION BY user_id
                   ORDER BY ds
                   ROWS UNBOUNDED PRECEDING) AS total_bookings_to_date
        FROM db.bookings
        WHERE ds <= '{{ ds }}'
  10. A more efficient pattern

        -- efficient pattern: sum over today's data and cumsum through yesterday
        SELECT u.user_id
             , SUM(u.bookings) AS c_bookings
        FROM (
            SELECT user_id
                 , bookings
            FROM db.bookings
            WHERE ds = '{{ ds }}' -- today's data
            UNION ALL
            SELECT user_id
                 , c_bookings AS bookings
            FROM cumsum.bookings_by_users
            WHERE ds = DATE_SUB('{{ ds }}', 1) -- cumsum through yesterday
        ) u
        GROUP BY u.user_id
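To make the equivalence concrete, here is a small Python sketch that checks the incremental pattern against a full scan on made-up data. The partitions and both helper functions are assumptions for illustration; they model the SQL in memory rather than running it.

```python
from collections import Counter

# Hypothetical daily partitions: ds -> list of (user_id, bookings) rows.
PARTITIONS = {
    "2017-01-09": [("u1", 1), ("u2", 2)],
    "2017-01-10": [("u1", 3)],
    "2017-01-11": [("u2", 1), ("u3", 4)],
}


def full_scan_cumsum(partitions, ds):
    """Antipattern: re-read every partition up to and including ds."""
    totals = Counter()
    for day, rows in partitions.items():
        if day <= ds:
            for user_id, bookings in rows:
                totals[user_id] += bookings
    return dict(totals)


def incremental_cumsum(yesterday_cumsum, today_rows):
    """Efficient pattern: combine yesterday's totals with today's rows, then sum."""
    totals = Counter(yesterday_cumsum)
    for user_id, bookings in today_rows:
        totals[user_id] += bookings
    return dict(totals)


cumsum = {}
for ds in sorted(PARTITIONS):
    # Each day reads only one partition plus the previous day's result.
    cumsum = incremental_cumsum(cumsum, PARTITIONS[ds])
    assert cumsum == full_scan_cumsum(PARTITIONS, ds)

print(cumsum)
```

The incremental version touches two partitions per run regardless of history length, at the cost of each day depending on the previous day's output having been computed.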
  11. Incremental computation as a service

    We can use a config-driven framework to compute those values. In this case we use HOCON, a more human-friendly superset of JSON, to store the information we need.

        // cumsum configuration file: bookings_by_users.conf
        {
          query = """
            SELECT user_id
                 , bookings AS c_bookings
            FROM db.bookings
            WHERE ds = '{{ ds }}'
          """
          dependencies = [
            { table: bookings }
          ]
          start_date = "2011-01-01"
        }
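The slides do not show how the framework expands such a config, so the following is only a guess at the general shape: a template wraps the user's query in the incremental pattern from the previous slide. The config is represented as an already-parsed Python dict, and `CUMSUM_TEMPLATE`, `render_cumsum_sql`, and the `key`, `value`, and `cumsum_db` parameters are all hypothetical names.

```python
# Assumed result of parsing bookings_by_users.conf into a dict.
CONF = {
    "query": """
        SELECT user_id, bookings AS c_bookings
        FROM db.bookings
        WHERE ds = '{{ ds }}'
    """,
    "dependencies": [{"table": "bookings"}],
    "start_date": "2011-01-01",
}

# Quadrupled braces survive str.format as the Jinja placeholder {{ ds }}.
CUMSUM_TEMPLATE = """\
SELECT u.{key}, SUM(u.{value}) AS {value}
FROM (
  {today_query}
  UNION ALL
  SELECT {key}, {value} FROM {cumsum_db}.{name}
  WHERE ds = DATE_SUB('{{{{ ds }}}}', 1)
) u
GROUP BY u.{key}"""


def render_cumsum_sql(name, conf, key="user_id", value="c_bookings", cumsum_db="cumsum"):
    """Wrap the configured daily query in a 'today plus yesterday's cumsum' query."""
    today_query = " ".join(conf["query"].split())  # collapse whitespace
    return CUMSUM_TEMPLATE.format(
        key=key, value=value, today_query=today_query,
        cumsum_db=cumsum_db, name=name,
    )


sql = render_cumsum_sql("bookings_by_users", CONF)
print(sql)
```

With this split, the config author only writes the daily query, while the framework owns the cumulative logic and the table naming.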
  12. Backfilling

    Oh no, the definition of a metric used in the warehouse needs to change! We would like to: test the new logic on the source data; check the resulting data for issues and compare it with the old; replace the old data with the new. And, of course, we would love to delegate as much of this as possible to Airflow.
  13. The Experimentation Reporting Framework

    We use Airflow to power the experimentation reporting framework (ERF). In 2014, the ERF was a large Ruby script doing dynamic SQL generation from a single YAML file defining metrics, sources and experiments, rendering several thousand lines of SQL in a single script. The Hive job would finish, once in a while. A large refactoring effort led to a rewrite in Python, as several dynamic Airflow DAGs. It is our largest DAG by compute time and possibly by number of tasks (approximately 8000 per DAG run). Airflow is worth having just for the ability to run our experiments reliably, and to debug when things go wrong. I would show it to you, but Airflow cannot currently render the full graph :)
  14. The Experimentation Reporting Framework

    [Diagram. Labels: Metric Config Files, Source/Subjects, Aggregations, Experiments, Results Pipeline, Dimension Cuts, Results Pipeline, Experiment DB, Assignments Pipeline, Exp Data]
  15. And More

    We use dynamic generation for more use cases: operations (canary jobs, SLA monitoring); Sqoop exports of production databases; data quality (automated stats collection on tables, anomaly detection); scheduled report generation for analysts (AutoDAG); dynamic SQL generation and PII stripping for anonymized tables.