
Using Apache Airflow as a platform for data engineering frameworks

Analysis automation and analytic services are the future of data engineering!
Apache Airflow's DSL makes it natural to build complex DAGs of tasks dynamically, and Airbnb has been leveraging this feature in intricate ways,
creating a wide array of frameworks as dynamic workflows. In this talk, we'll explain the mechanics of dynamic pipeline generation using Apache Airflow (incubating) and present advanced use cases that have been developed at Airbnb, going from simple frameworks to more complex ones.

Arthur Wiedmer

January 11, 2017

Transcript

  1. Using Apache Airflow as a platform for data engineering frameworks
     Arthur Wiedmer
     Airbnb
     Airflow Meetup, Jan 11th, 2017

  2. Contents
     Airflow @ Airbnb
     Data Engineering: From Pipelines to Frameworks
     Incremental compute framework
     Backfill Framework
     Experimentation Framework
     And more

  3. Quick Intro
     I work on the Data Platform team, which is responsible both for
     maintaining data infrastructure (clusters, Hadoop, HBase, Hive,
     Spark and, of course, Airflow) and for the data pipelines powering
     the core business metrics in our data warehouse.
     I started working on Airflow when I joined Airbnb in 2014. I have
     since worked both on improving Airflow via open-source PRs and on
     contributing to internal tools and frameworks.

  4. Airflow @ Airbnb

  5. Airflow @ Airbnb
     Airflow currently runs around 800 DAGs, and between 40k and 80k+
     tasks a day.
     We have monthly, daily, hourly and 10-minute granularity DAGs.
     About 100 people have authored or contributed to a DAG. Between 400
     and 500 have contributed to or modified a framework config.
     We use the CeleryExecutor with Redis as a backend.

  6. Data Engineering Frameworks

  7. Airflow is dynamic
     Airflow's DSL on top of Python makes it fairly flexible for
     dynamic DAGs:
     You can create tasks dynamically.

     # db_name and TABLES are assumed to be defined earlier in the file
     for table_name in TABLES:
         HiveOperator(
             task_id=table_name + '_curr',
             # {{{{ ds }}}} survives .format() as {{ ds }}, which
             # Airflow's Jinja templating then fills in at run time
             hql="""
                 DROP TABLE IF EXISTS {db_name}.{table_name}_curr;
                 CREATE TABLE {db_name}.{table_name}_curr AS
                 SELECT * FROM {db_name}.{table_name}
                 WHERE ds = '{{{{ ds }}}}';
             """.format(**locals()),
             dag=dag)

  8. Airflow is dynamic
     Airflow's DSL on top of Python makes it fairly flexible for
     dynamic DAGs:
     You can even create DAGs dynamically.

     for i in range(10):
         dag_id = 'foo_{}'.format(i)
         globals()[dag_id] = DAG(dag_id)

     Or, better, create a DAG factory:

     from airflow import DAG
     # ^ Important: the DagBag ignores files that do not import DAG

     for c in confs:
         dag_id = '{}_dag'.format(c["job_name"])
         backfill_dag = BackfillDAGFactory.build_dag(dag_id, conf=c)
         # register each DAG at module level so the DagBag finds all of
         # them, not just the last loop iteration's
         globals()[dag_id] = backfill_dag
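
     The factory itself is not shown on the slide. A minimal sketch of
     what a build_dag classmethod could look like, assuming a config
     dict like the ones shown later (the class name comes from the
     slide; everything inside it is an assumption, not Airbnb's
     implementation):

     from datetime import datetime

     from airflow import DAG
     from airflow.operators.dummy_operator import DummyOperator


     class BackfillDAGFactory(object):
         """Hypothetical sketch of a config-to-DAG factory."""

         @classmethod
         def build_dag(cls, dag_id, conf):
             dag = DAG(
                 dag_id,
                 start_date=datetime.strptime(conf['start_date'], '%Y-%m-%d'),
                 schedule_interval='@daily')
             # A real factory would translate the rest of the config into
             # a graph of operators; a placeholder task stands in here.
             DummyOperator(task_id='placeholder', dag=dag)
             return dag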

  9. We can automate away (part of) our
    jobs and write frameworks to write
    pipelines!

  10. The Incremental Computation Framework

  11. Overcoming antipatterns in the warehouse
      A data scientist wants to know the last time a user booked with
      us. He writes:

      -- antipattern
      SELECT
          user_id
        , MAX(booking_date) AS most_recent_booking_date
      FROM
          db.bookings
      WHERE
          -- all prior days = 1000s of partitions!
          ds <= '{{ ds }}'
      GROUP BY
          user_id

  12. But this is not the only example
      Another common antipattern was

      -- antipattern
      SELECT
          user_id
        , SUM(bookings) OVER (
              PARTITION BY user_id
              ORDER BY ds ROWS UNBOUNDED PRECEDING
          ) AS total_bookings_to_date
      FROM
          db.bookings
      WHERE
          ds <= '{{ ds }}'

      for cumulative sums of the number of bookings, reviews, etc.

  13. A more efficient pattern

      -- efficient pattern: sum today's data with the cumsum through yesterday
      SELECT
          u.user_id
        , SUM(u.bookings) AS c_bookings
      FROM (
          SELECT
              user_id
            , bookings
          FROM
              db.bookings
          WHERE
              ds = '{{ ds }}' -- today's data
          UNION ALL
          SELECT
              user_id
            , c_bookings AS bookings
          FROM
              cumsum.bookings_by_users
          WHERE
              ds = DATE_SUB('{{ ds }}', 1) -- cumsum through yesterday
      ) u
      GROUP BY
          u.user_id
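
      Written as a recurrence, this is the whole trick: each daily run
      reads only two partitions instead of the entire history.

      c_bookings(user, ds) = bookings(user, ds) + c_bookings(user, ds - 1)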

  14. Incremental computation as a service
      We can use a config-driven framework to compute those values. In
      this case we use HOCON, a human-friendly superset of JSON, to
      store the information we need.

      // cumsum configuration file: bookings_by_users.conf
      {
          query = """
              SELECT
                  user_id
                , bookings AS c_bookings
              FROM
                  db.bookings
              WHERE
                  ds = '{{ ds }}'
          """
          dependencies = [ { table: bookings } ]
          start_date = "2011-01-01"
      }
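
      A minimal sketch of how such a framework could consume these
      configs, assuming pyhocon as the parser; the MERGE_HQL template,
      file paths and table-naming convention are illustrative
      assumptions, not the actual framework:

      import glob
      import os
      from datetime import datetime

      from airflow import DAG
      from airflow.operators.hive_operator import HiveOperator
      from pyhocon import ConfigFactory

      # Assumed merge template: today's increment (the query from the
      # config) unioned with yesterday's cumulative partition, per slide 13.
      MERGE_HQL = """
      INSERT OVERWRITE TABLE cumsum.{name} PARTITION (ds = '{{{{ ds }}}}')
      SELECT
          user_id
        , SUM(c_bookings) AS c_bookings
      FROM (
          {query}
          UNION ALL
          SELECT user_id, c_bookings
          FROM cumsum.{name}
          WHERE ds = DATE_SUB('{{{{ ds }}}}', 1)
      ) u
      GROUP BY user_id
      """

      for path in glob.glob('configs/*.conf'):
          conf = ConfigFactory.parse_file(path)
          name = os.path.splitext(os.path.basename(path))[0]  # bookings_by_users
          dag_id = 'cumsum_{}'.format(name)
          dag = DAG(
              dag_id,
              start_date=datetime.strptime(conf['start_date'], '%Y-%m-%d'),
              schedule_interval='@daily')
          HiveOperator(
              task_id='merge',
              hql=MERGE_HQL.format(name=name, query=conf['query']),
              dag=dag)
          globals()[dag_id] = dag  # expose the DAG to the DagBag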

  15. Backfill Framework

  16. Backfilling
      Oh no, the definition of a metric used in the warehouse needs to
      change!
      We would like to:
      Test the new logic on the source data.
      Check the resulting data for issues and compare it with the old.
      Replace the old data with the new.
      And, of course, we would love to delegate as much of this as
      possible to Airflow; a sketch of that DAG shape follows.
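
      A minimal sketch of those three steps as an Airflow DAG; the
      table names, check logic and swap statement are placeholder
      assumptions, not the actual framework:

      from datetime import datetime

      from airflow import DAG
      from airflow.operators.hive_operator import HiveOperator
      from airflow.operators.python_operator import PythonOperator

      dag = DAG('backfill_bookings',
                start_date=datetime(2017, 1, 1),
                schedule_interval=None)  # triggered on demand

      # 1. Test the new logic on the source data in a staging table.
      build_staging = HiveOperator(
          task_id='build_staging',
          hql="""
          DROP TABLE IF EXISTS staging.bookings;
          CREATE TABLE staging.bookings AS
          SELECT user_id, MAX(booking_date) AS most_recent_booking_date
          FROM db.bookings
          GROUP BY user_id -- the new metric definition goes here
          """,
          dag=dag)

      # 2. Check the result and compare it with the old data.
      def compare_with_production(**context):
          """Placeholder: row counts, null rates, metric deltas vs. prod."""
          pass

      check = PythonOperator(
          task_id='check_against_prod',
          python_callable=compare_with_production,
          provide_context=True,
          dag=dag)

      # 3. Replace the old data with the new (illustrative swap).
      swap = HiveOperator(
          task_id='swap_tables',
          hql="ALTER TABLE staging.bookings RENAME TO db.bookings",
          dag=dag)

      build_staging.set_downstream(check)
      check.set_downstream(swap)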

  17. Enter the Backfill Framework

  18. Enter the Backfill Framework

  19. Experimentation Reporting Framework

  20. The Experimentation Reporting Framework
      We use Airflow to power the experimentation reporting framework
      (ERF).
      In 2014, the ERF was a large Ruby script doing dynamic SQL
      generation from a single YAML file defining metrics, sources and
      experiments, rendering several thousand lines of SQL in a single
      script. The Hive job would finish, once in a while.
      A large refactoring effort led to a rewrite in Python as several
      dynamic Airflow DAGs. It is our largest DAG by compute time and
      possibly by number of tasks (approximately 8,000 per DAG run).
      Airflow is worth having just for the ability to run our
      experiments reliably, and to debug when things go wrong.
      I would show it to you, but Airflow cannot currently render the
      full graph :)

  21. The Experimentation Reporting Framework
      [Architecture diagram: Metric Config Files, Source/Subjects
      Aggregations, Experiments Results Pipeline, Dimension Cuts
      Results Pipeline, Experiment DB, Assignments Pipeline, Exp Data]

  22. And More

  23. And More
      We use dynamic generation for more use cases:
      Operations (Canary jobs, SLA monitoring, ...; a small canary
      sketch follows this list)
      Sqoop exports of production databases
      Data Quality (automated stats collection on tables, anomaly
      detection)
      Scheduled report generation for analysts (AutoDAG)
      Dynamic SQL generation and PII stripping for anonymized tables.
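
      As a taste of the operations bucket, a canary can be as small as
      a scheduled no-op whose SLA alert fires when the scheduler falls
      behind (a generic sketch, not Airbnb's actual canary job):

      from datetime import datetime, timedelta

      from airflow import DAG
      from airflow.operators.dummy_operator import DummyOperator

      dag = DAG('canary',
                start_date=datetime(2017, 1, 1),
                schedule_interval='*/10 * * * *')  # every 10 minutes

      # The task does nothing; if it has not completed within its SLA,
      # Airflow records the miss and sends an alert.
      DummyOperator(task_id='heartbeat', sla=timedelta(minutes=15), dag=dag)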

  24. Questions?