
Large-Scale ETL Data Flows With Data Pipeline and Dataduct

Sourabh
October 11, 2015


As data volumes grow, managing and scaling data pipelines for ETL and batch processing can be daunting. With more than 13.5 million learners worldwide, hundreds of courses, and thousands of instructors, Coursera manages over a hundred data pipelines for ETL, batch processing, and new product development.

In this session, we dive deep into AWS Data Pipeline and Dataduct, an open source framework built at Coursera to manage pipelines and create reusable patterns to expedite developer productivity. We share the lessons learned during our journey: from basic ETL processes, such as loading data from Amazon RDS to Amazon Redshift, to more sophisticated pipelines to power recommendation engines and search services.


Transcript

  1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Sourabh Bajaj, Software Engineer, Coursera
    October 2015
    BDT404
    Large-Scale ETL Data Flows
    With Data Pipeline and Dataduct


  2. What to Expect from the Session
    ● Learn about:
    • How we use AWS Data Pipeline to manage ETL at Coursera
    • Why we built Dataduct, an open source framework from
    Coursera for running pipelines
    • How Dataduct enables developers to write their own ETL
    pipelines for their services
    • Best practices for managing pipelines
    • How to start using Dataduct for your own pipelines


  3. Education at Scale
    15 million learners worldwide
    1300 courses
    120 partners
    2.5 million course completions


  4. Data Warehousing at Coursera (Amazon Redshift)
    167 Amazon Redshift users
    1200 EDW tables
    22 source systems
    6 dc1.8xlarge instances
    30,000,000 queries run


  5. Data Flow
    [Architecture diagram: Amazon RDS, Cassandra, and event feeds land in
    Amazon S3; Amazon EC2 and Amazon EMR process the data into Amazon
    Redshift and back out to Amazon RDS / Cassandra for BI applications
    and third-party tools, all orchestrated by AWS Data Pipeline.]


  10. ETL at Coursera
    150 Active pipelines 44 Dataduct developers


  11. Requirements for an ETL system
    • Fault tolerance
    • Scheduling
    • Dependency management
    • Resource management
    • Monitoring
    • Easy development


  19. Dataduct
    ● Open source wrapper around AWS Data Pipeline
    ● It provides:
    • Code reuse
    • Extensibility
    • Command line interface
    • Staging environment support
    • Dynamic updates


  20. Dataduct
    ● Repository
    • https://github.com/coursera/dataduct
    ● Documentation
    • http://dataduct.readthedocs.org/en/latest/
    ● Installation
    • pip install dataduct


  21. Let’s build some pipelines


  22. Pipeline 1: Amazon RDS → Amazon Redshift
    ● Let’s start with a simple pipeline that pulls data from a
    relational store into Amazon Redshift
    [Diagram: Amazon RDS → Amazon EC2 → Amazon S3 → Amazon Redshift,
    orchestrated by AWS Data Pipeline]


  23. Pipeline 1: Amazon RDS → Amazon Redshift


  24. ● Definition in YAML
    ● Steps
    ● Shared Config
    ● Visualization
    ● Overrides
    ● Reusable code
    Pipeline 1: Amazon RDS → Amazon Redshift


  25. Pipeline 1: Amazon RDS → Amazon Redshift
    (Steps)
    ● Extract RDS
    • Fetch data from Amazon RDS and output to Amazon S3


  26. Pipeline 1: Amazon RDS → Amazon Redshift
    (Steps)
    ● Create-Load-Redshift
    • Create table if it doesn’t exist and load data using COPY.


  27. Pipeline 1: Amazon RDS → Amazon Redshift
    ● Upsert
    • Update and insert into the production table from staging.
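Taken together, the extract-rds, create-load-redshift, and upsert steps above could be expressed in a Dataduct YAML definition along these lines. This is a sketch only: the step names follow the slides, but the field names and values are illustrative assumptions, not a verified schema (see the Dataduct docs for the exact format).

```yaml
# Illustrative sketch of a Dataduct pipeline definition.
# Field names and values are assumptions; consult the Dataduct
# documentation for the exact schema.
name: users_rds_to_redshift
frequency: daily
load_time: 01:00                 # HH:MM in UTC

steps:
-   step_type: extract-rds       # fetch from Amazon RDS, output to S3
    host_name: userdb            # hypothetical RDS host alias
    sql: SELECT id, email, created_at FROM users;

-   step_type: create-load-redshift   # CREATE TABLE IF NOT EXISTS + COPY
    table_definition: tables/staging.users.sql

-   step_type: upsert            # update/insert production from staging
    source: tables/staging.users.sql
    destination: tables/prod.users.sql
```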


  28. Pipeline 1: Amazon RDS → Amazon Redshift
    (Tasks)
    ● Bootstrap
    • Fully automated
    • Fetch the latest binaries from Amazon S3 for Amazon EC2 /
    Amazon EMR
    • Install any updated dependencies on the resource
    • Make sure the pipeline runs the latest version of the code


  29. Pipeline 1: Amazon RDS → Amazon Redshift
    ● Quality assurance
    • Primary key violations in the warehouse
    • Dropped rows: detected by comparing row counts
    • Corrupted rows: detected by comparing a sample set of rows
    • Checks run automatically within UPSERT


  30. Pipeline 1: Amazon RDS → Amazon Redshift
    ● Teardown
    • Amazon SNS alerting for failed tasks
    • Logging of task failures
    • Monitoring
    • Run times
    • Retries
    • Machine health


  31. Pipeline 1: Amazon RDS → Amazon Redshift
    ● Visualization
    • Automatically generated by Dataduct
    • Allows easy debugging


  32. Pipeline 1: Amazon RDS → Amazon Redshift
    ● Shared Config
    • IAM roles
    • AMI
    • Security group
    • Retries
    • Custom steps
    • Resource paths
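The shared config itself is a YAML file. A minimal sketch, assuming illustrative key names rather than the exact Dataduct schema, might look like:

```yaml
# Illustrative sketch of a shared Dataduct config; key names are
# assumptions, not the verified schema.
etl:
  ROLE: DataPipelineDefaultRole              # IAM roles
  RESOURCE_ROLE: DataPipelineDefaultResourceRole
  S3_ETL_BUCKET: my-etl-bucket               # resource paths
  RETRIES: 2

ec2:
  INSTANCE_TYPE: m1.large
  SECURITY_GROUP: etl-security-group

custom_steps:
-   step_type: aegisthus                     # hypothetical custom step
    file_path: steps/aegisthus.py            # registered via the config
    class_name: AegisthusStep
```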


  33. Pipeline 1: Amazon RDS → Amazon Redshift
    ● Custom steps
    • Open-sourced steps can easily be shared across multiple
    pipelines
    • You can also create new steps and add them using the config


  34. Deploying a pipeline
    ● Command line interface for all operations
    usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b]
                                      pipeline_definitions
                                      [pipeline_definitions ...]


  35. Pipeline 2: Cassandra → Amazon Redshift
    [Diagram: Cassandra backups in Amazon S3 → Amazon EMR (Aegisthus) →
    Amazon S3 → Amazon EMR (Scalding) → Amazon Redshift, orchestrated by
    AWS Data Pipeline]



  39. ● Shell command activity to the rescue
    ● Priam backups of Cassandra to Amazon S3
    ● Aegisthus to parse SSTables into Avro dumps
    ● Scalding to process the Aegisthus output
    • Extend the base steps to create more patterns
    Pipeline 2: Cassandra → Amazon Redshift
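The steps above might be wired together in a definition like the following. This is a sketch: it assumes aegisthus and scalding custom steps registered in the shared config, and the field names are illustrative, not a verified schema.

```yaml
# Illustrative sketch; 'aegisthus' and 'scalding' are assumed to be
# custom steps registered in the shared config.
name: cassandra_to_redshift
frequency: daily

steps:
-   step_type: aegisthus               # parse SSTable backups into Avro
    input_path: s3://my-backups/cassandra/users/

-   step_type: scalding                # process the Aegisthus output
    job_class: org.example.UsersAggregationJob

-   step_type: create-load-redshift
    table_definition: tables/staging.users.sql

-   step_type: upsert
    source: tables/staging.users.sql
    destination: tables/prod.users.sql
```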


  40. Pipeline 2: Cassandra → Amazon Redshift


  41. ● Custom steps
    • Aegisthus
    • Scalding
    Pipeline 2: Cassandra → Amazon Redshift


  42. ● EMR-Config overrides the defaults
    Pipeline 2: Cassandra → Amazon Redshift


  43. ● Multiple output nodes from the transform step
    Pipeline 2: Cassandra → Amazon Redshift


  44. ● Bootstrap
    • Save every pipeline definition
    • Fetch the new JARs for the Amazon EMR jobs
    • Specify the same Hadoop / Hive metastore installation
    Pipeline 2: Cassandra → Amazon Redshift


  45. Data products
    ● We’ve talked about getting data into the warehouse
    ● Common pattern:
    • Wait for dependencies
    • Computation inside Amazon Redshift to create derived tables
    • Amazon EMR activities for more complex processing
    • Load back into MySQL / Cassandra
    • Product features query MySQL / Cassandra
    ● Used in recommendations, dashboards, and search
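That common pattern could be sketched as a definition like this (step and field names here, including pipeline-dependencies, sql-command, and load-mysql, are illustrative assumptions, not the exact Dataduct schema):

```yaml
# Illustrative sketch of the data-product pattern; step and field
# names are assumptions.
name: course_recommendations
frequency: daily

steps:
-   step_type: pipeline-dependencies   # wait for upstream pipelines
    dependent_pipelines:
    - users_rds_to_redshift

-   step_type: sql-command             # derived table inside Redshift
    command: >
      INSERT INTO derived.co_enrollments
      SELECT a.course_id, b.course_id, COUNT(*)
      FROM enrollments a JOIN enrollments b ON a.user_id = b.user_id
      GROUP BY 1, 2;

-   step_type: load-mysql              # push results back to the
    table: recommendations             # serving store
```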


  46. Recommendations
    ● Objective
    • Connect the learner to the right content
    ● Use cases:
    • Recommendation emails
    • Course discovery
    • Reactivating users


  47. Recommendations
    ● Computation inside Amazon Redshift to create derived
    tables for co-enrollments
    ● Amazon EMR job for model training
    ● Model file pushed to Amazon S3
    ● Prediction API uses the updated model file
    ● The contract between the prediction and training layers is
    the model definition


  48. Internal Dashboard
    ● Objective
    • Serve internal dashboards to create a data-driven culture
    ● Use cases
    • Key performance indicators for the company
    • Track results for different A/B experiments


  49. Internal Dashboard



  54. Do:
    • Monitoring (run times, retries, deploys, query times)
    • Code should live in a library instead of scripts being passed to
    every pipeline
    • The test environment in staging should mimic prod
    • A shared library to democratize writing ETL
    • Use read replicas and backups
    Learnings



  58. Don’t:
    • The same code passed to multiple pipelines as a script
    • Non-version-controlled pipelines
    • Really huge pipelines instead of small, modular pipelines with
    dependencies
    • Not catching resource timeouts or load delays
    Learnings


  59. Dataduct
    ● Code reuse
    ● Extensibility
    ● Command line interface
    ● Staging environment support
    ● Dynamic updates


  60. Dataduct
    ● Repository
    • https://github.com/coursera/dataduct
    ● Documentation
    • http://dataduct.readthedocs.org/en/latest/
    ● Installation
    • pip install dataduct


  61. Questions?
    Also, we are hiring!
    https://www.coursera.org/jobs


  62. Remember to complete
    your evaluations!


  63. Thank you!
    Also, we are hiring!
    https://www.coursera.org/jobs
    Sourabh Bajaj
    sb2nov
    @sb2nov
