Slide 1

BDT404: Large-Scale ETL Data Flows With Data Pipeline and Dataduct
Sourabh Bajaj, Software Engineer, Coursera
October 2015
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Slide 2

What to Expect from the Session
● Learn about:
• How we use AWS Data Pipeline to manage ETL at Coursera
• Why we built Dataduct, an open source framework from Coursera for running pipelines
• How Dataduct enables developers to write their own ETL pipelines for their services
• Best practices for managing pipelines
• How to start using Dataduct for your own pipelines

Slide 3

Coursera

Slide 6

Education at Scale
● 15 million learners worldwide
● 120 partners
● 1,300 courses
● 2.5 million course completions

Slide 7

Data Warehousing at Coursera (Amazon Redshift)
● 167 Amazon Redshift users
● 1,200 EDW tables
● 22 source systems
● 6 dc1.8xlarge instances
● 30,000,000 queries run

Slide 8

Data Flow (architecture diagram): AWS Data Pipeline moves data from Amazon RDS, Cassandra, and event feeds through Amazon S3, Amazon EC2, and Amazon EMR into Amazon Redshift, which serves BI applications and third-party tools.

Slide 13

ETL at Coursera
● 150 active pipelines
● 44 Dataduct developers

Slide 14

Requirements for an ETL system
● Fault Tolerance
● Scheduling
● Dependency Management
● Resource Management
● Monitoring
● Easy Development

Slide 21

Dataduct

Slide 22

Dataduct
● An open source wrapper around AWS Data Pipeline

Slide 23

Dataduct
● An open source wrapper around AWS Data Pipeline
● It provides:
• Code reuse
• Extensibility
• Command line interface
• Staging environment support
• Dynamic updates

Slide 24

Dataduct
● Repository: https://github.com/coursera/dataduct
● Documentation: http://dataduct.readthedocs.org/en/latest/
● Installation: pip install dataduct

Slide 25

Let’s build some pipelines

Slide 26

Pipeline 1: Amazon RDS → Amazon Redshift
● Let's start with a simple pipeline: pulling data from a relational store into Amazon Redshift
● Services involved: Amazon RDS, Amazon S3, Amazon EC2, Amazon Redshift, and AWS Data Pipeline

Slide 27

Pipeline 1: Amazon RDS → Amazon Redshift

Slide 28

Pipeline 1: Amazon RDS → Amazon Redshift
● Definition in YAML (sketched below)
● Steps
● Shared Config
● Visualization
● Overrides
● Reusable code

Slide 29

Pipeline 1: Amazon RDS → Amazon Redshift (Steps)
● Extract RDS
• Fetch data from Amazon RDS and output to Amazon S3
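
A sketch of the extract step in the definition; the host alias, database, and SQL are hypothetical, and the field names are assumptions modeled on the Dataduct docs.

    -   step_type: extract-rds
        host_name: courses_db            # hypothetical alias for the RDS host, resolved via config
        database: courses                # hypothetical source database
        sql: |
            SELECT id, title, created_at
            FROM courses;
        # the extracted rows are written to Amazon S3 for the next step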

Slide 30

Pipeline 1: Amazon RDS → Amazon Redshift (Steps)
● Create-Load-Redshift
• Create the table if it doesn't exist and load data using COPY
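
A sketch of the corresponding load step, assuming a table definition file checked into the repository; the path and key names are illustrative assumptions.

    -   step_type: create-load-redshift
        name: load_staging_courses                     # hypothetical step name
        table_definition: tables/staging.courses.sql   # CREATE TABLE IF NOT EXISTS ...
        # issues a Redshift COPY from the Amazon S3 output of the extract step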

Slide 31

Pipeline 1: Amazon RDS → Amazon Redshift (Steps)
● Upsert
• Update and insert into the production table from staging
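
And a sketch of the upsert from staging into production; the table definition paths are hypothetical and the key names assumed.

    -   step_type: upsert
        source: tables/staging.courses.sql        # staging table loaded above
        destination: tables/prod.courses.sql      # production table to update / insert into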

Slide 32

Pipeline 1: Amazon RDS → Amazon Redshift (Tasks)
● Bootstrap
• Fully automated
• Fetch the latest binaries from Amazon S3 for Amazon EC2 / Amazon EMR
• Install any updated dependencies on the resource
• Ensure the pipeline runs the latest version of the code
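
Conceptually the bootstrap behaves like a shared prologue run on each resource before any step. The sketch below is only an assumption of how such a prologue could be expressed with a transform-style step in the shared config; it is not necessarily Dataduct's actual bootstrap mechanism, and the bucket and commands are placeholders.

    bootstrap:
    -   step_type: transform
        command: |
            aws s3 cp s3://example-etl-bucket/bin/latest.tar.gz . &&
            pip install --upgrade dataduct
        # hypothetical: refresh binaries and dependencies before any pipeline step runs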

Slide 33

Pipeline 1: Amazon RDS → Amazon Redshift
● Quality assurance
• Primary key violations in the warehouse
• Dropped rows: detected by comparing row counts
• Corrupted rows: detected by comparing a sample set of rows
• Run automatically as part of the upsert step
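
Dataduct ships data-quality steps along these lines; the step names, keys, and tolerance below are assumptions for illustration only.

    -   step_type: primary-key-check
        table_definition: tables/prod.courses.sql     # flag duplicate primary keys
    -   step_type: count-check
        source_sql: SELECT COUNT(1) FROM courses            # row count at the source
        destination_sql: SELECT COUNT(1) FROM prod.courses  # row count in the warehouse
        tolerance: 0.5                                       # assumed percentage tolerance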

Slide 34

Pipeline 1: Amazon RDS → Amazon Redshift
● Teardown
• Amazon SNS alerting for failed tasks
• Logging of task failures
• Monitoring: run times, retries, machine health

Slide 35

Pipeline 1: Amazon RDS → Amazon Redshift (Visualization)
● Visualization
• Automatically generated by Dataduct
• Allows easy debugging

Slide 36

Pipeline 1: Amazon RDS → Amazon Redshift
● Shared Config
• IAM roles
• AMI
• Security group
• Retries
• Custom steps
• Resource paths
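
The shared config is itself YAML. A sketch with placeholder values; the section and key names are assumptions modeled on the documented config, and the AMI, security group, and role names are hypothetical.

    ec2:
        INSTANCE_TYPE: m3.large               # default resource for EC2-based steps
        ETL_AMI: ami-12345678                 # placeholder AMI id
        SECURITY_GROUP: etl-security-group
    etl:
        ROLE: DataPipelineDefaultRole                   # IAM role assumed by the pipelines
        RESOURCE_ROLE: DataPipelineDefaultResourceRole
        MAX_RETRIES: 2
        S3_BASE_PATH: prod                              # assumed key for resource paths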

Slide 37

Pipeline 1: Amazon RDS → Amazon Redshift
● Custom steps
• Open-sourced steps can easily be shared across multiple pipelines
• You can also create new steps and add them using the config
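
A sketch of registering a custom step in the shared config so pipelines can refer to it by name; the class, file path, and step name are hypothetical and the keys are assumptions.

    custom_steps:
    -   step_type: aegisthus                   # name pipelines will use in step_type
        class_name: AegisthusStep              # hypothetical Python class implementing the step
        file_path: custom_steps/aegisthus.py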

Slide 38

Deploying a pipeline
● Command line interface for all operations

    usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b]
                                      pipeline_definitions [pipeline_definitions ...]
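
For example, a single definition could be activated with a command along the lines of dataduct pipeline activate -m production rds_to_redshift.yaml, where the mode name and the definition file are placeholders for your own setup.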

Slide 39

Pipeline 2: Cassandra → Amazon Redshift
(Architecture diagram: Cassandra → Amazon EMR running Aegisthus → Amazon S3 → Amazon EMR running Scalding → Amazon S3 → Amazon Redshift, orchestrated by AWS Data Pipeline)

Slide 40

Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to the rescue

Slide 41

Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to the rescue
● Priam backups of Cassandra to Amazon S3

Slide 42

Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to the rescue
● Priam backups of Cassandra to Amazon S3
● Aegisthus to parse SSTables into Avro dumps

Slide 43

Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to the rescue
● Priam backups of Cassandra to Amazon S3
● Aegisthus to parse SSTables into Avro dumps
● Scalding to process the Aegisthus output
• Extend the base steps to create more patterns
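
Put together, the step list for this pipeline might be sketched as below. The aegisthus and scalding step types assume custom steps registered in the shared config as shown earlier; the bucket, job class, and table names are all illustrative.

    steps:
    -   step_type: aegisthus                 # parse the Priam SSTable backup in S3 into Avro
        input_path: s3://example-bucket/priam/example_keyspace/
    -   step_type: scalding                  # aggregate the Avro dump into load-ready files
        job_name: com.example.CourseMembershipJob
    -   step_type: create-load-redshift
        table_definition: tables/staging.course_memberships.sql
    -   step_type: upsert
        source: tables/staging.course_memberships.sql
        destination: tables/prod.course_memberships.sql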

Slide 44

Pipeline 2: Cassandra → Amazon Redshift

Slide 45

Pipeline 2: Cassandra → Amazon Redshift
● Custom steps
• Aegisthus
• Scalding

Slide 46

Pipeline 2: Cassandra → Amazon Redshift
● EMR config overrides the defaults
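
A sketch of what a per-pipeline EMR override might look like; the section name, keys, cluster size, and bootstrap script are assumptions for illustration.

    emr_cluster_config:
        num_instances: 10                    # larger cluster than the shared default
        instance_size: m3.xlarge
        bootstrap: s3://example-bucket/bootstrap/install_aegisthus.sh   # hypothetical script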

Slide 47

Pipeline 2: Cassandra → Amazon Redshift
● Multiple output nodes from the transform step
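
A transform step can fan out to several named output nodes that downstream steps consume. A sketch, with an assumed script and key names:

    -   step_type: transform
        script: scripts/split_events.py      # hypothetical script writing two outputs
        output_node:
        -   enrollments
        -   completions
    -   step_type: create-load-redshift
        input_node: enrollments              # load just one of the outputs
        table_definition: tables/staging.enrollments.sql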

Slide 48

Pipeline 2: Cassandra → Amazon Redshift
● Bootstrap
• Save every pipeline definition
• Fetch the new JAR for the Amazon EMR jobs
• Use the same Hadoop / Hive metastore installation

Slide 49

Data products
● We've talked about getting data into the warehouse
● Common pattern:
• Wait for dependencies
• Computation inside Amazon Redshift to create derived tables
• Amazon EMR activities for more complex processing
• Load back into MySQL / Cassandra
• Product features query MySQL / Cassandra
● Used in recommendations, dashboards, and search
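
A minimal sketch of that pattern as a step list, assuming a sql-command style step for the in-Redshift computation; the extract-redshift keys and the load-mysql step are hypothetical placeholders for whatever mechanism pushes data back to the serving store.

    steps:
    -   step_type: sql-command                # build the derived table inside Redshift
        command: INSERT INTO derived.course_stats SELECT course_id, COUNT(1) FROM enrollments GROUP BY 1;
    -   step_type: extract-redshift           # unload the derived table to Amazon S3 (keys assumed)
        schema: derived
        table: course_stats
    -   step_type: load-mysql                 # hypothetical step loading back into the serving DB
        table: course_stats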

Slide 50

Recommendations
● Objective
• Connect the learner to the right content
● Use cases:
• Recommendations email
• Course discovery
• Reactivation of users

Slide 51

Recommendations
● Computation inside Amazon Redshift to create derived tables for co-enrollments
● Amazon EMR job for model training
● Model file pushed to Amazon S3
● Prediction API uses the updated model file
● The contract between the prediction and training layers is the model definition

Slide 52

Internal Dashboard
● Objective
• Serve internal dashboards to create a data-driven culture
● Use cases
• Key performance indicators for the company
• Track results of different A/B experiments

Slide 53

Internal Dashboard

Slide 54

Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)

Slide 55

Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in a library instead of scripts being passed to every pipeline

Slide 56

Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in a library instead of scripts being passed to every pipeline
• The test environment in staging should mimic prod

Slide 57

Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in a library instead of scripts being passed to every pipeline
• The test environment in staging should mimic prod
• A shared library to democratize writing of ETL

Slide 58

Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in a library instead of scripts being passed to every pipeline
• The test environment in staging should mimic prod
• A shared library to democratize writing of ETL
• Using read replicas and backups

Slide 59

Learnings
● Don't:
• The same code passed to multiple pipelines as a script

Slide 60

Learnings
● Don't:
• The same code passed to multiple pipelines as a script
• Non-version-controlled pipelines

Slide 61

Learnings
● Don't:
• The same code passed to multiple pipelines as a script
• Non-version-controlled pipelines
• Huge pipelines instead of small, modular pipelines with dependencies

Slide 62

Learnings
● Don't:
• The same code passed to multiple pipelines as a script
• Non-version-controlled pipelines
• Huge pipelines instead of small, modular pipelines with dependencies
• Not catching resource timeouts or load delays

Slide 63

Dataduct
● Code reuse
● Extensibility
● Command line interface
● Staging environment support
● Dynamic updates

Slide 64

Dataduct
● Repository: https://github.com/coursera/dataduct
● Documentation: http://dataduct.readthedocs.org/en/latest/
● Installation: pip install dataduct

Slide 65

Questions? Also, we are hiring! https://www.coursera.org/jobs

Slide 66

Remember to complete your evaluations!

Slide 67

Thank you!
Also, we are hiring! https://www.coursera.org/jobs
Sourabh Bajaj (sb2nov, @sb2nov)