Slide 1

Slide 1 text

To and Fro from Amazon Redshift: Extending our workflow service for use cases beyond ETL. Sourabh Bajaj, Software Engineer. 18 March 2016.

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Our mission: Universal access to the world’s best education

Slide 4

Slide 4 text

By the numbers ● 19M registered learners ● 140 partner institutions ● 1,000+ active courses ● 3.6M course completions

Slide 5

Slide 5 text

About me ● Georgia Tech, CS ● Analytics at Coursera ● @sb2nov
Machine Learning, Distributed Systems, Teaching and Learning, Computational Finance

Slide 6

Slide 6 text

Agenda ● Dataduct ● 3 Cs ● Beyond ETL

Slide 7

Slide 7 text

Dataduct 1. Wrapper on AWS Data Pipeline 2. Built in-house at Coursera 3. github.com/coursera/dataduct

Slide 8

Slide 8 text

Dataduct

Slide 9

Slide 9 text

Dataduct #1: Fault Tolerance Recover from machine or transient network failures.

Slide 10

Slide 10 text

Dataduct #1: Fault Tolerance Recover from machine or transient network failures. #2: Scheduling Ability to run at different frequencies

Slide 11

Slide 11 text

Dataduct #1: Fault Tolerance Recover from machine or transient network failures. #2: Scheduling Ability to run at different frequencies #3: Resource Management Manage EC2 / EMR resources required
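A minimal sketch of the fault-tolerance idea in #1, assuming a step is simply retried a bounded number of times with exponential backoff. This is not Dataduct's or AWS Data Pipeline's actual retry mechanism; `run_with_retries`, `TransientError`, and the delay values are hypothetical names chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network blip)."""


def run_with_retries(step, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call step() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts.
    Any exception other than TransientError propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a step that fails twice, then succeeds on the third attempt.
calls = {"count": 0}


def flaky_step():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary network failure")
    return "loaded"


result = run_with_retries(flaky_step, max_attempts=3)
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting.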

Slide 12

Slide 12 text

Dataduct #4: Dependency Management Manage dependencies on other pipelines and steps

Slide 13

Slide 13 text

Dataduct #4: Dependency Management Manage dependencies on other pipelines and steps #5: Extensibility Easy to run new types of jobs and steps

Slide 14

Slide 14 text

Dataduct #4: Dependency Management Manage dependencies on other pipelines and steps #5: Extensibility Easy to run new types of jobs and steps #6: Developer Friendliness Easy development and deployment

Slide 15

Slide 15 text

3 Cs Collecting Curating Capitalizing

Slide 16

Slide 16 text

Case Study: Instructor Dashboards

Slide 17

Slide 17 text

Case study: Instructor dashboards ● Student demographics

Slide 18

Slide 18 text

Case study: Instructor dashboards ● Student demographics ● Identify learner misconceptions

Slide 19

Slide 19 text

Case study: Instructor dashboards ● Student demographics ● Highlight learner dropout points ● Identify learner misconceptions

Slide 20

Slide 20 text

Case study: Instructor dashboards ● Student demographics ● Highlight learner dropout points ● Identify learner misconceptions

Slide 21

Slide 21 text

3 Cs Collecting Curating Capitalizing

Slide 22

Slide 22 text

Collecting: Build systems that make it easy to collect data.

Slide 23

Slide 23 text

Collecting (diagram): services, web, mobile -> Eventing service -> Redshift

Slide 24

Slide 24 text

Collecting (diagram labels): Netflix Aegisthus; Dataduct + Redshift

Slide 25

Slide 25 text

Collecting
def processBranches(branchPipe: TypedPipe[BranchModel], outputPath: String): Unit = {
  branchPipe
    .map { branch =>
      (StringKey(branch.branchId).key,
       StringKey(branch.courseId).key,
       branch.changesDescription.map(_.value).getOrElse(""))
    }
    .write(TypedTsv[COURSE_BRANCHES_OUTPUT_FORMAT](outputPath))
}

Slide 26

Slide 26 text

Collecting ● Definition in YAML ● Steps ● Visualization ● Reusable code
steps:
- type: extract-from-rds
  sql: |
    SELECT instructor_id
          ,course_id
          ,rank
    FROM courses_instructorincourse;
  hostname: host_db_1
  database: master
- type: load-into-staging-table
  table: staging.instructors_sessions
- type: reload-prod-table
  source: staging.instructors_sessions
  destination: prod.instructors_sessions
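The three step types in this pipeline (extract, load into staging, reload prod) can be illustrated with a rough sketch. It assumes that "reload" means replacing the prod table's contents with the staging table's contents in a single transaction; the slide does not spell out Dataduct's actual behavior. In-memory SQLite stands in for both RDS and Redshift, and the function names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the source database and the warehouse.
# SQLite has no staging./prod. schemas, so underscores replace the dots.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE courses_instructorincourse (instructor_id INT, course_id INT, rank INT);
    INSERT INTO courses_instructorincourse VALUES (1, 10, 1), (2, 10, 2);
    CREATE TABLE staging_instructors_sessions (instructor_id INT, course_id INT, rank INT);
    CREATE TABLE prod_instructors_sessions (instructor_id INT, course_id INT, rank INT);
    INSERT INTO prod_instructors_sessions VALUES (99, 99, 99);
""")


def extract(conn):
    """Step 1: pull rows from the source table."""
    return conn.execute(
        "SELECT instructor_id, course_id, rank FROM courses_instructorincourse"
    ).fetchall()


def load_into_staging(conn, rows):
    """Step 2: replace the staging table contents with the extracted rows."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM staging_instructors_sessions")
        conn.executemany(
            "INSERT INTO staging_instructors_sessions VALUES (?, ?, ?)", rows
        )


def reload_prod(conn):
    """Step 3: swap prod contents for staging contents in one transaction."""
    with conn:
        conn.execute("DELETE FROM prod_instructors_sessions")
        conn.execute(
            "INSERT INTO prod_instructors_sessions "
            "SELECT * FROM staging_instructors_sessions"
        )


load_into_staging(conn, extract(conn))
reload_prod(conn)
prod_rows = conn.execute(
    "SELECT * FROM prod_instructors_sessions ORDER BY instructor_id"
).fetchall()
```

Going through a staging table means a failed extract or load never leaves the prod table half-written; the stale row (99, 99, 99) disappears only once the fresh rows are ready.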

Slide 27

Slide 27 text

Case study: Instructor dashboards (diagram): Learner -> Eventing data (Student progress), Cassandra data (Course content) -> Redshift raw tables

Slide 28

Slide 28 text

3 Cs Collecting Curating Capitalizing

Slide 29

Slide 29 text

Curating: Data quality 1. Correctness 2. Completeness 3. Interpretability

Slide 30

Slide 30 text

Curating: Standardize the business definitions (diagram labels: raw schema, BI schema, analyses)

Slide 31

Slide 31 text

Curating: Standardization
steps:
- step_type: pipeline-dependencies
  name: wait_for_dependencies
  dependent_pipelines:
  - raw_events
  - recommendations
- step_type: create-update-sql
  name: discovery_clicks
  depends_on: wait_for_dependencies
  script: scripts/bi.discovery_clicks.sql
  table_definition: bi.discovery_clicks.sql
- step_type: create-update-sql
  name: discovery_impressions
  depends_on: wait_for_dependencies
  script: scripts/bi.discovery_impressions.sql
  table_definition: bi.discovery_impressions.sql
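The `depends_on` keys in the steps above imply an execution order. As a hedged sketch (not how Dataduct or AWS Data Pipeline resolves dependencies internally), the ordering can be derived with a topological sort from Python's standard `graphlib`. The `execution_order` helper and the trimmed-down step dictionaries are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Step names and depends_on values copied from the YAML definition;
# every other key is omitted for brevity.
steps = [
    {"name": "wait_for_dependencies"},
    {"name": "discovery_clicks", "depends_on": "wait_for_dependencies"},
    {"name": "discovery_impressions", "depends_on": "wait_for_dependencies"},
]


def execution_order(steps):
    """Return step names ordered so every step follows its dependencies."""
    graph = {}
    for step in steps:
        deps = step.get("depends_on", [])
        if isinstance(deps, str):  # allow a single name or a list of names
            deps = [deps]
        graph[step["name"]] = set(deps)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
```

Here `wait_for_dependencies` always comes first; the two `create-update-sql` steps do not depend on each other, so their relative order is unspecified and they could run in parallel.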

Slide 32

Slide 32 text

Case study: Instructor dashboards (diagram): Learner -> Eventing data (Student progress), Cassandra data (Course content) -> Redshift raw tables -> Dataduct -> Redshift BI tables (Cumulative progress per student & course)

Slide 33

Slide 33 text

3 Cs Collecting Curating Capitalizing

Slide 34

Slide 34 text

Capitalizing: Building data products within the production ecosystem (diagram): Redshift -> Nostos Loader -> Nostos service (Key/Value access)

Slide 35

Slide 35 text

Capitalizing: Building data products within the production ecosystem

Slide 36

Slide 36 text

Capitalizing: Nostos
steps:
- step_type: nostos-v2
  job_name: example
  index_column: user_id
  fields:
  - generator: sql_iterator
    sql: >
      SELECT user_id
            ,session_id::VARCHAR AS entityId1
      FROM prod.enrollments
      WHERE user_id < 100
  - generator: sql_iterator
    is_set: true
    sql: >
      SELECT user_id
            ,session_id::VARCHAR AS entityId2
      FROM prod.enrollments
      WHERE user_id < 100
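The slide does not describe how Nostos turns these query results into key/value records, so the following is only a guess at the shape of the idea. It assumes rows are grouped by `index_column`, that a field marked `is_set: true` collects all of its values into a set, and that an unmarked field keeps a single value (last row wins). `build_records` and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_records(index_column, fields):
    """Group per-field query rows into one record per index value.

    fields: list of dicts with 'name', 'rows' (a list of row dicts) and an
    optional 'is_set' flag. These semantics are assumptions, not Nostos's API.
    """
    records = defaultdict(dict)
    for field in fields:
        name = field["name"]
        for row in field["rows"]:
            key = row[index_column]
            if field.get("is_set", False):
                records[key].setdefault(name, set()).add(row[name])
            else:
                records[key][name] = row[name]  # last row wins
    return dict(records)


# Stand-in for rows of prod.enrollments.
enrollments = [
    {"user_id": 1, "session_id": 11},
    {"user_id": 1, "session_id": 12},
    {"user_id": 2, "session_id": 21},
]


def query(alias):
    """Mimic SELECT user_id, session_id::VARCHAR AS <alias>."""
    return [
        {"user_id": r["user_id"], alias: str(r["session_id"])}
        for r in enrollments
    ]


records = build_records(
    "user_id",
    [
        {"name": "entityId1", "rows": query("entityId1")},
        {"name": "entityId2", "is_set": True, "rows": query("entityId2")},
    ],
)
```

Under these assumptions a lookup by `user_id` returns all fields for that user in one read, which is the kind of key/value access a production service would need.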

Slide 37

Slide 37 text

Case study: Instructor dashboards (diagram): Learner -> Eventing data (Student progress), Cassandra data (Course content) -> Redshift raw tables -> Dataduct -> Redshift BI tables (Cumulative progress per student & course) -> Nostos -> Materialized progress per course (KVS) -> Instructor

Slide 38

Slide 38 text

Beyond ETL: Why should we do this? 1. Leveraging current infrastructure accelerates the team 2. New use cases keep coming up, and they have helped Dataduct evolve

Slide 39

Slide 39 text

Machine Learning ● Daily model updates ● Multistage model training ● Hyperparameter tuning ● Performance benchmarks

Slide 40

Slide 40 text

Experimentation ● Hourly result calculations ● Parameter updates

Slide 41

Slide 41 text

Takeaways: ● Leveraging common infrastructure can really accelerate the team ● Good infrastructure gets used in creative ways that you might not anticipate

Slide 42

Slide 42 text

Thank You

Slide 43

Slide 43 text

Questions?

Slide 44

Slide 44 text

coursera.org/jobs building.coursera.org @CourseraEng