To and Fro from Amazon Redshift

Coursera is an online education startup with over 19 million learners across the globe. At Coursera we use Redshift as our primary data warehouse because it provides a standard SQL interface and fast, reliable performance. We use our open-source framework Dataduct to move data to and from Redshift. In this talk we’ll cover the workflow service at Coursera and how it is now being used for use cases beyond ETL, such as machine learning, predictions, and bulk loading into Cassandra.


Sourabh

May 10, 2016

Transcript

  1. Sourabh Bajaj, Software Engineer. 18 March 2016. To and Fro from Amazon Redshift: Extending our workflow service for use cases beyond ETL
  2. None
  3. Our mission: Universal access to the world’s best education
  4. By the numbers: 19M registered learners, 3.6M course completions, 1,000+ active courses, 140 partner institutions
  5. About me: Georgia Tech, CS; Analytics at Coursera; @sb2nov. Machine Learning; Distributed Systems; Teaching and Learning; Computational Finance
  6. Agenda: Dataduct; 3 Cs; Beyond ETL

  7. Dataduct: 1. A wrapper on AWS Data Pipeline. 2. Built in-house at Coursera. 3. Open source at github.com/coursera/dataduct
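
    A Dataduct pipeline is declared in a single YAML file, which Dataduct compiles into an AWS Data Pipeline. A minimal sketch in the style of the examples in the GitHub repo (names and values here are illustrative, not from the talk):

        name: example_load_redshift
        frequency: one-time        # or a recurring schedule, handled by Data Pipeline
        load_time: 01:00           # Hour:Min in UTC

        steps:
        - step_type: extract-local    # read an input file
          path: data/test_table.tsv
        - step_type: load-redshift    # COPY the extracted data into Redshift
          schema: dev
          table: test_table
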
  8. Dataduct

  9. Dataduct #1: Fault Tolerance. Recover from machine or transient network failures.
  10. Dataduct #1: Fault Tolerance. Recover from machine or transient network failures. #2: Scheduling. Ability to run at different frequencies.
  11. Dataduct #1: Fault Tolerance. Recover from machine or transient network failures. #2: Scheduling. Ability to run at different frequencies. #3: Resource Management. Manage the EC2 / EMR resources required.
  12. Dataduct #4: Dependency Management. Manage dependencies on other pipelines and steps.
  13. Dataduct #4: Dependency Management. Manage dependencies on other pipelines and steps. #5: Extensibility. Easy to run new types of jobs and steps.
  14. Dataduct #4: Dependency Management. Manage dependencies on other pipelines and steps. #5: Extensibility. Easy to run new types of jobs and steps. #6: Developer Friendliness. Easy development and deployment (see the sketch below).
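
    On extensibility (#5), a new job type can usually be added without writing a new step class by shelling out to a script. A minimal sketch assuming Dataduct's generic transform step (the script path is hypothetical):

        steps:
        - step_type: transform            # run an arbitrary script on the pipeline's EC2 resource
          script: scripts/custom_job.py   # hypothetical script; its outputs feed later steps

    For developer friendliness (#6), pipelines are deployed from the command line, on the order of "dataduct pipeline activate my_pipeline.yaml" (see the repo docs for the exact invocation).
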
  15. 3 Cs: Collecting, Curating, Capitalizing

  16. Case Study: Instructor Dashboards

  17. Case study: Instructor dashboards. Student demographics.

  18. Case study: Instructor dashboards. Student demographics; identify learner misconceptions.

  19. Case study: Instructor dashboards. Student demographics; highlight learner dropout points; identify learner misconceptions.
  20. Case study: Instructor dashboards. Student demographics; highlight learner dropout points; identify learner misconceptions.
  21. 3 Cs: Collecting, Curating, Capitalizing

  22. Collecting: Build systems that make it easy to collect data.
  23. Collecting services: web and mobile clients → eventing service → Redshift.
  24. Collecting: Netflix Aegisthus → Dataduct + Redshift.
  25. Collecting: an example Scalding job used in the collection path:

        // Scalding TypedPipe job: emit one TSV row per course branch
        def processBranches(branchPipe: TypedPipe[BranchModel], outputPath: String): Unit = {
          branchPipe
            .map { branch =>
              (StringKey(branch.branchId).key,
               StringKey(branch.courseId).key,
               branch.changesDescription.map(_.value).getOrElse(""))
            }
            .write(TypedTsv[COURSE_BRANCHES_OUTPUT_FORMAT](outputPath))
        }
  26. Collecting: Definition in YAML; steps; visualization; reusable code:

        steps:
        - type: extract-from-rds
          sql: |
            SELECT instructor_id
              ,course_id
              ,rank
            FROM courses_instructorincourse;
          hostname: host_db_1
          database: master
        - type: load-into-staging-table
          table: staging.instructors_sessions
        - type: reload-prod-table
          source: staging.instructors_sessions
          destination: prod.instructors_sessions
  27. Case study: Instructor dashboards. Learner → eventing data (student progress) + Cassandra data (course content) → Redshift raw tables.
  28. 3 Cs: Collecting, Curating, Capitalizing

  29. Curating: Data quality: 1. Correctness. 2. Completeness. 3. Interpretability.
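
    Quality checks like these can run inside the pipelines themselves as QA steps. A sketch of what that might look like; the step names primary-key-check and count-check are assumptions for illustration, not from the slides:

        steps:
        - step_type: primary-key-check    # correctness: no duplicate primary keys
          table_definition: tables/prod.instructors_sessions.sql    # hypothetical table definition file
        - step_type: count-check          # completeness: staging and prod row counts agree
          source_sql: SELECT COUNT(1) FROM staging.instructors_sessions
          destination_sql: SELECT COUNT(1) FROM prod.instructors_sessions
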
  30. Curating: Standardize the business definitions. raw schema → BI schema → analyses.
  31. Curating: Standardization, e.g. waiting on upstream pipelines and then rebuilding BI tables:

        steps:
        - step_type: pipeline-dependencies
          name: wait_for_dependencies
          dependent_pipelines:
          - raw_events
          - recommendations
        - step_type: create-update-sql
          name: discovery_clicks
          depends_on: wait_for_dependencies
          script: scripts/bi.discovery_clicks.sql
          table_definition: bi.discovery_clicks.sql
        - step_type: create-update-sql
          name: discovery_impressions
          depends_on: wait_for_dependencies
          script: scripts/bi.discovery_impressions.sql
          table_definition: bi.discovery_impressions.sql
  32. Case study: Instructor dashboards. Learner → eventing data (student progress) + Cassandra data (course content) → Redshift raw tables → Dataduct → Redshift BI tables (cumulative progress per student & course).
  33. 3 Cs: Collecting, Curating, Capitalizing

  34. Capitalizing: Building data products within the production ecosystem. Redshift → Nostos loader → Nostos service (key/value access).
  35. Capitalizing: Building data products within the production ecosystem.
  36. Capitalizing: Nostos loading is configured as a pipeline step:

        steps:
        - step_type: nostos-v2
          job_name: example
          index_column: user_id
          fields:
          - generator: sql_iterator
            sql: >
              SELECT user_id
                ,session_id::VARCHAR AS entityId1
              FROM prod.enrollments
              WHERE user_id < 100
          - generator: sql_iterator
            is_set: true
            sql: >
              SELECT user_id
                ,session_id::VARCHAR AS entityId2
              FROM prod.enrollments
              WHERE user_id < 100
  37. Case study: Instructor dashboards. Learner → eventing data (student progress) + Cassandra data (course content) → Redshift raw tables → Dataduct → Redshift BI tables (cumulative progress per student & course) → Nostos → materialized progress per course (KVS) → instructor.
  38. Beyond ETL. Why do this? 1. Leveraging current infrastructure accelerates the team. 2. New use cases keep coming up, and they have helped evolve Dataduct.
  39. Machine Learning: Daily model updates; multistage model training; hyperparameter tuning; performance benchmarks (see the sketch below).
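
    A daily model update reuses the same pipeline machinery as an ETL job. A minimal sketch, assuming an extract step to pull training data from Redshift and a transform step to run the training script (table and script names are hypothetical):

        name: model_retraining
        frequency: daily
        load_time: 04:00    # UTC

        steps:
        - step_type: extract-redshift       # pull the latest training data out of Redshift
          schema: prod
          table: enrollments
        - step_type: transform              # train, tune, and benchmark the model
          script: scripts/train_model.py    # hypothetical training script
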
  40. Experimentation: Hourly result calculations; parameter updates.
  41. Takeaways: Leveraging common infrastructure can really accelerate the team. Good infrastructure will be used in creative ways you might not anticipate.
  42. Thank You

  43. Questions?

  44. coursera.org/jobs building.coursera.org @CourseraEng