
To and Fro from Amazon Redshift

Coursera is an online education startup with over 19 million learners across the globe. At Coursera we use Redshift as our primary data warehouse because it provides a standard SQL interface and fast, reliable performance. We use our open-source framework Dataduct to move data to and from Redshift. In this talk we'll cover the workflow service at Coursera and how it is now used for use cases beyond ETL, such as machine learning, predictions, and bulk loading into Cassandra.

Sourabh

May 10, 2016
Transcript

  1. Sourabh Bajaj, Software Engineer. 18 March 2016. To and Fro from Amazon Redshift: Extending our workflow service for use cases beyond ETL
  2. OUR MISSION: Universal access to the world's best education
  3. About me • Georgia Tech, CS • Analytics at Coursera • @sb2nov. Interests: Machine Learning, Distributed Systems, Teaching and Learning, Computational Finance
  4. Dataduct: 1. Wrapper on AWS Data Pipeline 2. Built in-house at Coursera 3. github.com/coursera/dataduct
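
For context, a Dataduct pipeline is declared in a single YAML file that names the pipeline, sets its schedule, and lists its steps in order. A minimal sketch, closely following the load-redshift example in the dataduct documentation (the table and path values here are illustrative):

    name: example_load_redshift
    frequency: one-time
    load_time: 01:00    # HH:MM, when the pipeline starts
    description: Load a local TSV file into a Redshift table

    steps:
    - step_type: extract-local
      path: data/test_table1.tsv
    - step_type: load-redshift
      schema: dev
      table: test_table_1

The dataduct CLI compiles a definition like this into an AWS Data Pipeline, which then provisions resources and runs the steps.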
  5. Dataduct. #1: Fault Tolerance: recover from machine or transient network failures. #2: Scheduling: ability to run pipelines at different frequencies
  6. Dataduct. #1: Fault Tolerance: recover from machine or transient network failures. #2: Scheduling: ability to run pipelines at different frequencies. #3: Resource Management: manage the EC2 / EMR resources a pipeline requires
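
Concretely, scheduling is part of each pipeline definition, while the EC2/EMR resources that run the steps are configured centrally. A rough sketch: `frequency` and `load_time` appear in the dataduct examples, but the config section below is an assumption based on the sample dataduct.cfg, so treat its exact field names as illustrative:

    # In the pipeline definition: run once a day at 02:00
    name: nightly_course_etl    # hypothetical pipeline name
    frequency: daily
    load_time: 02:00

    # In dataduct.cfg: instance sizing shared across pipelines (assumed field names)
    ec2:
      INSTANCE_TYPE: m1.large
    emr:
      NUM_CORE_INSTANCES: 2
      CORE_INSTANCE_TYPE: m1.large

Fault tolerance comes largely from AWS Data Pipeline itself, which retries failed activities and can re-provision the underlying machines.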
  7. Dataduct. #4: Dependency Management: manage dependencies on other pipelines and steps. #5: Extensibility: easy to add new types of jobs and steps
  8. Dataduct. #4: Dependency Management: manage dependencies on other pipelines and steps. #5: Extensibility: easy to add new types of jobs and steps. #6: Developer Friendliness: easy development and deployment
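
Dependency management and extensibility both appear later in the talk; as a preview, a pipeline can block on other pipelines and can shell out to an arbitrary script, which is the usual starting point for a new step type. A sketch assuming the `pipeline-dependencies` step shown on slide 12 and dataduct's generic `transform` step (names and paths are hypothetical):

    steps:
    - step_type: pipeline-dependencies
      name: wait_for_raw_events
      dependent_pipelines:
      - raw_events
    - step_type: transform
      name: custom_job
      depends_on: wait_for_raw_events
      script: scripts/custom_job.py    # any executable, run on the pipeline's EC2 resource

Once a script-based job proves broadly useful, it can be promoted to a first-class step type, as with the nostos-v2 step shown later.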
  9. Collecting (Collecting Curating Capitalizing)

    def processBranches(branchPipe: TypedPipe[BranchModel], outputPath: String): Unit = {
      branchPipe
        .map { branch =>
          (StringKey(branch.branchId).key,
           StringKey(branch.courseId).key,
           branch.changesDescription.map(_.value).getOrElse(""))
        }
        .write(TypedTsv[COURSE_BRANCHES_OUTPUT_FORMAT](outputPath))
    }
  10. Collecting • Definition in YAML • Steps • Visualization • Reusable code (Collecting Curating Capitalizing)

    steps:
    - type: extract-from-rds
      sql: |
        SELECT instructor_id
            ,course_id
            ,rank
        FROM courses_instructorincourse;
      hostname: host_db_1
      database: master
    - type: load-into-staging-table
      table: staging.instructors_sessions
    - type: reload-prod-table
      source: staging.instructors_sessions
      destination: prod.instructors_sessions
  11. Case study: Instructor dashboards (Collecting Curating Capitalizing). Diagram: Learner → eventing data (student progress) and Cassandra data (course content) → Redshift raw tables
  12. Curating: Standardization (Collecting Curating Capitalizing)

    steps:
    - step_type: pipeline-dependencies
      name: wait_for_dependencies
      dependent_pipelines:
      - raw_events
      - recommendations
    - step_type: create-update-sql
      name: discovery_clicks
      depends_on: wait_for_dependencies
      script: scripts/bi.discovery_clicks.sql
      table_definition: bi.discovery_clicks.sql
    - step_type: create-update-sql
      name: discovery_impressions
      depends_on: wait_for_dependencies
      script: scripts/bi.discovery_impressions.sql
      table_definition: bi.discovery_impressions.sql
  13. Case study: Instructor dashboards (Collecting Curating Capitalizing). Diagram: Learner → eventing data (student progress) and Cassandra data (course content) → Redshift raw tables → Dataduct → Redshift BI tables (cumulative progress per student & course)
  14. Capitalizing: Nostos (Collecting Curating Capitalizing)

    steps:
    - step_type: nostos-v2
      job_name: example
      index_column: user_id
      fields:
      - generator: sql_iterator
        sql: >
          SELECT user_id
              ,session_id::VARCHAR AS entityId1
          FROM prod.enrollments
          WHERE user_id < 100
      - generator: sql_iterator
        is_set: true
        sql: >
          SELECT user_id
              ,session_id::VARCHAR AS entityId2
          FROM prod.enrollments
          WHERE user_id < 100
  15. Case study: Instructor dashboards (Collecting Curating Capitalizing). Diagram: Learner → eventing data (student progress) and Cassandra data (course content) → Redshift raw tables → Dataduct → Redshift BI tables (cumulative progress per student & course) → Nostos → materialized progress per course (KVS) → Instructor
  16. Beyond ETL. Why should we do this? 1. Leveraging existing infrastructure accelerates the team 2. New use cases keep coming up, and they have helped Dataduct evolve
  17. Machine Learning • Daily model updates • Multistage model training • Hyperparameter tuning • Performance benchmarks
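
Expressed in the same vocabulary, a training pipeline is just another DAG of steps. The sketch below is hypothetical: it assumes dataduct's `extract-redshift` and `transform` step types, and every name, path, and table is invented for illustration:

    name: daily_model_training    # hypothetical
    frequency: daily
    load_time: 04:00

    steps:
    - step_type: extract-redshift
      schema: prod
      table: enrollments
    - step_type: transform
      name: train_model
      script: scripts/train_model.py    # stage 1: fit the model on the extracted data
    - step_type: transform
      name: tune_hyperparameters
      depends_on: train_model
      script: scripts/tune_hyperparameters.py    # stage 2: sweep parameters, record benchmarks

Multistage training falls out of `depends_on`, and daily model updates fall out of `frequency: daily`.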
  18. Takeaways: • Leveraging common infrastructure can really accelerate the team • Good infrastructure will be used in creative ways that you might not anticipate