To and Fro from Amazon Redshift

Coursera is an online education startup with over 19 million learners across the globe. At Coursera, we use Redshift as our primary data warehouse because it provides a standard SQL interface and fast, reliable performance. We use our open-source framework, Dataduct, to move data to and from Redshift. In this talk, we’ll cover the workflow service at Coursera and how it is now used for cases beyond ETL, such as machine learning, predictions, and bulk loading into Cassandra.

Sourabh

May 10, 2016

Transcript

  1. Sourabh Bajaj, Software Engineer
    18 March 2016
    To and Fro from Amazon Redshift
    Extending our workflow service for use cases beyond ETL

  2. OUR MISSION:
    Universal access to
    the world’s best education

  3. By the numbers
    ● 3.6M course completions
    ● 1,000+ active courses
    ● 140 partner institutions
    ● 19M registered learners

  4. About me
    ● Georgia Tech, CS
    ● Analytics at Coursera
    ● @sb2nov
    Machine Learning
    Distributed Systems
    Teaching and Learning
    Computational Finance

  5. Agenda
    ● Dataduct
    ● 3 Cs
    ● Beyond ETL

  6. Dataduct
    1. Wrapper on AWS Data Pipeline
    2. Built in house at Coursera
    3. github.com/coursera/dataduct

  7. Dataduct
    #1: Fault Tolerance
    Recover from machine or transient network failures.

  8. Dataduct
    #1: Fault Tolerance
    Recover from machine or transient network failures.
    #2: Scheduling
    Ability to run at different frequencies

  9. Dataduct
    #1: Fault Tolerance
    Recover from machine or transient network failures.
    #2: Scheduling
    Ability to run at different frequencies
    #3: Resource Management
    Manage EC2 / EMR resources required
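
The fault-tolerance idea on this slide (recovering from transient failures) can be illustrated with a generic retry loop. This is a minimal sketch in Python, not Dataduct's or AWS Data Pipeline's actual mechanism; `TransientError`, `run_with_retries`, and `flaky_task` are hypothetical names introduced only for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Waits base_delay * 2 ** (attempt - 1) seconds between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))


# Example: a task that fails twice, then succeeds on the third attempt.
calls = {"n": 0}


def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("network blip")
    return "done"


# Pass a no-op sleep so the example does not actually wait.
result = run_with_retries(flaky_task, sleep=lambda seconds: None)
```

Only errors explicitly marked as transient are retried; anything else propagates immediately, which keeps genuine bugs from being hidden behind retries.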

  10. Dataduct
    #4: Dependency Management
    Manage dependencies on other pipelines and steps

  11. Dataduct
    #4: Dependency Management
    Manage dependencies on other pipelines and steps
    #5: Extensibility
    Easy to run new types of jobs and steps

  12. Dataduct
    #4: Dependency Management
    Manage dependencies on other pipelines and steps
    #5: Extensibility
    Easy to run new types of jobs and steps
    #6: Developer Friendliness
    Easy development and deployment

  13. 3 Cs
    Collecting Curating Capitalizing

  14. Case Study: Instructor Dashboards

  15. Case study:
    Instructor dashboards
    Student demographics

  16. Case study:
    Instructor dashboards
    Student demographics
    Identify learner misconceptions

  17. Case study:
    Instructor dashboards
    Student demographics
    Highlight learner dropout points
    Identify learner misconceptions

  18. Case study:
    Instructor dashboards
    Student demographics
    Highlight learner dropout points
    Identify learner misconceptions

  19. 3 Cs
    Collecting Curating Capitalizing

  20. Collecting
    Build systems that make it easy to collect data.
    Collecting Curating Capitalizing

  21. Collecting
    Collecting Curating Capitalizing
    services
    web
    mobile
    Eventing service
    Redshift

  22. Collecting
    Collecting Curating Capitalizing
    Netflix Aegisthus
    Dataduct + Redshift
  23. Collecting
    Collecting Curating Capitalizing
    // Flatten each branch to (branchId, courseId, changesDescription) and write it out as TSV.
    def processBranches(branchPipe: TypedPipe[BranchModel], outputPath: String): Unit = {
      branchPipe
        .map { branch =>
          (StringKey(branch.branchId).key,
           StringKey(branch.courseId).key,
           branch.changesDescription.map(_.value).getOrElse(""))
        }
        .write(TypedTsv[COURSE_BRANCHES_OUTPUT_FORMAT](outputPath))
    }
    Collecting

  24. Collecting
    ● Definition in YAML
    ● Steps
    ● Visualization
    ● Reusable code
    Collecting Curating Capitalizing
    steps:
    - type: extract-from-rds
      sql: |
        SELECT instructor_id
        ,course_id
        ,rank
        FROM courses_instructorincourse;
      hostname: host_db_1
      database: master
    - type: load-into-staging-table
      table: staging.instructors_sessions
    - type: reload-prod-table
      source: staging.instructors_sessions
      destination: prod.instructors_sessions
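
The three steps on this slide follow an extract, stage, reload pattern. A minimal sketch of that pattern in Python, assuming an in-memory SQLite database as a stand-in for Redshift: the underscore table names, the sample rows, and `reload_prod_table` are hypothetical, and this is not how Dataduct implements the steps.

```python
import sqlite3


def reload_prod_table(conn, staging, prod):
    """Replace the contents of `prod` with the rows in `staging` in one transaction.

    Table names are interpolated directly, so they must be trusted constants.
    """
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(f"DELETE FROM {prod}")
        conn.execute(f"INSERT INTO {prod} SELECT * FROM {staging}")


conn = sqlite3.connect(":memory:")
columns = "(instructor_id INTEGER, course_id INTEGER, rank INTEGER)"
conn.execute(f"CREATE TABLE staging_instructors_sessions {columns}")
conn.execute(f"CREATE TABLE prod_instructors_sessions {columns}")
conn.execute("INSERT INTO prod_instructors_sessions VALUES (1, 10, 1)")  # stale row

# Stand-in for the extract and load-into-staging steps (made-up sample rows).
extracted = [(1, 10, 2), (2, 11, 1)]
conn.executemany(
    "INSERT INTO staging_instructors_sessions VALUES (?, ?, ?)", extracted
)

reload_prod_table(conn, "staging_instructors_sessions", "prod_instructors_sessions")
prod_rows = conn.execute(
    "SELECT * FROM prod_instructors_sessions ORDER BY instructor_id"
).fetchall()
```

Loading into a staging table first means a failed extract never leaves the production table half-written; readers see either the old rows or the new ones.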

  25. Case study: Instructor dashboards
    Collecting Curating Capitalizing
    Eventing data
    (Student progress)
    Cassandra data
    (Course content)
    Redshift
    Raw tables
    Learner

  26. 3 Cs
    Collecting Curating Capitalizing

  27. Curating
    Data quality
    Collecting Curating Capitalizing
    1. Correctness
    2. Completeness
    3. Interpretability

  28. Curating
    Standardize the business definitions
    Collecting Curating Capitalizing
    BI schema
    raw schema
    analyses

  29. Curating
    Standardization
    Collecting Curating Capitalizing
    steps:
    - step_type: pipeline-dependencies
      name: wait_for_dependencies
      dependent_pipelines:
      - raw_events
      - recommendations
    - step_type: create-update-sql
      name: discovery_clicks
      depends_on: wait_for_dependencies
      script: scripts/bi.discovery_clicks.sql
      table_definition: bi.discovery_clicks.sql
    - step_type: create-update-sql
      name: discovery_impressions
      depends_on: wait_for_dependencies
      script: scripts/bi.discovery_impressions.sql
      table_definition: bi.discovery_impressions.sql
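
The `depends_on` entries on this slide describe a small dependency graph. As a rough illustration of how such a graph can be turned into a run order, here is a sketch using Python's standard-library `graphlib` (Python 3.9+). The `execution_order` helper is hypothetical and says nothing about how Dataduct or AWS Data Pipeline schedules steps internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Step names and depends_on values copied from the slide; other keys omitted.
steps = [
    {"name": "wait_for_dependencies"},
    {"name": "discovery_clicks", "depends_on": "wait_for_dependencies"},
    {"name": "discovery_impressions", "depends_on": "wait_for_dependencies"},
]


def execution_order(steps):
    """Return step names ordered so that every step follows its dependencies."""
    graph = {}
    for step in steps:
        deps = step.get("depends_on", [])
        if isinstance(deps, str):  # accept a single name or a list of names
            deps = [deps]
        graph[step["name"]] = set(deps)
    # static_order() raises graphlib.CycleError if the steps depend on each other.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
```

The two `create-update-sql` steps depend only on `wait_for_dependencies`, so their relative order is unconstrained and they could run in parallel.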

  30. Case study: Instructor dashboards
    Collecting Curating Capitalizing
    Redshift
    BI tables
    Cumulative progress
    per student & course
    Dataduct
    Eventing data
    (Student progress)
    Cassandra data
    (Course content)
    Redshift
    Raw tables
    Learner

  31. 3 Cs
    Collecting Curating Capitalizing

  32. Capitalizing
    Building data products within production ecosystem
    Collecting Curating Capitalizing
    Nostos
    service
    Key/Value access
    Redshift
    Nostos Loader

  33. Capitalizing
    Building data products within production ecosystem
    Collecting Curating Capitalizing

  34. Capitalizing
    Nostos
    Collecting Curating Capitalizing
    steps:
    - step_type: nostos-v2
      job_name: example
      index_column: user_id
      fields:
      - generator: sql_iterator
        sql: >
          SELECT user_id
          ,session_id::VARCHAR AS entityId1
          FROM prod.enrollments
          WHERE user_id < 100
      - generator: sql_iterator
        is_set: true
        sql: >
          SELECT user_id
          ,session_id::VARCHAR AS entityId2
          FROM prod.enrollments
          WHERE user_id < 100
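
One way to read this configuration is that each field's query results are merged into a single key/value document per `index_column` value. The sketch below illustrates that reading in Python. It rests on assumptions: the talk does not describe Nostos internals, the meaning of `is_set` (collect all values into a set rather than keep one) is a guess, and `materialize` and the sample rows are hypothetical.

```python
from collections import defaultdict


def materialize(fields, index_column):
    """Merge the rows of every field into one document per index value."""
    store = defaultdict(dict)
    for field in fields:
        for row in field["rows"]:
            key = row[index_column]
            for column, value in row.items():
                if column == index_column:
                    continue
                if field.get("is_set", False):
                    # Assumed semantics: gather every value for the key.
                    store[key].setdefault(column, set()).add(value)
                else:
                    store[key][column] = value  # last row wins
    return dict(store)


# Made-up rows standing in for the results of the two SQL queries.
enrollments = [(1, "s1"), (1, "s2"), (2, "s3")]
fields = [
    {"rows": [{"user_id": u, "entityId1": s} for u, s in enrollments]},
    {"is_set": True,
     "rows": [{"user_id": u, "entityId2": s} for u, s in enrollments]},
]

kv = materialize(fields, index_column="user_id")
```

Under these assumptions, a lookup by `user_id` returns everything the instructor dashboard needs for that key in a single read.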

  35. Case study: Instructor dashboards
    Collecting Curating Capitalizing
    Materialized progress
    per course (KVS)
    Nostos
    Instructor
    Redshift
    BI tables
    Cumulative progress
    per student & course
    Dataduct
    Eventing data
    (Student progress)
    Cassandra data
    (Course content)
    Redshift
    Raw tables
    Learner

  36. Beyond ETL
    Why should we do this?
    1. Leveraging current infrastructure accelerates the team
    2. New use cases keep coming up and have helped Dataduct evolve

  37. Machine Learning
    ● Daily model updates
    ● Multistage model training
    ● Hyperparameter tuning
    ● Performance benchmarks

  38. Experimentation
    ● Hourly result calculations
    ● Parameter updates

  39. Takeaways:
    ● Leveraging common infrastructure can really accelerate the team
    ● Good infrastructure gets used in creative ways that you might not anticipate

  40. coursera.org/jobs
    building.coursera.org
    @CourseraEng
