Cassandra Batch loading for Data Products

Sourabh
July 14, 2016

Batch loading in Cassandra is useful for cases such as data migrations between different systems. Sourabh (software engineer at Coursera) will talk about Nostos (the batch loading service at Coursera) and some of its use cases in data products such as recommendations, search, and prediction models. We will cover some of the design choices and tradeoffs made in building Nostos and see how the system has evolved over the past year.

Transcript

  1. Sourabh Bajaj, Software Engineer
    14 July, 2016
    C* Batch loading for Data Products

  2. About me
    ● Georgia Tech, CS
    ● Analytics at Coursera
    ● @sb2nov
    Machine Learning
    Distributed Systems

  3. Agenda
    ● Motivation
    ● Architecture
    ● Data Products

  4. Motivation
    ▪ Bridging the Production Gap
    ▫ Decoupling data scientists from product teams so they can push results to production
    ▫ Easy to publish computed results online
    ▪ Programmatic access to data generated offline
    [Diagram: the analyst's model iteration loop and the product teams' product iteration loop, with a "?" marking the gap between them]

  5. Motivation
    ▪ Moving data from Redshift to Cassandra
    ▪ A few use cases in production:
    ▫ Recommendation features
    ▫ Search ranking features
    ▫ Language localization
    ▫ Featured lists on the homepage
    ▫ Instructor dashboards

  6. Architecture
    [Diagram components: Redshift, Nostos Loader, Key/Value store, Nostos service, API, Product backend]

  7. Architecture

  8. Architecture
    ▪ Integrated with the internal workflow manager dataduct for scheduling jobs and managing resources
    ▪ The offline jobs are specified as YAML files and use a set of built-in generators to create valid JSON data
    ▪ CQLSSTableWriter creates SSTables from the output of the various generators
    ▪ The SSTables are loaded onto the cluster using sstableloader
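
The generator step on this slide can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical `key_value_generator` that turns rows (shaped like the result of an offline query) into one JSON document per key. The function name, the row shape, and the column names are made up for the example and are not taken from Nostos. Writing the SSTables with CQLSSTableWriter (a Java class) and loading them with sstableloader are not shown.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def key_value_generator(
    rows: Iterable[Dict[str, object]],
    key_column: str,
    value_columns: List[str],
) -> Iterator[Tuple[str, str]]:
    """Yield (key, json_value) pairs, one per input row.

    Hypothetical stand-in for a built-in generator: it checks that the key
    is present and serializes the selected columns as a JSON object.
    """
    for row in rows:
        if row.get(key_column) is None:
            raise ValueError(f"row is missing key column {key_column!r}")
        value = {column: row.get(column) for column in value_columns}
        # sort_keys keeps the serialized output deterministic across runs.
        yield str(row[key_column]), json.dumps(value, sort_keys=True)


# Made-up example rows standing in for the output of an offline job.
rows = [
    {"user_id": 1, "language": "en", "score": 0.9},
    {"user_id": 2, "language": "es", "score": 0.4},
]
pairs = list(key_value_generator(rows, "user_id", ["language", "score"]))
```

Each pair would then become one row handed to the SSTable writer, so validation failures surface before anything is loaded onto the cluster.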

  9. Architecture
    ▪ Every job run is stored as a new table named job_name_version
    ▫ Once data loading is complete, the table is immutable
    ▪ A thin REST layer sits on top of Cassandra so that other services can fetch data for particular keys
    ▪ The Scala client fetches only the fields that are needed
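
The job_name_version naming scheme can be illustrated with a small sketch. It assumes versions are plain integers and that job names are lowercase identifiers; the slide does not specify either, and both helper functions are hypothetical.

```python
import re
from typing import Iterable, Optional


def versioned_table_name(job_name: str, version: int) -> str:
    """Build the table name for one job run, e.g. 'user_profile_42'."""
    # Assumption: job names are lowercase identifiers.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", job_name):
        raise ValueError(f"invalid job name: {job_name!r}")
    return f"{job_name}_{version}"


def latest_version(job_name: str, tables: Iterable[str]) -> Optional[int]:
    """Return the highest version among the tables that belong to job_name."""
    pattern = re.compile(re.escape(job_name) + r"_(\d+)")
    versions = [int(m.group(1)) for t in tables if (m := pattern.fullmatch(t))]
    return max(versions) if versions else None


# Made-up table names for illustration.
tables = ["user_profile_1", "user_profile_2", "search_features_7"]
current = latest_version("user_profile", tables)
```

Because a loaded table never changes, a read layer only needs to know which version is current; switching versions or rolling back is then a matter of pointing at a different table name.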

  10. Architecture
    ▪ The last 3 versions of each dataset are kept in the cluster to allow for rollbacks
    ▪ Compactions are not necessary, as each key should appear in only one SSTable
    ▪ Once a new version of the data is available, the loader drops the oldest version
    ▪ An internal dashboard is available for exploring data, performing rollbacks, etc.
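
The retention rule on this slide can be sketched as below. This is an illustration only: `plan_retention` and the table name `user_profile` are hypothetical, and integer versions are an assumption. The only facts taken from the slide are that the last 3 versions are kept and older ones are dropped.

```python
from typing import List, Tuple

KEEP_VERSIONS = 3  # from the slide: the last 3 versions stay in the cluster


def plan_retention(
    versions: List[int], keep: int = KEEP_VERSIONS
) -> Tuple[List[int], List[int]]:
    """Split versions into (kept, dropped), keeping the `keep` newest."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(versions)
    return ordered[-keep:], ordered[:-keep]


# After version 4 finishes loading, version 1 becomes eligible for removal.
kept, dropped = plan_retention([4, 1, 3, 2])
drop_statements = [f"DROP TABLE IF EXISTS user_profile_{v}" for v in dropped]
```

Rolling back under this scheme means serving one of the kept versions again, which is why more than one version stays on the cluster.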

  11. Architecture

  12. Data Products
    ▪ User Profile
    ▫ Internal job that aggregates, in one place, all the features about a user that are used in online prediction
    ▫ Stores inferred/computed features from multiple data stores
    ▫ What languages does this user speak?
    ▸ Subtitles, Browser, Enrollments
    ▫ Affinity to particular domains?
    ▸ How interested are you in data science classes?
    ▫ These are then used in ranking your recommendations and search results.

  13. Wins / Problems
    ▪ Nostos has helped us iterate faster when building data products, as features are shared and easily available.
    ▪ It is fairly easy to use CQLSSTableWriter and sstableloader
    ▪ Use composite keys to allow accessing only particular fields
    ▪ Downside: latency and staleness, as you cannot load data in real time
    ▪ Need to clear snapshots from each node after dropping tables
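
The composite-key point can be illustrated with a sketch. The schema below is an assumption, not the Nostos schema: the column names `item_key`, `field_name`, and `field_value` are hypothetical, chosen so that each field of a key is its own row and a query can name just the fields it needs. The helper only builds a CQL string with bind markers and does not talk to a cluster.

```python
from typing import List, Sequence, Tuple

# Hypothetical schema: item_key is the partition key and field_name is a
# clustering column, so one row holds one field of one key.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS {table} (
    item_key text,
    field_name text,
    field_value text,
    PRIMARY KEY (item_key, field_name)
)
"""


def build_field_query(
    table: str, key: str, fields: Sequence[str]
) -> Tuple[str, List[str]]:
    """Return (cql, params) selecting only the requested fields of one key."""
    if not fields:
        raise ValueError("at least one field is required")
    # One bind marker per requested field.
    markers = ", ".join("?" for _ in fields)
    cql = (
        f"SELECT field_name, field_value FROM {table} "
        f"WHERE item_key = ? AND field_name IN ({markers})"
    )
    return cql, [key, *fields]


# Made-up table, key, and field names for illustration.
cql, params = build_field_query(
    "user_profile_42", "user:1", ["languages", "domain_affinity"]
)
```

With a layout like this, a client that needs two fields reads two rows from one partition instead of deserializing the whole record.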

  14. Thank You
    Questions?

  15. coursera.org/jobs
    building.coursera.org
    @CourseraEng
