Cassandra Batch loading for Data Products

Batch loading in Cassandra is useful for cases such as data migrations between different systems. Sourabh (a software engineer at Coursera) talks about Nostos, Coursera's batch loading service, and some of its use cases in data products such as recommendations, search, and prediction models. We cover some of the design choices and tradeoffs made in building Nostos and see how the system has evolved over the past year.

Sourabh

July 14, 2016

Transcript

  1. C* Batch Loading for Data Products. Sourabh Bajaj, Software Engineer. 14 July 2016
  2. About me • Georgia Tech, CS • Analytics at Coursera • @sb2nov • Machine Learning • Distributed Systems
  3. Agenda • Motivation • Architecture • Data Products

  4. Motivation ▪ Bridging the production gap ▫ Decoupling data scientists so they can push results to production ▫ Easy to publish computed results online ▪ Programmatic access to data generated offline (diagram: Analyst → model iteration; Product Teams → product iteration)
  5. Motivation ▪ Moving data from Redshift to Cassandra ▪ A few use cases in production: ▫ Recommendations features ▫ Search ranking features ▫ Language localization ▫ Featured lists on the homepage ▫ Instructor dashboards
  6. Architecture (diagram: Redshift → Nostos Loader → Key/Value store → Nostos service API → Product backend)

  7. Architecture

  8. Architecture ▪ Integrated with the internal workflow manager dataduct for scheduling jobs and managing resources ▪ The offline jobs are specified as YAML files and use a set of built-in generators to create valid JSON data ▪ CQLSSTableWriter is used to create SSTables from the output of the various generators ▪ The SSTables are loaded onto the cluster using sstableloader
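As a sketch of the YAML job specification described above, a job might look roughly like this; all field names (`source`, `generator`, `table`) and values here are assumptions for illustration, not dataduct's actual schema:

```yaml
# Hypothetical Nostos job spec; real dataduct fields may differ.
name: search_ranking_features
schedule: daily
source:
  redshift_query: "SELECT course_id, score FROM ranking_features"
generator: key_value          # one of the built-in generators
table: search_ranking         # version suffix appended per run
```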
  9. Architecture ▪ Every job run is stored as a new table, job_name_version ▫ Once data loading is complete, the table is immutable ▪ A thin REST layer sits on top of Cassandra for other services to fetch data for particular keys ▪ The Scala client fetches only the fields that are needed
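The versioned-table naming and field-selective reads described above can be sketched in plain Python; `table_name`, `cql_for_fields`, and the `id` column are hypothetical names for illustration, not Nostos' actual API:

```python
# Minimal sketch (assumed names) of the job_name_version table scheme
# and the thin read layer that selects only the requested fields.

def table_name(job_name: str, version: int) -> str:
    """Each job run lands in a fresh, immutable table: job_name_version."""
    return f"{job_name}_{version}"

def cql_for_fields(job_name: str, version: int, fields: list[str], key: str) -> str:
    """Build the per-key query the thin REST layer would issue,
    selecting only the fields the client asked for."""
    cols = ", ".join(fields)
    return f"SELECT {cols} FROM {table_name(job_name, version)} WHERE id = '{key}'"
```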
  10. Architecture ▪ The last 3 versions of a dataset are kept in the cluster to allow for rollbacks ▪ Compactions are not necessary, as each key should appear in only one SSTable ▪ Once a new version of the data is available, the loader drops the oldest version ▪ Internal dashboard for exploring data, rollbacks, etc.
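The retention policy above — keep the last three versions for rollbacks, drop the oldest once a new load finishes — can be illustrated with a pure-Python sketch (not Nostos code; the function name is made up):

```python
# Keep the last RETAIN versions of a dataset; drop everything older
# once a newly loaded version arrives.

RETAIN = 3

def versions_after_load(existing: list[int], new_version: int) -> tuple[list[int], list[int]]:
    """Return (kept, dropped) version lists after loading new_version."""
    all_versions = sorted(set(existing) | {new_version})
    kept = all_versions[-RETAIN:]
    dropped = all_versions[:-RETAIN]
    return kept, dropped
```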
  11. Architecture

  12. Data Products ▪ User Profile ▫ Internal job that aggregates in one place all the features about a user used in online prediction ▫ Stores inferred/computed features from multiple data stores ▫ What languages does this user speak? ▸ Subtitles, browser, enrollments ▫ Affinity to particular domains? ▸ How interested are you in data science classes? ▫ These features are then used in ranking your recommendations and search results
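The language-inference example on this slide can be sketched as combining signals from several sources; the source names and ranking rule here are illustrative assumptions, not Coursera's actual model:

```python
# Toy sketch: rank a user's languages by how many independent sources
# (e.g. subtitles, browser, enrollments) report them.

from collections import Counter

def inferred_languages(signals: dict[str, list[str]]) -> list[str]:
    """Return languages ordered by the number of sources reporting them."""
    counts = Counter(lang for langs in signals.values() for lang in set(langs))
    return [lang for lang, _ in counts.most_common()]
```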
  13. Wins / Problems ▪ Nostos has helped us iterate faster on data products, as features are shared and easily available ▪ CQLSSTableWriter and sstableloader are fairly easy to use ▪ Composite keys allow clients to access only particular fields ▪ Latency/staleness downside: data cannot be loaded in real time ▪ Snapshots need to be cleared from each node after dropping tables
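The composite-key point above can be sketched in CQL: store one row per (entity, field) so a client can read a single field without fetching the whole record. Table and column names here are illustrative, not Coursera's schema:

```sql
-- One row per (user, feature); clients SELECT a single feature cheaply.
CREATE TABLE user_features_v3 (
    user_id text,
    feature text,
    value   text,
    PRIMARY KEY (user_id, feature)
);

SELECT value FROM user_features_v3
 WHERE user_id = 'u123' AND feature = 'domain_affinity';
```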
  14. Thank You Questions?

  15. coursera.org/jobs building.coursera.org @CourseraEng