Cassandra Batch loading for Data Products

C* Batch loading for Data Products
Sourabh Bajaj, Software Engineer
14 July, 2016

About me
• Georgia Tech, CS
• Analytics at Coursera
• @sb2nov
• Machine Learning, Distributed Systems

Agenda
• Motivation
• Architecture
• Data Products

Motivation
▪ Bridging the Production Gap
  ▫ Decoupling data scientists from product teams so they can push results to production themselves
  ▫ Easy to publish computed results online
▪ Programmatic access to data generated offline
[Diagram labels: Analyst (model iteration), ?, Product Teams (product iteration)]

Motivation
▪ Moving data from Redshift to Cassandra
▪ A few use cases in production:
  ▫ Recommendation features
  ▫ Search ranking features
  ▫ Language localization
  ▫ Featured lists on the homepage
  ▫ Instructor dashboards

Architecture
[Diagram labels: Redshift, Nostos Loader, Nostos service (Key/Value), API, Product backend]

Architecture

Architecture
▪ Integrated with the internal workflow manager dataduct for scheduling jobs and managing resources
▪ The offline jobs are specified as YAML files and use a set of built-in generators to create valid JSON data
▪ CQLSSTableWriter is used to create SSTables from the output of the various generators
▪ The SSTables are loaded onto the cluster using sstableloader
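The last steps of this pipeline can be sketched roughly as follows. This is a minimal illustration, not the Nostos implementation: `to_json_rows` and `sstableloader_command` are hypothetical helper names, the record layout is assumed, and the SSTable-writing step is omitted because CQLSSTableWriter is a Java API. The only real tool invoked is `sstableloader`, which takes the contact nodes via `-d` and a directory whose last two path components are the keyspace and the table.

```python
import json


def to_json_rows(records, key_field):
    """Turn generator output (a list of dicts) into (key, json_value) pairs.

    Assumption: each record carries its lookup key in `key_field`;
    everything else is serialized as the JSON value.
    """
    rows = []
    for record in records:
        key = str(record[key_field])
        value = {k: v for k, v in record.items() if k != key_field}
        rows.append((key, json.dumps(value, sort_keys=True)))
    return rows


def sstableloader_command(hosts, base_dir, keyspace, table):
    """Build (but do not run) the sstableloader invocation.

    sstableloader expects a directory ending in <keyspace>/<table>.
    """
    sstable_dir = f"{base_dir.rstrip('/')}/{keyspace}/{table}"
    return ["sstableloader", "-d", ",".join(hosts), sstable_dir]
```

The returned list could be handed to `subprocess.run` once the SSTables have been written to that directory.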

Architecture
▪ Every job run is stored as a new table, job_name_version
  ▫ Once data loading is complete, the table is immutable
▪ A thin REST layer sits on top of Cassandra so that other services can fetch data for particular keys
▪ The Scala client only fetches the fields that are needed
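The versioned-table lookup can be sketched as below. This is an illustration under stated assumptions: `versioned_table` and `fetch_fields` are hypothetical names, and a nested dict stands in for Cassandra so the example stays self-contained. It is written in Python for brevity even though the actual client is in Scala.

```python
import re

# Assumption: job names are restricted to lowercase identifiers.
_JOB_NAME = re.compile(r"^[a-z][a-z0-9_]*$")


def versioned_table(job_name, version):
    """Return the table name for one job run, following job_name_version."""
    if not _JOB_NAME.match(job_name):
        raise ValueError(f"invalid job name: {job_name!r}")
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{job_name}_{version}"


def fetch_fields(store, job_name, version, key, fields):
    """Fetch only the requested fields for one key.

    `store` maps table name -> key -> {field: value} and is an
    in-memory stand-in for the cluster.
    """
    row = store[versioned_table(job_name, version)].get(key, {})
    return {field: row[field] for field in fields if field in row}
```

Because a loaded table is never modified, a reader that pins a version always sees a consistent snapshot of that job run.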

Architecture
▪ The last 3 versions of each dataset are kept in the cluster to allow for rollbacks
▪ Compactions are not necessary, as each key should appear in only one SSTable
▪ Once a new version of the data is available, the loader drops the oldest version
▪ Internal dashboard for exploring data, rollbacks, etc.
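The retention rule can be sketched in a few lines. This is a hypothetical illustration: the function names and the use of integer version numbers are assumptions, not details from the talk.

```python
def versions_to_drop(existing_versions, keep=3):
    """Return the versions the loader should drop, oldest first,
    so that only the newest `keep` versions remain."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(existing_versions)
    return ordered[:-keep]


def rollback_target(existing_versions, current):
    """Return the newest version older than `current`, or None if none is left."""
    older = [v for v in existing_versions if v < current]
    return max(older) if older else None
```

With versions 5 through 8 loaded, `versions_to_drop([5, 6, 7, 8])` selects only version 5, and a rollback from version 8 lands on version 7.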

Architecture

Data Products
▪ User Profile
  ▫ Internal job that aggregates, in one place, all the features about a user that are used in online prediction
  ▫ Stores inferred/computed features from multiple data stores
  ▫ What languages does this user speak?
    ▸ Subtitles, Browser, Enrollments
  ▫ Affinity to particular domains?
    ▸ How interested are you in data science classes?
  ▫ These are then used in ranking your recommendations and search results.
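As an illustration of the language feature, the sketch below merges signals from the three sources named above. The ranking rule (count how many sources mention a language, break ties alphabetically) is an assumption made for this example; the talk does not say how the signals are actually combined, and `infer_languages` is a hypothetical name.

```python
from collections import Counter


def infer_languages(signals):
    """Rank language codes by how many sources mention them.

    `signals` maps a source name (e.g. "subtitles", "browser",
    "enrollments") to the language codes observed in that source.
    """
    counts = Counter()
    for langs in signals.values():
        # Count each language at most once per source.
        for lang in set(langs):
            counts[lang] += 1
    return sorted(counts, key=lambda lang: (-counts[lang], lang))
```

The resulting list would be stored as one field of the user profile, next to other computed features such as domain affinity.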

Wins / Problems
▪ Nostos has really helped us iterate faster when building data products, as features are shared and easily available.
▪ It is fairly easy to use CQLSSTableWriter and sstableloader
▪ Use composite keys to allow accessing only particular fields
▪ Latency / staleness is a downside, as you cannot load data in real time
▪ Need to clear snapshots from each node after dropping tables
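One way to read the composite-key point is a table keyed by (key, field), so that a single field can be fetched without reading the whole row. The sketch below only builds CQL strings; the column names, the `text` types, and the helper names are assumptions for illustration, not the schema used by Nostos.

```python
def create_table_cql(keyspace, table):
    """CQL for a table where `key` is the partition key and `field`
    is a clustering column, so each field is stored as its own row."""
    return (
        f"CREATE TABLE {keyspace}.{table} ("
        "key text, field text, value text, "
        "PRIMARY KEY (key, field))"
    )


def select_fields_cql(keyspace, table, n_fields):
    """Parameterized CQL that reads only `n_fields` fields of one key."""
    if n_fields < 1:
        raise ValueError("at least one field is required")
    placeholders = ", ".join("?" for _ in range(n_fields))
    return (
        f"SELECT field, value FROM {keyspace}.{table} "
        f"WHERE key = ? AND field IN ({placeholders})"
    )
```

The bound parameters would be the key followed by the requested field names, which keeps the read limited to the fields a caller actually needs.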

Thank You Questions?

coursera.org/jobs building.coursera.org @CourseraEng
