Cassandra Batch loading for Data Products

Slide 1

Slide 1 text

C* Batch loading for Data Products
Sourabh Bajaj, Software Engineer
14 July, 2016

Slide 2

Slide 2 text

About me
● Georgia Tech, CS
● Analytics at Coursera
● @sb2nov
Machine Learning
Distributed Systems

Slide 3

Slide 3 text

Agenda
● Motivation
● Architecture
● Data Products

Slide 4

Slide 4 text

Motivation
▪ Bridging the Production Gap
  ▫ Letting data scientists push results to production on their own
  ▫ Easy to publish computed results online
▪ Programmatic access to data generated offline
[Diagram: Analyst (model iteration), "?", Product Teams (product iteration)]

Slide 5

Slide 5 text

Motivation
▪ Moving data from Redshift to Cassandra
▪ A few use cases in production:
  ▫ Recommendation features
  ▫ Search ranking features
  ▫ Language localization
  ▫ Featured lists on the homepage
  ▫ Instructor dashboards

Slide 6

Slide 6 text

Architecture
[Diagram labels: Nostos service, Key/Value, Redshift, Nostos Loader, API, Product backend]

Slide 7

Slide 7 text

Architecture

Slide 8

Slide 8 text

Architecture
▪ Integrated with the internal workflow manager dataduct for scheduling jobs and managing resources
▪ The offline jobs are specified as YAML files and use a set of built-in generators to create valid JSON data
▪ CQLSSTableWriter creates SSTables from the output of the various generators
▪ The SSTables are loaded onto the cluster using sstableloader
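The generator and loading steps above can be sketched roughly as follows. This is a minimal illustration, not the Nostos implementation: the function names, the record layout ("key" plus "fields"), the host addresses, and the directory path are all assumptions. Writing the SSTables themselves happens on the JVM with CQLSSTableWriter and is not shown; the second function only builds the sstableloader command line, whose -d option takes the initial contact hosts.

```python
import json
from typing import Dict, Iterable, Iterator, List


def key_value_generator(rows: Iterable[Dict[str, object]], key_column: str) -> Iterator[str]:
    """Hypothetical generator: emit one JSON record per input row.

    The {"key": ..., "fields": {...}} layout is an assumption for illustration.
    """
    for row in rows:
        fields = {name: value for name, value in row.items() if name != key_column}
        yield json.dumps({"key": str(row[key_column]), "fields": fields}, sort_keys=True)


def sstableloader_command(hosts: List[str], sstable_dir: str) -> List[str]:
    """Build the argv for sstableloader without running it.

    sstableloader reads the last two path components of the directory
    as the keyspace and table names.
    """
    return ["sstableloader", "-d", ",".join(hosts), sstable_dir]


# Example rows as they might come out of a Redshift query (made-up data).
rows = [
    {"user_id": 1, "language": "en", "ds_affinity": 0.8},
    {"user_id": 2, "language": "es", "ds_affinity": 0.1},
]
records = list(key_value_generator(rows, key_column="user_id"))
command = sstableloader_command(["10.0.0.1", "10.0.0.2"], "/tmp/sstables/nostos/user_profile_42")
```

The command list could be handed to subprocess.run once the SSTable directory exists; keeping it as a list avoids shell quoting problems.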

Slide 9

Slide 9 text

Architecture
▪ Every job run is stored as a new table job_name_version
  ▫ Once data loading is complete, it is immutable
▪ Thin REST layer that sits on top of Cassandra for other services to fetch data for particular keys
▪ The Scala client only fetches the fields that are needed
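A rough sketch of how a versioned table name and a field-restricted read might fit together. The schema is an assumption (columns key, field, value with PRIMARY KEY (key, field)), and the helper names are hypothetical; the slides only state the job_name_version naming and that clients fetch just the fields they need.

```python
import re


def versioned_table(job_name: str, version: int) -> str:
    """Return the table name for one job run, following job_name_version."""
    # Restrict job names to plain identifiers so the name is safe to embed in CQL.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", job_name):
        raise ValueError(f"invalid job name: {job_name!r}")
    return f"{job_name}_{version}"


def select_fields_cql(job_name: str, version: int, num_fields: int) -> str:
    """Build a prepared-statement string that reads only the requested fields.

    Assumes a table with columns (key, field, value) and PRIMARY KEY (key, field).
    """
    if num_fields < 1:
        raise ValueError("at least one field must be requested")
    placeholders = ", ".join(["?"] * num_fields)
    table = versioned_table(job_name, version)
    return f"SELECT field, value FROM {table} WHERE key = ? AND field IN ({placeholders})"


statement = select_fields_cql("user_profile", 7, 2)
```

The bound values would be the key followed by the field names, for example ("user-123", "languages", "ds_affinity").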

Slide 10

Slide 10 text

Architecture
▪ The last 3 versions of each dataset are kept in the cluster to allow for rollbacks
▪ Compactions are not necessary, as each key should only appear in one SSTable
▪ Once a new version of the data is available, the loader drops the oldest version
▪ Internal dashboard for exploring data, rollbacks, etc.
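The retention rule above can be sketched as a small planning function. This is only an illustration under stated assumptions: versions are taken to be increasing integers, and the function name, the table name user_profile, and the generated DROP statements are hypothetical rather than taken from the loader.

```python
from typing import List, Tuple

KEEP_VERSIONS = 3  # the slides state that the last 3 versions are kept


def plan_retention(existing: List[int], new_version: int,
                   keep: int = KEEP_VERSIONS) -> Tuple[List[int], List[int]]:
    """Return (versions to keep, versions to drop) after a new load completes."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    all_versions = sorted(set(existing) | {new_version})
    # Everything older than the newest `keep` versions is dropped.
    return all_versions[-keep:], all_versions[:-keep]


kept, dropped = plan_retention([4, 5, 6], new_version=7)
drop_statements = [f"DROP TABLE IF EXISTS user_profile_{v}" for v in dropped]

# A rollback would point readers at the previous kept version.
rollback_target = kept[-2] if len(kept) > 1 else None
```

Because each version is an immutable table, a rollback in this sketch is only a change of which table name the read layer uses; no data is rewritten.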

Slide 11

Slide 11 text

Architecture

Slide 12

Slide 12 text

Data Products
▪ User Profile
  ▫ Internal job that aggregates all the features used in online prediction about a user in one place
  ▫ Stores inferred/computed features from multiple data stores
  ▫ What languages does this user speak?
    ▸ Subtitles, Browser, Enrollments
  ▫ Affinity to particular domains?
    ▸ How interested are you in data science classes?
  ▫ These are then used in ranking your recommendations and search results.

Slide 13

Slide 13 text

Wins / Problems
▪ Nostos has helped us iterate faster when building data products, because features are shared and easily available.
▪ It is fairly easy to use CQLSSTableWriter and sstableloader
▪ Use composite keys to allow accessing only particular fields
▪ Latency / staleness downside, as you cannot load data in real time
▪ Need to clear snapshots from each node after dropping tables
