Slide 1

Slide 1 text

C* Batch loading for Data Products
Sourabh Bajaj, Software Engineer
14 July, 2016

Slide 2

Slide 2 text

About me
● Georgia Tech, CS
● Analytics at Coursera
● @sb2nov
Machine Learning, Distributed Systems

Slide 3

Slide 3 text

Agenda
● Motivation
● Architecture
● Data Products

Slide 4

Slide 4 text

Motivation
▪ Bridging the Production Gap
  ▫ Decoupling data scientists from product teams so they can push results to production
  ▫ Easy to publish computed results online
▪ Programmatic access to data generated offline
[Diagram labels: Analyst, Model iteration, ?, Product Teams, Product iteration]

Slide 5

Slide 5 text

Motivation
▪ Moving data from Redshift to Cassandra
▪ A few use cases in production:
  ▫ Recommendation features
  ▫ Search ranking features
  ▫ Language localization
  ▫ Featured lists on the homepage
  ▫ Instructor dashboards

Slide 6

Slide 6 text

Architecture
[Diagram labels: Nostos service, Key/Value, Redshift, Nostos Loader, API, Product backend]

Slide 7

Slide 7 text

Architecture

Slide 8

Slide 8 text

Architecture
▪ Integrated with the internal workflow manager dataduct for scheduling jobs and managing resources
▪ The offline jobs are specified as YAML files and use a set of built-in generators to create valid JSON data
▪ CQLSSTableWriter creates SSTables from the output of the various generators
▪ The SSTables are loaded onto the cluster using sstableloader
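The loading steps above can be sketched roughly as follows. This is a Python illustration, not the Nostos loader: the job fields (`job_name`, `version`, `keyspace`), the key/field/value table layout, and every helper name are hypothetical. It only builds the two CQL strings that a CQLSSTableWriter is configured with (a `CREATE TABLE` schema and a parameterized `INSERT`) plus an `sstableloader` command line; it does not write SSTables itself.

```python
import json

# Hypothetical job spec, as it might look after parsing a YAML job file.
JOB = {"job_name": "user_profile", "version": 42, "keyspace": "nostos"}


def table_name(job):
    # The slides name each run's table job_name_version.
    return f"{job['job_name']}_{job['version']}"


def schema_cql(job):
    # Assumed layout: one row per (key, field) holding a JSON-encoded value.
    return (
        f"CREATE TABLE {job['keyspace']}.{table_name(job)} ("
        "key text, field text, value text, "
        "PRIMARY KEY ((key), field))"
    )


def insert_cql(job):
    # Parameterized insert with one bind marker per column.
    return (
        f"INSERT INTO {job['keyspace']}.{table_name(job)} "
        "(key, field, value) VALUES (?, ?, ?)"
    )


def rows_from_record(key, record):
    # Stand-in "generator": turn one offline record into rows of valid JSON.
    for field, value in sorted(record.items()):
        yield (key, field, json.dumps(value))


def sstableloader_command(job, hosts, output_dir):
    # sstableloader is pointed at a directory ending in <keyspace>/<table>.
    path = f"{output_dir}/{job['keyspace']}/{table_name(job)}"
    return ["sstableloader", "-d", ",".join(hosts), path]


if __name__ == "__main__":
    print(schema_cql(JOB))
    print(insert_cql(JOB))
    for row in rows_from_record("user:1", {"languages": ["en", "hi"]}):
        print(row)
    print(" ".join(sstableloader_command(JOB, ["10.0.0.1"], "/tmp/sstables")))
```

In a real loader the rows would be passed to the writer on the JVM side; the host address and output directory here are placeholders.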

Slide 9

Slide 9 text

Architecture
▪ Every job run is stored as a new table job_name_version
  ▫ Once data loading is complete, the table is immutable
▪ A thin REST layer sits on top of Cassandra so other services can fetch data for particular keys
▪ The Scala client fetches only the fields that are needed
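A minimal in-memory sketch of the versioned, immutable tables and the field-selective read described above. The class and method names (`VersionedStore`, `publish`, `fetch`) are hypothetical and stand in for the Cassandra tables and the REST layer; the assumption that a newly published version becomes the one served is also ours.

```python
from types import MappingProxyType


class VersionedStore:
    """In-memory stand-in for the cluster: one read-only table per job run."""

    def __init__(self):
        self._tables = {}  # table name -> read-only mapping of key -> fields
        self._active = {}  # job name -> version currently served

    def publish(self, job_name, version, rows):
        name = f"{job_name}_{version}"
        if name in self._tables:
            raise ValueError(f"{name} is already loaded and cannot be changed")
        # Freeze the table and each row's field map (values are not deep-frozen).
        self._tables[name] = MappingProxyType(
            {key: MappingProxyType(dict(fields)) for key, fields in rows.items()}
        )
        # Assumption: a version is served as soon as its load completes.
        self._active[job_name] = version

    def fetch(self, job_name, key, fields):
        # Return only the requested fields for one key from the active version.
        table = self._tables[f"{job_name}_{self._active[job_name]}"]
        row = table.get(key, {})
        return {f: row[f] for f in fields if f in row}


if __name__ == "__main__":
    store = VersionedStore()
    store.publish("user_profile", 1, {"u1": {"languages": ["en"], "affinity": 0.7}})
    print(store.fetch("user_profile", "u1", ["languages"]))
```

Because a published table is never modified, a reader either sees a whole version or none of it, which is what makes switching versions safe.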

Slide 10

Slide 10 text

Architecture
▪ The last 3 versions of each dataset are kept in the cluster to allow for rollbacks
▪ Compactions are not necessary, as each key should appear in only one SSTable
▪ Once a new version of the data is available, the loader drops the oldest version
▪ Internal dashboard for exploring data, rollbacks, etc.
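The retention rule above can be sketched as a small helper. It assumes versions are sortable integers and that old versions are removed with a plain `DROP TABLE`; the function names and the keyspace are hypothetical.

```python
def versions_to_drop(existing_versions, keep=3):
    """Return the versions to drop, oldest first, keeping the newest `keep`."""
    ordered = sorted(existing_versions)
    return ordered[: max(len(ordered) - keep, 0)]


def drop_statements(keyspace, job_name, existing_versions, keep=3):
    # One DROP per stale version, using the job_name_version table naming.
    return [
        f"DROP TABLE IF EXISTS {keyspace}.{job_name}_{version}"
        for version in versions_to_drop(existing_versions, keep)
    ]


if __name__ == "__main__":
    # After version 4 is loaded, version 1 falls outside the last 3.
    for statement in drop_statements("nostos", "user_profile", [1, 2, 3, 4]):
        print(statement)
```

A rollback then amounts to pointing reads at one of the retained older versions.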

Slide 11

Slide 11 text

Architecture

Slide 12

Slide 12 text

Data Products
▪ User Profile
  ▫ Internal job that aggregates, in one place, all the features about a user that are used in online prediction
  ▫ Stores inferred/computed features from multiple data stores
  ▫ What languages does this user speak?
    ▸ Subtitles, Browser, Enrollments
  ▫ Affinity to particular domains?
    ▸ How interested are you in data science classes?
  ▫ These are then used in ranking your recommendations and search results.
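To make the user profile idea concrete, here is a toy aggregation sketch. The slides do not say how the signals are combined, so the voting scheme, the affinity formula, and all names (`infer_languages`, `domain_affinity`, `build_profile`) are assumptions for illustration only.

```python
from collections import Counter


def infer_languages(subtitles, browser, enrollments):
    """Rank languages by how many signal sources mention them."""
    votes = Counter()
    for source in (subtitles, browser, enrollments):
        votes.update(set(source))  # each source votes at most once per language
    # Most votes first; ties broken alphabetically for a stable order.
    return [lang for lang, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]


def domain_affinity(enrolled_domains):
    """Share of a user's enrollments that fall in each domain."""
    total = len(enrolled_domains)
    if total == 0:
        return {}
    return {domain: count / total for domain, count in Counter(enrolled_domains).items()}


def build_profile(user_id, subtitles, browser, enrollment_langs, enrolled_domains):
    # One record per user, ready to be written out as a row of features.
    return {
        "user_id": user_id,
        "languages": infer_languages(subtitles, browser, enrollment_langs),
        "affinity": domain_affinity(enrolled_domains),
    }


if __name__ == "__main__":
    profile = build_profile(
        "u1",
        subtitles=["en", "es"],
        browser=["en"],
        enrollment_langs=["en", "hi"],
        enrolled_domains=["data-science", "data-science", "business", "arts"],
    )
    print(profile)
```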

Slide 13

Slide 13 text

Wins / Problems
▪ Nostos has helped us iterate faster when building data products, because features are shared and easily available.
▪ CQLSSTableWriter and sstableloader are fairly easy to use
▪ Use composite keys to allow accessing only particular fields
▪ Latency / staleness downside, as you cannot load data in real time
▪ Need to clear snapshots from each node after dropping tables

Slide 14

Slide 14 text

Thank You
Questions?

Slide 15

Slide 15 text

coursera.org/jobs
building.coursera.org
@CourseraEng