Slide 1

Slide 1 text

No content

Slide 2

In-Memory, Component-Based Recommender Architecture
June 25th, 2019

Slide 3

Who are We?
● E-commerce founded in late 2014
○ Internal Engineering founded in early 2015
○ Launched our in-house website in mid-2015, app in late 2015
● Concentrated on women's fashion
○ Around the Rp 100K range as opposed to the > Rp 200K range
○ One of the top fashion e-commerce companies in Indonesia
● Recently rebranded to Sorabel in January 2019

Slide 4


Google Play App Rankings (5th)

Slide 5

Talk Overview
● Previous Recommender Development Workflow
○ Exploration, Computation, Serving
● New Recommender Development Workflow
○ Exploration, Computation, Serving
○ The Impact
● In-Memory Recommender Architecture
○ In-Depth Architecture
○ Pros & Cons

Slide 6

What is a Recommender?
● Software that tries to optimize future actions of its users
○ Personalized: based on a specific user’s past behavior
○ Non-personalized: based on the aggregate non-user contextual model

Slide 7

Recommenders in Sorabel
● We have recommenders for many things:
○ For our buyers: Restock, Budget, Trend Recommenders
○ For our warehouse: Order-Item Picking Route Optimizer, Item Placement Optimizer, Logistics Partner Order Allocator
○ For our customers: Product Recommender
● In this talk, we focus on the Product Recommender

Slide 8

Uses of Product Recommender
● Home Feed
● Catalogue Feed
● Similar Products
● Search, and others!

Slide 9

Development Workflow
● Exploration
● Computation
● Serving

Slide 10

Development Workflow
● Exploration
○ Data scientists sift through the data to derive deeper insights by looking beyond the basic metrics
● Computation
○ Which classes of algorithm are computationally feasible, and how each model should be built, fit, validated, and then recomputed regularly
● Serving
○ How models are ultimately stored post-computation and served in production to our millions of users

Slide 11


Previous Development Workflow

Slide 12

Previous Recommender Architecture
● BigQuery: data warehouse, fed from multiple data sources -- MySQL, Cassandra, Kafka
● Apache Spark: engine for large-scale data processing
● Dataproc: runs Apache Spark jobs inside GCP
● ElasticSearch: an open-source, distributed search engine

Slide 13

Exploration
● Data scientists explore and analyze the data, trying to build the right model / heuristics
○ Most of their explorations are done in Python
● This is usually done locally → limited hardware resources
○ Inherently lower limit to the size of data during experimentation → data scientists are then limited to “toy” datasets during this stage
○ Harder for data scientists to collaborate

Slide 14

Computation
● Data engineers translate data scientists’ work into appropriate Spark jobs
○ Computation was mostly done inside Google Dataproc
● Data engineers make the changes necessary for the model to be production-ready
○ For example, dummy dataset vs production-scale dataset
○ Long back-and-forth feedback loop between data scientists & engineers
● Recommendations were largely precomputed at a less-than-optimal scope: at the feed level, done daily
○ Computation and disk writes take a long time (+ storage costs!)
○ Low usage rate → not everyone visits their precomputed feed on a daily basis

Slide 15

Serving
● Production read path that serves the actual models / recommendations to our users
● A dual-layer architecture:
○ Highly-stateful data storage layer -- ElasticSearch
○ Stateless (horizontally-scaled) REST API servers that mostly read from the stateful layer with minimal post-processing
● Implemented and maintained by backend engineers

Slide 16

Recap: Previous Workflow Problems
● Exploration is usually done locally
○ Local resources are limited
○ Data scientists usually play with tiny subsets of data to get the work done locally
○ Harder for data scientists to collaborate
● Going back and forth between data scientists and data engineers took longer than it should
● Long indexing time (daily job took ~4-8 hrs)
● Non-trivial cost and complexity in the serving infra (Dataproc + ElasticSearch)

Slide 17


New Development Workflow

Slide 18

Exploration
● Data scientists can utilize Sciencebox
○ JupyterLab running in dedicated containers on top of Kubernetes for each data scientist
○ Instant access to powerful resources
■ Large core count & RAM
■ GPU if needed
● No longer need to play around with tiny subsets of data
● Easier to collaborate, share, and evaluate work between data scientists

Slide 19

Slide 20

Computation
● Introducing DataQuery
○ A platform where anyone can build their own derived tables easily
○ A derived table is a composite-data table -- a table whose data is composed from multiple tables
■ From raw tables / other derived tables
■ Mostly defined by a SQL query
■ Editable data refresh frequency
○ Built on top of Google BigQuery

Slide 21

Slide 22

Computation
● Frequently, simpler models can be realized by using just DataQuery
○ No need for any Spark jobs in most cases
○ Data scientists can do this independently

Slide 23

Serving
● Serving infrastructure is now a single-layer, Go-based in-memory service (IM for short)
○ We load the “models” from DataQuery (or any other data) into the service’s resident memory as “Components”, conditionally at startup or as needed
○ Components are built on top of each other into a more complete and capable “component tree” that then serves the actual recommendations as a group

Slide 24

Serving
● Serving infrastructure is now a single-layer, Go-based in-memory service (IM for short)
○ Additional computations (including but not limited to inter-model component stitching, data re-sorting, realtime-data-sensitive logic, etc.) can be done within the Components on-request, on-the-fly
○ A centralized “component registry” handles caching / re-computation of different parts of the component tree for best performance with little manual work, not dissimilar to React’s Virtual DOM concept used in user interfaces
○ A much larger chunk of the user-specific recommendation computation can now be done on-the-fly, only when the user comes to the site

Slide 25

Serving
● Backend engineers implement the components
○ However, due to its simplicity, data scientists often implement components by themselves
○ A data scientist’s workflow on a new feature is now very simple:
i. Play around with algorithms & data in Sciencebox
ii. Write the production version of the algorithm as a DataQuery table
iii. “Wrap” the DataQuery model in an IM `Component`
○ We’ll talk about how this works more in-depth later on

Slide 26

Workflow Comparison
● Previous Workflow: Exploration → Computation → Serving, split between Data Scientists and Data Engineers / Backend Engineers
● New Workflow: Exploration → Computation → Serving, owned end-to-end by Data Scientists

Slide 27

Architecture Comparison
● Previous Architecture: (diagram)
● New Architecture: (diagram)

Slide 28


In-Memory Recommender In-Depth

Slide 29

The IM Architecture
● Data Component
● Registry
● Cache
● Endpoint
● All within a single service

Slide 30


The IM Architecture

Slide 31

Data Component
● Responsible for processing data on-the-fly
● May depend on external data sources (BigQuery, etc.), other Data Components, or both
● The resulting data can be used directly by other components

Slide 32

Data Component
● CachePolicy defines the caching configuration -- more on this later
● GetKey serializes the args into a string -- used as the cache key
● GetData processes data with the given args
○ Fetcher is a utility interface for interacting with other components and data sources
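The actual interface is on the (image-only) example slides, so here is a minimal Go sketch of what a Data Component contract with the three methods above could look like. All names (`Component`, `Fetcher`, `TrendingProductsComponent`) and the field layout of `CachePolicy` are illustrative assumptions, not the production API.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// CachePolicy mirrors the fields described on the Cache Policy slide.
type CachePolicy struct {
	Expiration   time.Duration // data TTL
	MaxEntrySize int           // max number of cached entries
	NoCache      bool          // if true, GetData is called every time
	MustPresent  bool          // persistent (true) vs volatile (false) cache
}

// Fetcher lets a component pull data from other components or data sources.
type Fetcher interface {
	Fetch(c Component, args map[string]string) (interface{}, error)
}

// Component is the building block of the in-memory recommender.
type Component interface {
	CachePolicy() CachePolicy
	GetKey(args map[string]string) string
	GetData(f Fetcher, args map[string]string) (interface{}, error)
}

// TrendingProductsComponent is a toy component implementing the contract.
type TrendingProductsComponent struct{}

func (TrendingProductsComponent) CachePolicy() CachePolicy {
	return CachePolicy{Expiration: time.Hour, MaxEntrySize: 1000}
}

func (TrendingProductsComponent) GetKey(args map[string]string) string {
	// Serialize the args into a stable cache key.
	return "trending:" + args["category"]
}

func (TrendingProductsComponent) GetData(f Fetcher, args map[string]string) (interface{}, error) {
	// A real component would fetch source data via f and transform it.
	return []string{"dress-1", "blouse-7"}, nil
}

func main() {
	var c Component = TrendingProductsComponent{}
	key := c.GetKey(map[string]string{"category": "dresses"})
	data, _ := c.GetData(nil, nil)
	fmt.Println(key, strings.Join(data.([]string), ","))
}
```

The key point of the design is that a component only declares *how* to compute and key its data; caching and dependency resolution are left entirely to the registry.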

Slide 33


Data Component: Example

Slide 34


Data Component: Example

Slide 35

Registry
● Registry is the central manager responsible for making sure Components are computed, cached, and pre-computed at the right times
● It handles data caching, depending on the component’s CachePolicy
○ Uses an LRU (least recently used) eviction policy
● All fetcher.Fetch (component sub-dependency) calls go through the registry
○ It checks the cache with the key returned by the Component’s GetKey method
○ If the entry exists, it returns the cached data
○ Otherwise, it calls the component’s GetData method and updates the cache
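The registry's read path described above can be sketched in a few lines of Go. This is a deliberately minimal version under assumed names (`Registry`, `Args`, `countingComponent`): it omits LRU eviction, TTLs, and CachePolicy handling, and keeps only the key-lookup / compute-on-miss flow.

```go
package main

import (
	"fmt"
	"sync"
)

type Args map[string]string

type Component interface {
	GetKey(args Args) string
	GetData(r *Registry, args Args) (interface{}, error)
}

// Registry caches component results keyed by GetKey.
type Registry struct {
	mu    sync.Mutex
	cache map[string]interface{}
}

func NewRegistry() *Registry {
	return &Registry{cache: map[string]interface{}{}}
}

// Fetch is what components call on their sub-dependencies:
// return the cached entry on a hit, otherwise compute and store it.
func (r *Registry) Fetch(c Component, args Args) (interface{}, error) {
	key := c.GetKey(args)

	r.mu.Lock()
	if data, ok := r.cache[key]; ok {
		r.mu.Unlock()
		return data, nil // cache hit
	}
	r.mu.Unlock()

	data, err := c.GetData(r, args) // cache miss: compute
	if err != nil {
		return nil, err
	}
	r.mu.Lock()
	r.cache[key] = data
	r.mu.Unlock()
	return data, nil
}

// countingComponent records how often GetData actually runs.
type countingComponent struct{ calls int }

func (c *countingComponent) GetKey(args Args) string { return "demo:" + args["id"] }
func (c *countingComponent) GetData(r *Registry, args Args) (interface{}, error) {
	c.calls++
	return "payload-" + args["id"], nil
}

func main() {
	reg := NewRegistry()
	comp := &countingComponent{}
	reg.Fetch(comp, Args{"id": "1"})
	reg.Fetch(comp, Args{"id": "1"}) // second call is served from cache
	fmt.Println(comp.calls)          // GetData ran only once
}
```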

Slide 36

Cache Policy
● Expiration: data TTL
● MaxEntrySize: maximum number of cached data entries
● NoCache: GetData will be called at all times
● MustPresent: determines the cache type

Slide 37

Cache Types
● Persistent Cache: data must be present at all times
○ Critical data components are of this type, i.e. the service cannot run without that data. Example: ProductMapComponent
○ They are initially fetched during startup
○ If the cached data is expired, the registry returns the expired data while fetching the new data in the background
● Volatile Cache: it is okay for the data to be expired
○ Most of these are components that depend on other components

Slide 38

Cache Stampede Problem
● A cache stampede is a type of cascading failure that can occur when massively parallel computing systems with caching mechanisms come under very high load
● Example: a sudden jump in traffic (e.g. from a TVC) to a page ends up producing mostly identical data component request trees -- wasting CPU cycles and making useless downstream requests to service dependencies

Slide 39

Cache Stampede Problem
● Solution: make sure only one of the identical GetData requests is executed at the same time
○ Try to acquire a lock whenever there’s a cache miss
○ If acquiring the lock fails, wait until the lock is released. By then, the data should already be cached
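The lock-on-miss scheme above can be sketched as a toy, stdlib-only "single flight" guard in Go (golang.org/x/sync/singleflight implements the same idea). `stampedeGuard` and its method names are illustrative, not the production code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type stampedeGuard struct {
	mu       sync.Mutex
	cache    map[string]string
	inflight map[string]chan struct{} // per-key lock: closed when released
}

func newStampedeGuard() *stampedeGuard {
	return &stampedeGuard{cache: map[string]string{}, inflight: map[string]chan struct{}{}}
}

func (g *stampedeGuard) Get(key string, compute func() string) string {
	for {
		g.mu.Lock()
		if v, ok := g.cache[key]; ok {
			g.mu.Unlock()
			return v // cache hit
		}
		if done, ok := g.inflight[key]; ok {
			// Failed to acquire the lock: wait until it is released,
			// by then the data should already be cached -- re-check.
			g.mu.Unlock()
			<-done
			continue
		}
		done := make(chan struct{}) // acquired the lock on a cache miss
		g.inflight[key] = done
		g.mu.Unlock()

		v := compute() // only one goroutine per key executes this
		g.mu.Lock()
		g.cache[key] = v
		delete(g.inflight, key)
		g.mu.Unlock()
		close(done) // release the lock
		return v
	}
}

func main() {
	g := newStampedeGuard()
	var calls int32
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ { // simulate a burst of identical requests
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Get("home-feed", func() string {
				atomic.AddInt32(&calls, 1)
				return "feed-data"
			})
		}()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt32(&calls)) // the expensive computation ran once
}
```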

Slide 40


Cache Stampede Problem

Slide 41

Member Affinity Routing
● We load member-specific data only when that member arrives
● Meanwhile, we scale our service horizontally to multiple servers
○ Since the data is cached locally, the state of each server can differ
● Multiple requests by the same member can end up being served by different servers
○ The same computation will be done redundantly across multiple servers
○ This reduces the cache hit rate

Slide 42

Member Affinity Routing
● We employ custom header-based routing through Envoy -- a microservice proxy
○ Consistent routing via the X-SS-Member-ID header
● The same member will always be served by the same server
● Example:
○ A member opens the home page. One of the servers then computes the home feed specific to that member.
○ As the member scrolls down, the next requests come to the same server. As the result is already cached, it returns almost instantly.
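The core idea behind header-based affinity can be shown in a few lines: hash the member ID from the header to deterministically pick a backend. This is only a conceptual sketch; Envoy's actual hash-based load balancing (e.g. ring hash over a configured header) is more elaborate and, unlike a bare modulo, limits reshuffling when servers are added or removed.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// serverFor maps a member ID (e.g. the X-SS-Member-ID header value)
// to a server index. The same ID always yields the same server, so
// that server's locally cached state for the member stays hot.
func serverFor(memberID string, numServers int) int {
	h := fnv.New32a()
	h.Write([]byte(memberID))
	return int(h.Sum32()) % numServers
}

func main() {
	// Repeated requests from one member route to one server.
	fmt.Println(serverFor("member-42", 4) == serverFor("member-42", 4))
}
```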

Slide 43

Putting It All Together
● When we try to implement a feature, we think in terms of the building blocks
○ What data components do we need? What is the data source? How should the data be cached?
○ What are the dependencies between the components?
■ In most cases, a feature can depend on our existing components

Slide 44

Example: Search Endpoint
● Suppose that we want to create a search endpoint:
○ Users can search by keyword
○ Users can also sort the results by price, popularity, etc.

Slide 45


Example: Search Endpoint

Slide 46

Example: Search Endpoint
● The MatchedProductsForKeywords component filters the products from the Product component and sorts them based on the scores obtained from the ProductScore component
● But sorting is a CPU-heavy operation. How can we improve?

Slide 47


Example: Search Endpoint

Slide 48

Example: Search Endpoint
● We can introduce an intermediary component: ScoreSortedProducts
● The ScoreSortedProducts component pre-sorts the composite catalogue made out of Product and ProductScore. At runtime, MatchedProductsForKeywords then only needs to filter based on the keyword -- without re-sorting for every new keyword, thus saving CPU time.
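A Go sketch of this optimization, with illustrative function names and made-up sample data standing in for the real components: sort the catalogue by score once (the cached intermediary), then each keyword search is just a filter pass that preserves the pre-sorted order.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Product struct {
	Name  string
	Score float64
}

// scoreSortedProducts plays the role of the intermediary component:
// computed once per catalogue update, cached, reused for every keyword.
func scoreSortedProducts(products []Product) []Product {
	sorted := append([]Product(nil), products...) // don't mutate the input
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	return sorted
}

// matchedProductsForKeyword now only filters -- an O(n) pass that keeps
// the pre-sorted order, instead of an O(n log n) re-sort per keyword.
func matchedProductsForKeyword(sorted []Product, keyword string) []string {
	var out []string
	for _, p := range sorted {
		if strings.Contains(p.Name, keyword) {
			out = append(out, p.Name)
		}
	}
	return out
}

func main() {
	catalogue := []Product{
		{"red dress", 0.4}, {"blue dress", 0.9}, {"red blouse", 0.7},
	}
	sorted := scoreSortedProducts(catalogue)
	fmt.Println(matchedProductsForKeyword(sorted, "dress")) // [blue dress red dress]
}
```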

Slide 49


Putting It All Together

Slide 50


Putting It All Together

Slide 51

Example: Search Endpoint
● What if we want to make the search results personalized?

Slide 52


Example: Search Endpoint

Slide 53

Slide 54


Pros and Cons

Slide 55

Pros (#1)
● It is a simple yet powerful abstraction. Problems become much simpler when thinking in terms of the building blocks / components required to build the recommendation context.
● Programming overhead is further reduced when dependencies can be assumed to be in-memory and precomputed most of the time
○ No need to worry about network hops for every component dependency fetch (i.e. vs an alternative approach where, say, components are stored in a Redis Cluster)
○ A more approachable style of programming for data scientists → more end-to-end ownership by data scientists

Slide 56

Pros (#2)
● Granular components:
○ Flexible and composable
○ Easier to understand and debug
○ Work can be divided across engineers & scientists
○ Smaller unit of work → better parallelization
● In short, it boosts engineers’ and data scientists’ productivity

Slide 57

Pros (#3)
● Reduced cost (~50%); reduced the number of Dataproc workers, eliminated ElasticSearch instances
○ Even though the cost of computing instances increased
● User-specific computations are done only when the user arrives -- inherently less wasteful precomputation
● Better runtime performance (in-memory → no network latency)
○ Sub-100ms p95 latency in search and home feed
○ Automatic caching infrastructure makes us more resilient to traffic bursts

Slide 58

Pros -- the most important
● Improved business metrics in record time
○ Much faster iteration in experimentation and deployment of new recommendation algorithms
● It is fun!

Slide 59

Cons
● Slow startup time, since we fetch all the required data first
○ It now takes ~4 minutes for each startup (to load prerequisite data components)
○ We implemented a local file cache to speed up local development
● More sensitive to memory-leak problems
● It requires a lot of RAM (~4 GB upon startup, increasing as cached data builds up)
○ May pose a problem when developing locally with limited resources

Slide 60

Cons
● The slow startup time and resource limitations can be mitigated by tuning the prerequisites configuration
○ We can choose a minimal subset of data to be fetched during startup
○ Other data can be fetched gradually as the service runs -- or disabled completely during development

Slide 61


Final Notes

Slide 62

Final Notes
● This approach gives us a powerful programming and architecture model -- coupled with relevant tools like Sciencebox and DataQuery, it has:
○ greatly improved the productivity of our engineers
○ enabled the self-sufficiency of data scientists
○ improved the rate of iteration on our algorithms -- and thus our metrics and business performance

Slide 63


Thanks!

Slide 64


Q & A Session

Slide 65


We’re Hiring!

Slide 66

We’re hiring!
● Various positions are open at https://careers.sorabel.io
○ Software Engineers
○ Data Scientists
○ Product Designers
● Projects:
○ Massively diverse kinds of projects in a vertically-integrated company
○ Many different companies in one
○ Flexibility to choose projects

Slide 67

We’re hiring!
● Our engineering team is 5-15x smaller than that of the next biggest e-commerce in the rankings (we only have ~30 engineers!)
○ Each of you joining will have a massive impact on the direction of the company
● Immediately productive with:
○ Best-in-class, SV-level infrastructure
○ An infrastructure & tooling team who cares

Slide 68


Thanks! Questions?