Paper reading - ROSE: Robust Caches for Amazon Product Search

ROSE: Robust Caches for Amazon Product Search Kanji Yomoda (@k-yomo)
May 2022

Confidential & Proprietary 2021
Outline
• Difficulty of caching in a search system
• ROSE
  ◦ Index Generation
  ◦ Online Retrieval
  ◦ Theoretical Analysis
  ◦ Experiments
• Summary

Difficulty of caching in a search system
• High cardinality of queries increases cache size and degrades search performance
• Typos, misspellings, and redundancy in queries lead to cache misses
  ◦ "Nike shoes", "Nike shoe", "Nike's shoe", "Nike shooes", "Shoes Nike"

Difficulty of caching in a search system
[Diagram: query variants "Nike shooes", "Nike shoes", and "Nike's shoes"; Cache; Search Backend (Search Engine, ML models)]

ROSE

ROSE
ROSE is a robust cache that maps an online query to cached queries.
[Diagram: incoming queries "Nike shoes", "Nike shooes", and "Nike footwear"; ROSE; Cache holding "Nike shoes"; Search System (Search Engine, ML models)]

ROSE - Requirements
• The cache needs to capture query similarity = robust to typos and semantic variance
• Cache size must not scale with the volume of queries
• Lookup cost needs to be constant-time

Index generation

LSH - Locality Sensitive Hashing

LSH - Definition
R: threshold
c: approximation factor
h(x) = h(y): x and y collide (= same bucket)
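For reference, here is one common textbook statement of the LSH property, written with the slide's R and c. The distance d and the probabilities p_1 and p_2 are symbols assumed here, not taken from the slide, and the paper may phrase the property in terms of similarity rather than distance.

```latex
\[
\begin{aligned}
d(x, y) \le R \;&\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(x) = h(y)\bigr] \ge p_1,\\
d(x, y) \ge cR \;&\Longrightarrow\; \Pr_{h \in \mathcal{H}}\bigl[h(x) = h(y)\bigr] \le p_2,
\end{aligned}
\qquad c > 1,\; p_1 > p_2 .
\]
```

In words: close pairs land in the same bucket with high probability, and far pairs do so with low probability.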

LSH - Minwise Hashing

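LSH - Minwise Hashing
Example: S1 = {2, 5, 7, 9}, S2 = {1, 2, 4, 7, 10}
Jaccard similarity
  S1 ∪ S2 = {1, 2, 4, 5, 7, 9, 10}
  S1 ∩ S2 = {2, 7}
  |S1 ∩ S2| / |S1 ∪ S2| = 2/7
Minwise Hashing
  shuffle(S1), shuffle(S2): order both sets by the same random permutation
  Pr(S1[0] == S2[0]) = 2/7
The collision probability of minwise hashing equals the Jaccard similarity.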
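The 2/7 figure can be checked numerically. The sketch below is only an illustration, not code from the paper: the function names, the trial count, and the seed are assumptions. It draws random permutations of S1 ∪ S2 and counts how often both sets have the same first element.

```python
import random

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def minhash_collision_rate(a, b, trials=20000, seed=0):
    """Estimate Pr[min-rank element of a == min-rank element of b]
    under a shared random permutation of the universe a | b."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    hits = 0
    for _ in range(trials):
        perm = universe[:]
        rng.shuffle(perm)                      # one shared permutation
        rank = {x: i for i, x in enumerate(perm)}
        first_a = min(a, key=rank.__getitem__)  # "S1[0]" after the shuffle
        first_b = min(b, key=rank.__getitem__)  # "S2[0]" after the shuffle
        if first_a == first_b:
            hits += 1
    return hits / trials

S1 = {2, 5, 7, 9}
S2 = {1, 2, 4, 7, 10}
exact = jaccard(S1, S2)                  # 2/7
estimate = minhash_collision_rate(S1, S2)
```

The two sets collide exactly when the first element of the permuted union lies in S1 ∩ S2, which is why the estimate converges to the Jaccard similarity.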

LSH - Minwise Hashing
Jaccard similarity

ROSE - Requirements
• The cache needs to capture query similarity => LSH (Locality Sensitive Hashing)
• Cache size must not scale with the volume of queries
• Lookup cost needs to be constant-time

Reservoir sampling algorithm
Processes a stream of m numbers and generates R uniform samples using only an array of size R

Reservoir sampling algorithm
Example: choose R = 1 person uniformly at random from a stream of unknown length (m = ?)
The i-th person replaces the current choice with probability 1/i: 1/1, 1/2, 1/3, ・・・
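A minimal sketch of reservoir sampling (the classic "Algorithm R"), to make the 1/1, 1/2, 1/3 sequence concrete. The function name and parameters are illustrative assumptions; the slides do not show how ROSE implements it.

```python
import random

def reservoir_sample(stream, r, rng=None):
    """Return r uniform samples from an iterable of unknown length,
    using only a list of size r."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if len(reservoir) < r:
            # The first r items always fill the reservoir.
            reservoir.append(item)
        else:
            # Item i is kept with probability r / i and, if kept,
            # overwrites a uniformly chosen slot.
            j = rng.randrange(i)
            if j < r:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100), 5, random.Random(1))
short = reservoir_sample(range(3), 5)  # stream shorter than the reservoir
```

With r = 1 the i-th item replaces the current choice with probability 1/i, matching the example above, and memory stays at r items however long the stream is.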

ROSE - Requirements
• The cache needs to capture query similarity => LSH (Locality Sensitive Hashing)
• Cache size must not scale with the volume of queries => Reservoir sampling algorithm
• Lookup cost must be constant-time

Online retrieval

Online retrieval
1. Compute the LSH signature of the query and look up the corresponding bucket in the hash tables
2. Rank the cached queries within the bucket by similarity to the new search query and return the top result
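The two steps above can be sketched as follows. This is a toy illustration under stated assumptions, not ROSE's implementation: it uses plain MinHash over whitespace tokens with one hash per table, MD5 as a stand-in hash function, unbounded buckets, and exact Jaccard similarity for the final ranking. The class and function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def tokenize(query):
    """Lowercased whitespace tokens as a set (assumed tokenization)."""
    return set(query.lower().split())

def minhash(tokens, table_id):
    """One MinHash value per table: the smallest salted hash over the tokens."""
    def h(token):
        digest = hashlib.md5(f"{table_id}:{token}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return min(h(t) for t in tokens)

def jaccard(a, b):
    return len(a & b) / len(a | b)

class RobustCache:
    def __init__(self, num_tables=8):
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def add(self, query):
        tokens = tokenize(query)
        if not tokens:
            return
        for table_id, table in enumerate(self.tables):
            table[minhash(tokens, table_id)].append(query)

    def lookup(self, query):
        tokens = tokenize(query)
        if not tokens:
            return None
        # Step 1: compute the signature and collect the matching buckets.
        candidates = set()
        for table_id, table in enumerate(self.tables):
            candidates.update(table.get(minhash(tokens, table_id), ()))
        if not candidates:
            return None
        # Step 2: rank the candidates by similarity and return the top one.
        return max(candidates, key=lambda c: (jaccard(tokens, tokenize(c)), c))

cache = RobustCache()
cache.add("nike shoes")
cache.add("adidas running shorts")
reordered = cache.lookup("Shoes Nike")    # same token set as "nike shoes"
unrelated = cache.lookup("laptop stand")  # shares no tokens with the cache
```

Token-level hashing makes word order and case irrelevant. Catching typos such as "shooes" would need a finer-grained representation (for example character n-grams), which this sketch leaves out.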

Online retrieval

Count-based k-selection

Count-based k-selection
Count collisions and rank
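A minimal sketch of the "count collisions and rank" idea, assuming each of the L hash tables returns one bucket of cached queries. The function name and the alphabetical tie-breaking rule are assumptions for illustration, not details taken from the slides.

```python
from collections import Counter

def count_based_k_selection(buckets, k):
    """Rank candidates by the number of buckets they appear in
    and return the top k."""
    counts = Counter()
    for bucket in buckets:
        # Count each candidate at most once per bucket.
        counts.update(set(bucket))
    # Most collisions first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [query for query, _ in ranked[:k]]

buckets = [
    ["nike shoes", "nike socks"],
    ["nike shoes"],
    ["nike shoes", "adidas shoes"],
    ["nike socks"],
]
top2 = count_based_k_selection(buckets, k=2)
```

A candidate that collides with the query in more tables is more likely to be similar to it, so the collision count serves as a cheap ranking score that avoids computing a similarity for every candidate.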

ROSE - Requirements
• The cache needs to capture query similarity => LSH (Locality Sensitive Hashing)
• Cache size must not scale with the volume of queries => Reservoir sampling algorithm
• Lookup cost must be constant-time => Count-based k-selection

Theoretical Analysis
L = number of LSH functions, N = number of queries, T = average number of tokens, B = bucket size, N_B = number of buckets in LSH
• Indexing Step Time Complexity
  ◦ O(L·N·T) => O(N) in practice since L and T are small constants
• Retrieval Step Time Complexity
  ◦ O(LT·BL) => O(1) in practice since L, B, and T are small constants
  ◦ LT = calculating the hash values, BL = k-selection in the combined sets
• Memory Complexity
  ◦ O(L·N_B·B) => memory usage does not grow with the number of queries N

Deployment in Amazon.com
• ROSE for query rewrite
  ◦ Rewrite queries to improve the cache hit ratio and the search experience
• ROSE for Product Type Annotation
  ◦ Identify the correct product type from the query and apply a product type filter

Experiment Results
Cached the intended product type of 5-10 million frequent queries and measured metrics with and without product type recognition

Experiment Results
With ROSE, most of the search traffic is covered with single-digit millisecond latency

Summary
• ROSE improved both
  ◦ search performance, by rewriting tail queries and filtering by the query's product type
  ◦ search latency, by robust caching
• Several algorithms (LSH, MinHash, reservoir sampling) are used to reduce time and space complexity
• Query similarity precision is kept by preserving lexical similarity and product type

References
• ROSE: Robust Caches for Amazon Product Search
• Locality Sensitive Hashing (LSH): The Illustrated Guide
• Fast Similarity Search with MinHash (MinHashによる高速な類似検索)
• Some Rare LSH Gems for Large-scale Machine Learning

Thanks!
