
Building Prefix Search as a Service

Jay Shenk
January 17, 2018

Learn how we built Prefixy: a highly scalable, query optimized, hosted prefix search service for building autocomplete suggestions.


Transcript

  1. BUILDING PREFIX SEARCH AS A SERVICE

    How to build a Google-style autocomplete powering engine for suggestions
  2. WHAT IS PREFIXY?

    • A hosted prefix search engine that powers autocomplete suggestions
    • Dynamically updates and ranks autocomplete suggestions based on user input
    • Easy-to-use service for app developers to implement on any search field in their app
  3. OVERVIEW

    • What we prioritized during the research and development stages of the project
    • How we thought about tradeoffs as we designed and built a system from scratch
    • How we chose the right data structures, algorithms, and data stores for the system
    • How we built a system with the flexibility to scale as we get more users
  4. IMPORTANT TERMS

    Terms: prefix, completions, score, suggestions, selection

    prefix: ca

    completions (with scores):
      cade maggio          99
      caleb runte          50
      camilia wintheiser   49
      camryn hauck         40
      cara block           38
      cameron brown        10
      catie leeroy         10
      ...                  ...

    suggestions (the top of the completions list):
      cade maggio          99
      caleb runte          50
      camilia wintheiser   49
      camryn hauck         40
      cara block           38
  5. DESIGN GOALS

    Requirements
    • Must have lightning fast reads
    • Suggestions should be dynamically ranked and relevant to the app user

    Implications & Approach
    • We want to prioritize speed of reads
    • We need a ranking algorithm
  6. THREE MEASUREMENTS TO CONSIDER FOR BIG O

    • N - the number of keys/nodes (e.g. prefixes) in our dataset
    • K - the number of completions for a given prefix
    • L - the length of the string we are looking up
  7. SOME COMPLETIONS WITH SCORES

    “car”  → 30
    “cat”  → 90
    “cod”  → 10
    “cart” → 10
    “coin” → 1
    “cold” → 5

    How should we store these?
  8. TRIE IS A NATURAL FIT FOR PREFIX SEARCH

    Allows for O(L) lookup of a prefix. All descendants of a given node share that node as a common prefix, so shared prefixes are stored once and consume less space.

    root
    └─ c
       ├─ a
       │  ├─ r: 30
       │  │  └─ t: 10
       │  └─ t: 90
       └─ o
          ├─ d: 10
          ├─ i
          │  └─ n: 1
          └─ l
             └─ d: 5
  9. SEARCHING FOR COMPLETIONS IN A TRIE

    Search for all completions that start with “ca”:

    1. Find the prefix: O(L)
    2. Find all completions (and put them into an array): O(N)
       [car:30, cart:10, cat:90]
    3. Sort the completions: O(K log K)

    Total: O(L) + O(N) + O(K log K)

    root
    └─ c
       ├─ ca
       │  ├─ car: 30
       │  │  └─ cart: 10
       │  └─ cat: 90
       └─ co
          ├─ cod: 10
          ├─ coi
          │  └─ coin: 1
          └─ col
             └─ cold: 5
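The three steps on slide 9 can be sketched in a few lines of JavaScript. This is a minimal illustration rather than Prefixy's actual code: the node shape and the names `createNode`, `insert`, and `search` are assumptions made for the example.

```javascript
// Minimal trie sketch. The node shape and function names are illustrative
// assumptions, not taken from the Prefixy source.
function createNode() {
  // score stays null unless this node marks the end of a completion.
  return { children: new Map(), score: null };
}

function insert(root, word, score) {
  let node = root;
  for (const ch of word) {
    if (!node.children.has(ch)) node.children.set(ch, createNode());
    node = node.children.get(ch);
  }
  node.score = score;
}

function search(root, prefix) {
  // Step 1: walk down to the prefix node, O(L).
  let node = root;
  for (const ch of prefix) {
    node = node.children.get(ch);
    if (!node) return [];
  }

  // Step 2: collect every completion stored below that node.
  const found = [];
  const stack = [[node, prefix]];
  while (stack.length > 0) {
    const [current, text] = stack.pop();
    if (current.score !== null) {
      found.push({ completion: text, score: current.score });
    }
    for (const [ch, child] of current.children) {
      stack.push([child, text + ch]);
    }
  }

  // Step 3: sort by score, highest first, O(K log K).
  return found.sort((a, b) => b.score - a.score);
}

const root = createNode();
const data = [['car', 30], ['cat', 90], ['cod', 10], ['cart', 10], ['coin', 1], ['cold', 5]];
for (const [word, score] of data) insert(root, word, score);

const results = search(root, 'ca');
console.log(results.map(r => `${r.completion}:${r.score}`)); // [ 'cat:90', 'car:30', 'cart:10' ]
```

The traversal in step 2 is the expensive part, which is what the following slides set out to remove.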
  10. STORING COMPLETIONS

    In each node, we can store all completions that begin with that node's prefix.

    1. Find the prefix: O(L)
    2. Find all completions: already stored in the node, so the O(N) traversal goes away
    3. Sort the completions: O(K log K)

    Total: O(L) + O(K log K)

    But this comes at a cost! More space consumed and more writes.

    Example buckets:
      c      [cold, coin, cod, cat, cart, car]*
      ca     [car:30, cart:10, cat:90]
      car    [car, cart]*
      cart   [cart]*

    * Scores omitted for presentation purposes
  11. FURTHER OPTIMIZATIONS

    Diminishing returns of large L and K

    Limit max length of prefixes we store (hold L constant)
    • “Tyrannosaurus Rex lived during the late Cretaceous period”

    Limit completion bucket size in node (hold K constant)
    • Only need enough to support suggestions and ranking

    O(L) + O(K log K) → O(1) + O(1) → O(1)
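Holding L constant amounts to capping how many prefixes are generated for each completion. The sketch below assumes a hypothetical cap of 15 characters; the slides do not state the value Prefixy uses, and although the slide 35 code calls `extractPrefixes`, its body is not shown, so this implementation is a guess.

```javascript
// Hypothetical cap; the slides do not give the actual limit.
const MAX_PREFIX_LENGTH = 15;

// Returns every prefix of `completion`, up to `maxLength` characters long.
function extractPrefixes(completion, maxLength = MAX_PREFIX_LENGTH) {
  const prefixes = [];
  const end = Math.min(completion.length, maxLength);
  for (let i = 1; i <= end; i++) {
    prefixes.push(completion.slice(0, i));
  }
  return prefixes;
}

const sentence = 'Tyrannosaurus Rex lived during the late Cretaceous period';
const prefixes = extractPrefixes(sentence);

console.log(prefixes.length);               // 15
console.log(prefixes[prefixes.length - 1]); // Tyrannosaurus R
```

With the cap in place, a long completion costs a bounded number of writes regardless of its length.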
  12. PREFIX HASH TREE

    • One-step access, as we no longer need to traverse.
    • Easy to implement with a key-value NoSQL data store (like Redis!)

    key    value
    c      [car:30, cat…]
    ca     [car:30, cat…]
    cat    [cat:90]
    car    [car:30, cart:10]
    cart   [cart:10]
    co     [cod:10, coin…]
    coi    [coin:1]
    coin   [coin:1]
    col    [cold:5]
    cold   [cold:5]
    cod    [cod:10]
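A prefix hash tree can be mimicked with an ordinary hash map keyed by prefix. The sketch below uses a JavaScript `Map` as a stand-in for the key-value store; `store`, `addCompletion`, and `lookup` are names invented for the example.

```javascript
// prefix -> Map(completion -> score). A plain Map stands in for the
// key-value store; all names here are illustrative.
const store = new Map();

// Writes the completion into the bucket of every one of its prefixes.
function addCompletion(completion, score) {
  for (let i = 1; i <= completion.length; i++) {
    const prefix = completion.slice(0, i);
    if (!store.has(prefix)) store.set(prefix, new Map());
    store.get(prefix).set(completion, score);
  }
}

// One hash lookup, no traversal; the bucket is then ranked by score.
function lookup(prefix) {
  const bucket = store.get(prefix) || new Map();
  return [...bucket.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([completion, score]) => `${completion}:${score}`);
}

const data = [['car', 30], ['cat', 90], ['cod', 10], ['cart', 10], ['coin', 1], ['cold', 5]];
for (const [completion, score] of data) addCompletion(completion, score);

console.log(store.size);    // 11 keys, matching the table on slide 12
console.log(lookup('car')); // [ 'car:30', 'cart:10' ]
console.log(lookup('co'));  // [ 'cod:10', 'cold:5', 'coin:1' ]
```

The cost shows up on the write side: each completion is duplicated into one bucket per prefix.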
  13. BIG O SUMMARY

    Structure                Search                     Insert   Update / Delete   Space consideration
    Trie                     O(L) + O(N) + O(K log K)   O(L)     O(L)              completions share prefixes
    Trie with completions    O(L) + O(K log K)          O(L)     O(LK)             bucket of completions at each prefix node
      hold L constant        O(K log K)                 O(1)     O(K)              reduces number of prefixes we store
      hold K constant        O(1)                       O(1)     O(1)              caps the size of the completions bucket
    Prefix hash tree w/      O(1), no traversal!        O(1)     O(1)              slightly more space to accommodate hash table
    completions, constant                                                          allocation, plus have to duplicate prefixes
    L & K                                                                          (“c”, “ca”, “car”)
  14. WHY REDIS?

    • Entirely in-memory: meets our performance requirement
    • Native in-memory data structures for managing completions

    key    value
    c      [car:30, cat…]
    ca     [car:30, cat…]
    cat    [cat:90]
    car    [car:30, cart:10]
    cart   [cart:10]
    co     [cod:10, coin…]
    coi    [coin:1]
    coin   [coin:1]
    col    [cold:5]
    cold   [cold:5]
    cod    [cod:10]

    Which Redis data structure to use for completions?
  15. REDIS LIST

    Key:   h
    Value: hi:50 → hello:44 → help:30 → how are you:30 → happy:29 → how many:29 → here:28 → …

    • Lists in Redis are a type of linked list
    • Access head and tail nodes in O(1)
    • Access other nodes in O(K)
  16. REDIS LIST - SEARCH

    List: hi:50 → hello:44 → help:30 → how are you:30 → happy:29 → how many:29 → here:28 → …

    1. LRANGE to get first 5 nodes: O(1)
       ['hi:50', 'hello:44', 'help:30', 'how are you:30', 'happy:29']
  17. REDIS LIST - INCREMENT

    List: hi:50 → hello:44 → help:30 → how are you:30 → happy:29 → how many:29 → here:28 → …

    1. LRANGE to get entire list: O(K)
       ['hi:50', 'hello:44', 'help:30', 'how are you:30', 'happy:29', 'how many:29', 'here:28', ...]
  18. REDIS LIST - INCREMENT

    List: hi:50 → hello:44 → help:30 → how are you:30 → happy:29 → how many:29 → here:28 → …

    2. Binary search of returned array to find and increment completion: O(log K)
       ['hi:50', 'hello:44', 'help:30', 'how are you:30', 'happy:30', 'how many:29', 'here:28', ...]
  19. REDIS LIST - INCREMENT

    List: hi:50 → hello:44 → help:30 → how are you:30 → happy:29 → how many:29 → here:28 → …

    3. Binary search for insertion in new location: O(log K)
       ['hi:50', 'hello:44', 'happy:30', 'help:30', 'how are you:30', 'how many:29', 'here:28', ...]
  20. REDIS LIST - INCREMENT

    List: hi:50 → hello:44 → help:30 → how are you:30 → happy:29 → how many:29 → here:28 → …

    4. LREM to remove completion from its current position: O(K)
  21. REDIS LIST - INCREMENT

    List: hi:50 → hello:44 → happy:30 → help:30 → how are you:30 → how many:29 → here:28 → …

    5. LINSERT to insert completion in its new position: O(K)
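The client-side part of the list increment (steps 2 and 3) can be sketched as a function that takes the array returned by LRANGE and works out what to pass to LREM and LINSERT. This is an illustration under assumptions: `parse` and `planIncrement` are invented names, and the entry is located with a linear scan here, because the list is ordered by score rather than by name; binary search is used only to find the new position.

```javascript
// Splits 'how are you:30' into its completion and numeric score.
function parse(entry) {
  const idx = entry.lastIndexOf(':');
  return { completion: entry.slice(0, idx), score: Number(entry.slice(idx + 1)) };
}

// Given the full list (highest score first), returns the element to remove
// (for LREM), the element to insert, and the pivot to insert it before
// (for LINSERT ... BEFORE). A null pivot means "append at the tail".
function planIncrement(list, completion) {
  // Linear scan: an assumption made for this sketch, see the note above.
  const current = list.find(entry => parse(entry).completion === completion);
  if (current === undefined) return null;

  const newScore = parse(current).score + 1;
  const updated = `${completion}:${newScore}`;
  const rest = list.filter(entry => entry !== current);

  // Binary search for the first entry whose score is not above the new score.
  let lo = 0;
  let hi = rest.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (parse(rest[mid]).score > newScore) lo = mid + 1;
    else hi = mid;
  }

  return { remove: current, insert: updated, before: lo < rest.length ? rest[lo] : null };
}

const list = ['hi:50', 'hello:44', 'help:30', 'how are you:30', 'happy:29', 'how many:29', 'here:28'];
const plan = planIncrement(list, 'happy');
console.log(plan); // { remove: 'happy:29', insert: 'happy:30', before: 'help:30' }
```

Even with the searching done in application code, the update still needs a read followed by writes, which is where the two round trips on the next slide come from.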
  22. REDIS LIST

    Reads: O(1)    Writes: O(K)    Round trips: 2

    Pros
    • O(1) search

    Cons
    • 2 round trips per update
    • May have concurrency issues
    • Large payload
    • No uniqueness guarantee
    • Have to sort in JavaScript instead of on DB level

    List: hi:50 → hello:44 → happy:30 → help:30 → how are you:30 → how many:29 → here:28 → …
  23. REDIS SORTED SET

    Key: h

    completion     score
    hi             50
    hello          44
    help           30
    how are you    30
    happy          29
    how many       29
    here           28
    ...            ...

    • Sorted sets in Redis are implemented with skip lists
    • Handles uniqueness and order
    • Most operations are O(log K)
  24. REDIS SORTED SET - SEARCH

    completion     score
    hi             50
    hello          44
    help           30
    how are you    30
    happy          29
    how many       29
    here           28
    ...            ...

    1. ZRANGE to get first 5 elements: O(log K)
       ['hi', 'hello', 'help', 'how are you', 'happy']
  25. REDIS SORTED SET - INCREMENT

    1. ZINCRBY to increment score: O(log K)

    Redis handles order and uniqueness constraint.

    completion     score
    hi             50
    hello          44
    happy          30   (moved up from 29)
    help           30
    how are you    30
    how many       29
    here           28
    ...            ...
  26. REDIS SORTED SET

    Reads: O(log K)    Writes: O(log K)    Round trips: 1

    Pros
    • Fewer round trips
    • Less chance of concurrency issues
    • Smaller payloads
    • Uniqueness guarantee
    • Faster than doing it in JS
    • Non-blocking

    Cons
    • Search is technically O(log K)

    completion     score
    hi             50
    hello          44
    happy          30
    help           30
    how are you    30
    how many       29
    here           28
    ...            ...
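To make the sorted-set behavior concrete without a running Redis server, here is a small in-memory stand-in. `MiniSortedSet` is a hypothetical helper written for this example, not a Redis client; it only imitates how ZINCRBY and ZRANGE treat scores, ties, and uniqueness, and it supports non-negative indexes only. The example stores scores as negatives so that ascending order puts the most popular completion first, which is what the `-1` increments in the slide 35 code suggest Prefixy does.

```javascript
// In-memory stand-in for a single Redis sorted set (hypothetical helper).
class MiniSortedSet {
  constructor() {
    this.scores = new Map(); // member -> score; Map keys enforce uniqueness
  }

  // Like ZINCRBY: a missing member starts from 0 before the increment.
  zincrby(increment, member) {
    const next = (this.scores.get(member) || 0) + increment;
    this.scores.set(member, next);
    return next;
  }

  // Like ZRANGE start stop: ascending by score, ties broken by member name,
  // stop index inclusive. Non-negative indexes only.
  zrange(start, stop) {
    return [...this.scores.entries()]
      .sort((a, b) => a[1] - b[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
      .map(([member]) => member)
      .slice(start, stop + 1);
  }
}

const bucket = new MiniSortedSet();
const seed = [['hi', 50], ['hello', 44], ['help', 30], ['how are you', 30],
  ['happy', 29], ['how many', 29], ['here', 28]];
// Negative scores: the highest-ranked completion gets the lowest score.
for (const [completion, score] of seed) bucket.zincrby(-score, completion);

bucket.zincrby(-1, 'happy'); // one more selection of "happy": 29 becomes 30
console.log(bucket.zrange(0, 4)); // [ 'hi', 'hello', 'happy', 'help', 'how are you' ]
```

A single increment call both updates the score and repositions the member, which is why the sorted set needs only one round trip per write.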
  27. MAINTAINING BUCKET LIMIT (K)

    1. User submits a search for “javascript”
    2. Since our bucket is full, we remove the lowest ranked completion
    3. Insert “javascript” with “jshint”’s score plus 1

    Before:
    completion     score
    ...            ...
    java           15
    jquery         10
    jshint         10

    After:
    completion     score
    ...            ...
    java           15
    javascript     11
    jquery         10
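The replacement rule on slide 27 can be sketched with a plain `Map` acting as the bucket. `recordSelection` and the limit of 3 are assumptions for the example, as is the starting score of 1 for a new completion when the bucket still has room.

```javascript
// bucket: Map(completion -> score), positive scores as drawn on the slide.
function recordSelection(bucket, completion, limit) {
  // Already in the bucket: a plain increment.
  if (bucket.has(completion)) {
    bucket.set(completion, bucket.get(completion) + 1);
    return;
  }

  // Room left: add it with a starting score of 1 (an assumption).
  if (bucket.size < limit) {
    bucket.set(completion, 1);
    return;
  }

  // Bucket full: find the lowest ranked completion. On ties, the entry
  // added last is treated as lowest.
  let lowest = null;
  for (const [member, score] of bucket) {
    if (lowest === null || score <= lowest.score) lowest = { member, score };
  }

  // Replace it, giving the newcomer the evicted score plus 1.
  bucket.delete(lowest.member);
  bucket.set(completion, lowest.score + 1);
}

const bucket = new Map([['java', 15], ['jquery', 10], ['jshint', 10]]);
recordSelection(bucket, 'javascript', 3);

const ranked = [...bucket.entries()].sort((a, b) => b[1] - a[1]);
console.log(ranked); // [ [ 'java', 15 ], [ 'javascript', 11 ], [ 'jquery', 10 ] ]
```

Giving the newcomer one point more than the evicted entry keeps it from being evicted immediately by the next new completion.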
  28. WHAT IF WE RUN OUT OF MEMORY?

    • Set an LRU policy in Redis
    • Persist to MongoDB: able to store more than what we can fit in memory
    • Reads still fast: generally 1 trip per search, more trips for updates

    Prefixy → Redis → MongoDB

    1. Always check Redis first
    2. If we have a cache miss, check Mongo
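The read path on slide 28 follows a cache-first pattern. In the sketch below two `Map` objects stand in for Redis and MongoDB so the example runs on its own; the names `cache`, `database`, and `search` are illustrative, and LRU eviction is not modeled.

```javascript
const cache = new Map();    // stands in for Redis
const database = new Map(); // stands in for MongoDB

function search(prefix) {
  // 1. Always check the cache first.
  if (cache.has(prefix)) {
    return { source: 'cache', completions: cache.get(prefix) };
  }

  // 2. On a cache miss, fall back to the database.
  const stored = database.get(prefix);
  if (stored === undefined) {
    return { source: 'none', completions: [] };
  }

  // Load the result into the cache so the next read takes a single trip.
  cache.set(prefix, stored);
  return { source: 'database', completions: stored };
}

database.set('ca', ['cat', 'car', 'cart']);

const first = search('ca');  // cache miss, served from the database
const second = search('ca'); // now served from the cache
console.log(first.source, second.source); // database cache
```

The slide 34 code follows the same shape: it queries Redis, and if the result is empty it calls `mongoLoad` and queries Redis again.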
  29. TOKEN GENERATION + AUTHENTICATION WORKFLOW

    Components: Token Generator, client.js, Prefixy

    1. Jane, an app developer, visits token generator to get her JWT + custom scripts
    2. Token generator creates a unique tenant ID, and then encrypts it into a JWT
    3. Jane includes her custom script in the frontend code of her web application
    4. Now any request to Prefixy sent from Jane’s site will include her JWT
    5. Server decrypts JWT to get tenant ID
  30. MULTI-TENANCY WORKFLOW

    Components: Jane’s client.js, Prefixy, Redis, MongoDB

    1. Searches from Jane’s client.js will send the request to Prefixy with Jane’s JWT
    2. Prefixy decrypts the JWT to get Jane’s tenant ID
    3. Prefixy uses Jane’s tenant ID to get the data that is specific to her site
  31. MULTI-TENANCY ON THE BACK-END

    Redis: we prepend every key of Jane’s data with her tenant ID.

      key              value
      <tenantId>:c     ...
      <tenantId>:ca    ...
      <tenantId>:cam   ...

    Mongo: we allocate a Mongo collection to Jane, and the name of this collection is her tenant ID.

      collection <tenantId>
        { prefix, completions }
        { prefix, completions }
        e.g. { prefix: ‘c’, completions: [...] }
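Key namespacing of this kind takes very little code. The slide 34 and 35 code calls an `addTenant` helper whose body is not shown; the version below is a guess based on the `<tenantId>:c` key format above, and the tenant IDs are made up for the example.

```javascript
// Assumed implementation, inferred from the '<tenantId>:<prefix>' key format.
function addTenant(prefix, tenant) {
  return `${tenant}:${prefix}`;
}

// A Map stands in for Redis; the tenant IDs are hypothetical.
const store = new Map();
store.set(addTenant('ca', 'tenantA'), ['cat', 'car']);
store.set(addTenant('ca', 'tenantB'), ['cade maggio', 'caleb runte']);

// The same prefix resolves to different data per tenant.
console.log(store.get(addTenant('ca', 'tenantA'))); // [ 'cat', 'car' ]
console.log(store.get(addTenant('ca', 'tenantB'))); // [ 'cade maggio', 'caleb runte' ]
```

Because the tenant ID is part of every key, all tenants can share a single Redis keyspace without their prefixes colliding.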
  32. CLIENT.JS

    1. When the user types in the search field...

        // Runs whenever the value of the search input changes.
        function valueChanged() {
          const value = this.input.value;

          if (value.length >= this.minChars) {
            this.fetchSuggestions(value, suggestions => {
              this.visible = true;
              this.suggestions = suggestions;
              this.draw();
            });
          } else {
            this.reset();
          }
        }

    2. Get the suggestions from Prefixy

        // Assumes axios is already loaded on the page.
        function fetchSuggestions(query, callback) {
          const params = { prefix: query, token: this.token };

          if (this.suggestionCount) {
            params.limit = this.suggestionCount;
          }

          axios.get(this.completionsUrl, { params })
            .then(response => callback(response.data));
        }
  33. CLIENT.JS

    3. … and show suggestions to user

        function draw() {
          ...
          this.suggestions.forEach((suggestion, index) => {
            const li = document.createElement('li');
            const span1 = document.createElement('span');
            const span2 = document.createElement('span');

            // span1 holds the part the user typed, span2 the rest of the suggestion.
            span1.classList.add('suggestion', 'typed');
            span2.classList.add('suggestion');
            span1.textContent = suggestion.match(typed);
            span2.textContent = suggestion.slice(span1.textContent.length);

            li.appendChild(span1);
            li.appendChild(span2);
            this.listUI.appendChild(li);
          });
        }
  34. SEARCH

        async function search(prefixQuery, tenant, opts = {}) {
          const defaultOpts = { limit: this.suggestionCount, withScores: false };
          opts = { ...defaultOpts, ...opts };
          const limit = opts.limit - 1; // ZRANGE's stop index is inclusive
          const prefix = this.normalizePrefix(prefixQuery);
          const prefixWithTenant = this.addTenant(prefix, tenant);

          let args = [prefixWithTenant, 0, limit];
          if (opts.withScores) args = args.concat('WITHSCORES');

          let result = await this.redisClient.zrangeAsync(...args);

          // Cache miss: load the prefix from MongoDB, then retry.
          if (result.length === 0) {
            await this.mongoLoad(prefix, tenant);
            result = await this.redisClient.zrangeAsync(...args);
          }

          return result;
        }
  35. INCREMENT

        async function increment(completion, tenant) {
          const prefixes = this.extractPrefixes(completion);
          const commands = [];

          for (let i = 0; i < prefixes.length; i++) {
            const prefixWithTenant = this.addTenant(prefixes[i], tenant);
            const count = await this.getCompletionsCount(prefixes[i], tenant, prefixWithTenant);
            const includesCompletion = await this.redisClient.zscoreAsync(prefixWithTenant, completion);

            if (count >= this.limit && !includesCompletion) {
              // Bucket is full and the completion is new: replace the last-ranked entry.
              const lastPosition = this.limit - 1;
              // ZRANGE takes both a start and a stop index.
              const lastElement = await this.redisClient.zrangeAsync(
                prefixWithTenant, lastPosition, lastPosition, 'WITHSCORES');
              // The decrements here and below suggest that scores are stored as
              // negatives, so ZRANGE's ascending order lists the most popular first.
              const newScore = lastElement[1] - 1;
              commands.push(['zremrangebyrank', prefixWithTenant, lastPosition, -1]);
              commands.push(['zadd', prefixWithTenant, newScore, completion]);
            } else {
              commands.push(['zincrby', prefixWithTenant, -1, completion]);
            }
          }

          // Send all queued commands in one batch, then persist the prefixes.
          return this.redisClient.batch(commands).execAsync().then(async () => {
            await this.persistPrefixes(prefixes, tenant);
          });
        }
  36. FUTURE PLANS

    • Custom configurability of L and K
    • Scaling Redis to minimize cache misses
    • Scripting Redis to reduce network requests (e.g. write-through cache)
    • API rate limiter