Autocomplete: The Tale of the First Few Keystrokes

Elastic Co
February 17, 2016

Autocomplete is a key feature of search applications. Good autocomplete solutions get users to the right content in a few keystrokes while bad ones frustrate and confuse. This talk will explore technical considerations in designing such systems, what Elasticsearch has to offer, and what’s to come next.

Transcript

  1. Who am I?
    • Apache Lucene committer & PMC member
    • Elasticsearch core developer
    • Work:
    • Email: [email protected]
  2. Agenda
    1. What is autocomplete?
    2. The expectation
    3. The reality
    4. The tale of the first few keystrokes
    5. What is in the pipeline?
    Before diving deep, we will first [define] what we mean by autocomplete. We will scope out some expected criteria for a good autocomplete system: things to consider when we design such a system. Then we take a look into what data structures and algorithms are used internally to ensure …
  4. Agenda: (1) What is autocomplete? (2) The expectation (3) The reality (4) The tale of the first few keystrokes (5) What is in the pipeline?
  5. In most e-commerce sites, autocomplete is a user’s first point of contact, so user experience is important!
  6. Be Responsive!
    • Must serve a request per keystroke
    • Should be as fast as a user types
    • Must not serve suggestions unrelated to what the user has typed
  7. Be Relevant!
    • Must have a mechanism to rank suggestions according to business needs
    • Good suggestion weights:
      • Improve search precision
      • Allow serving fewer suggestions without compromising quality
  8. Be Forgiving!
    • Should tolerate user typos
      • “San Fransico” should match “San Francisco”
    • Should allow for business-relevant analysis of the user query
    • Example:
      • Lower casing and stop word removal
        • “incredibles” should match “The Incredibles”
      • Synonyms
        • “usa” should match “america”
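
The analysis examples on the "Be Forgiving!" slide can be illustrated with a minimal Python sketch. It assumes a tiny stop-word list and synonym table; the names `STOP_WORDS`, `SYNONYMS`, `analyze`, and `matches` are hypothetical and are not part of Elasticsearch.

```python
# Hypothetical analysis chain: lowercase, drop stop words, map synonyms.
STOP_WORDS = {"the", "a", "an"}   # assumed stop-word list
SYNONYMS = {"usa": "america"}     # assumed synonym table


def analyze(text: str) -> str:
    """Normalize text; the same chain is applied to suggestions and queries."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return " ".join(tokens)


def matches(query: str, suggestion: str) -> bool:
    """True when the analyzed query is a prefix of the analyzed suggestion."""
    return analyze(suggestion).startswith(analyze(query))
```

With this sketch, `matches("incredibles", "The Incredibles")` and `matches("usa", "America")` both hold, because both sides pass through the same normalization.
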
  9. Agenda: (1) What is autocomplete? (2) The expectation (3) The reality (4) The tale of the first few keystrokes (5) What is in the pipeline?
  10. Steps to autocomplete
    1. Curate suggestions and assign weights
    2. Index as completions using proper analyzer
    3. Test suggestion quality
    4. Repeat until profit!
  11. What can Elasticsearch do?
    Be Responsive!
      • Support high query rate for prefix queries
      • Memory efficient index for heap residency
      • Index ideal for concurrency
    Be Relevant!
      • Search algorithm supports sorting by index-time weight in one pass
      • Support near-real time search
      • Support filtering and boosting suggestions
    Be Forgiving!
      • Support analyzers at index and query time
      • Support typo-tolerant (fuzzy) suggestions
  12. What can you do?
    Be Responsive!
      • Accommodate unique request pattern
      • Minimize network latency
      • Prefer single shard index
      • Simplify query analysis
    Be Relevant!
      • Invest in suggestion weights
      • Minimize number of suggestions served
      • Update suggestions to reflect the latest and greatest
      • Cleanse suggestion entries
    Be Forgiving!
      • Choose suitable index and query time analysis
      • Use typo-tolerant suggester appropriately
  13. Agenda: (1) What is autocomplete? (2) The expectation (3) The reality (4) The tale of the first few keystrokes (5) What is in the pipeline?
  14. The Index - Weighted Finite State Transducer (wFST)
    • Conceptually a SortedMap optimized for fast lookup on key prefixes
    • Memory efficient data structure
      • 50% larger than gzip compressed [1]
    • Supports high query rate
      • Can be searched at a rate of 275,000 queries/sec [1]
    • Implementation is optimized for concurrency
      • Write-once & read-only
      • In-memory (byte[]-serialized)
    [1] - http://aaron.blog.archive.org/2013/05/29/worlds-biggest-fst/
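
The "SortedMap optimized for fast lookup on key prefixes" view can be sketched in Python with a sorted key list and binary search. This is only a stand-in for the concept: a real FST also shares prefixes and stores outputs on its arcs, which this sketch does not attempt. The class name `PrefixMap` is hypothetical.

```python
import bisect


class PrefixMap:
    """Sorted keys plus binary search: a stand-in for the SortedMap view."""

    def __init__(self, entries):
        # entries: dict of key -> value; keys are kept sorted for range scans
        self._keys = sorted(entries)
        self._values = [entries[k] for k in self._keys]

    def prefix_items(self, prefix):
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)  # first key >= prefix
        out = []
        for i in range(start, len(self._keys)):
            if not self._keys[i].startswith(prefix):
                break  # keys are sorted, so no later key can match
            out.append((self._keys[i], self._values[i]))
        return out
```

Because all keys sharing a prefix are adjacent in sorted order, a lookup costs one binary search plus a scan over the matching range only.
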
  15. Input     Output    Weight
      apple     Apple     3
      apricot   Apricot   2
      banana    Banana    4
      beets     Beets     3
    • Shares key prefixes
    • Encodes metadata in edges
    • Pushdown weights
  16. Input     Output    Weight
      apple     Apple     3
      apricot   Apricot   2
      banana    Banana    4
      beets     Beets     3
    • Example query prefix: “ap”
    • Prune search path for query prefix
    • Minimum weights on the next edges guide an informed search, so suggestions are collected in ranking order
    • Terminate early once enough suggestions are collected
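
The search described on this slide (prune to the prefix, follow the most promising edges first, stop early) can be sketched with a plain trie and a priority queue. This is an illustration, not the Lucene implementation: the slide speaks of minimum edge weights, which would correspond to treating weights as costs, while this sketch stores the highest weight reachable below each node, an assumption made for readability. `Node`, `build`, and `suggest` are hypothetical names.

```python
import heapq
from itertools import count


class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.best = 0        # highest weight reachable at or below this node
        self.entry = None    # (output, weight) if a key ends here


def build(entries):
    """entries: iterable of (input, output, weight) rows."""
    root = Node()
    for key, output, weight in entries:
        node = root
        node.best = max(node.best, weight)
        for ch in key:
            node = node.children.setdefault(ch, Node())
            node.best = max(node.best, weight)
        node.entry = (output, weight)
    return root


def suggest(root, prefix, k):
    """Return up to k (output, weight) pairs for prefix, heaviest first."""
    node = root
    for ch in prefix:                     # prune: follow only the prefix path
        node = node.children.get(ch)
        if node is None:
            return []
    tie = count()                         # tie-breaker so payloads are never compared
    heap = [(-node.best, next(tie), node, False)]
    results = []
    while heap and len(results) < k:      # early termination at k results
        _neg, _, cur, is_entry = heapq.heappop(heap)
        if is_entry:
            results.append(cur)
            continue
        if cur.entry is not None:
            heapq.heappush(heap, (-cur.entry[1], next(tie), cur.entry, True))
        for child in cur.children.values():
            heapq.heappush(heap, (-child.best, next(tie), child, False))
    return results
```

Each node's `best` is an upper bound on everything below it, so the first completed entry popped from the queue is the heaviest remaining one, and the search can stop as soon as `k` entries are collected.
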
  17. Notes
    • Query prefix is represented as an automaton
    • Levenshtein automaton used for typo-tolerance in query prefix
    • Can support regular expressions
    • Implementation: LUCENE-3842 / ES completion suggester
    Figure: Levenshtein automaton [1]
    [1] - http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
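
To make the idea of a typo-tolerant query prefix concrete, here is a small Python sketch. It does not build a Levenshtein automaton; it swaps in a plain dynamic-programming edit distance, computed against every prefix of a candidate, which gives the same accept/reject answer for a single candidate but without the automaton's efficiency. The function names are hypothetical.

```python
def prefix_edit_distance(query: str, candidate: str) -> int:
    """Smallest Levenshtein distance between query and any prefix of candidate."""
    # prev[j] = distance between the query consumed so far and candidate[:j]
    prev = list(range(len(candidate) + 1))
    for i, qc in enumerate(query, start=1):
        cur = [i]
        for j, cc in enumerate(candidate, start=1):
            cur.append(min(
                prev[j] + 1,                # delete qc from the query
                cur[j - 1] + 1,             # insert cc into the query
                prev[j - 1] + (qc != cc),   # match or substitute
            ))
        prev = cur
    return min(prev)


def fuzzy_suggest(query, candidates, max_edits=1):
    """Keep candidates that have a prefix within max_edits of the query."""
    return [c for c in candidates if prefix_edit_distance(query, c) <= max_edits]
```

For example, `fuzzy_suggest("san fransico", ["san francisco", "san diego"], max_edits=2)` keeps only "san francisco".
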
  18. Suggestion filtering
    • Suggestions are prefixed with a context value in the wFST
      • A suggestion of “star wars” with a context of “dvd” is indexed as “dvd_starwars”
    • Query is prefixed with context value to filter out irrelevant suggestions
      • A user query of “st” with a context of “dvd” generates “dvd_st”
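
The context-prefix trick can be sketched in a few lines of Python. The separator `_` follows the "dvd_..." example on the slide; unlike the slide's "dvd_starwars", this sketch keeps spaces in the suggestion text, and the helper names are hypothetical.

```python
SEP = "_"  # assumed separator, following the "dvd_..." example on the slide


def index_key(context: str, suggestion: str) -> str:
    """Key stored in the index: the context value prefixed to the suggestion."""
    return f"{context}{SEP}{suggestion}"


def suggest_in_context(keys, context, query):
    """Return suggestions stored under `context` that start with `query`."""
    wanted = index_key(context, query)       # e.g. "dvd" + "st" -> "dvd_st"
    strip = len(context) + len(SEP)
    return [k[strip:] for k in sorted(keys) if k.startswith(wanted)]
```

Since filtering becomes an ordinary prefix match on the combined key, entries from other contexts are never visited by a prefix-pruned search.
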
  19. Agenda: (1) What is autocomplete? (2) The expectation (3) The reality (4) The tale of the first few keystrokes (5) What is in the pipeline?
  20. Improvements
    • Link suggestion entries to documents
      • wFST entries store an additional unique document id
      • Facilitates near-real time search
      • Enables retrieving arbitrary document fields
      • One step closer to using wFST index in normal queries
    • Suggestion boosting
    • See: LUCENE-6339 / ES Completion Suggester post-2.x
  21. Suggestion boosting
    • Idea: boost suggestions based on context value by adjusting the edge weights
    • Using geohash as context, the search can be biased towards entries whose geohash context is closer to that of the query
    • Example: boosting scheme for a query geohash “gfm673rhb8”
      Context geohash   Distance from query location   Boost factor
      gfmvd5u3rv        75                             3
      gfmj8qvdsb        167                            2
      gfz0zxuu70        324                            1
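
One way to read the boosting table is as a rank-based scheme: the nearest context gets the largest factor. The Python sketch below reproduces the table's factors under that assumption. The slide does not say how boosts are derived or how they are combined with index-time weights, so both `rank_boosts` and the multiplication in `boosted_weight` are hypothetical.

```python
def rank_boosts(distances):
    """Map each context to a boost factor: the nearest gets the largest factor."""
    ordered = sorted(distances, key=distances.get)   # nearest first
    n = len(ordered)
    return {ctx: n - i for i, ctx in enumerate(ordered)}


def boosted_weight(weight, context, boosts):
    """Assumed combination rule: multiply the index-time weight by the boost."""
    return weight * boosts.get(context, 1)


# Distances taken from the table on the slide (unit not stated there).
distances = {"gfmvd5u3rv": 75, "gfmj8qvdsb": 167, "gfz0zxuu70": 324}
boosts = rank_boosts(distances)
```
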
  22. Conclusion
    • Autocomplete systems must be responsive, serve relevant results and handle common user omissions and typos
    • Good autocomplete solutions will guide users to the right content in a few keystrokes
    • In practice, quality suggestions with proper weights are necessary for a great autocomplete solution
    • Completion (wFST) indices are optimized for ranked prefix queries