Autocomplete: The Tale of the First Few Keystrokes

Autocomplete: Tale of the first few Keystrokes
Areek Zillur, Software Engineer, Elastic
17th Feb 2016

Who am I?
• Apache Lucene committer & PMC member
• Elasticsearch core developer
• Work:
• Email: [email protected]

Agenda
1. What is autocomplete?
2. The expectation
3. The reality
4. The tale of the first few keystrokes
5. What is in the pipeline?

Before diving deep, we will first cover what we mean by autocomplete. We will scope out some expected criteria for a good autocomplete system: things to consider when we design such a system. Then we take a look into what data structures and algorithms are used internally to ensure ...

A navigational feature to guide users to the right content in a few keystrokes

Agenda: 1. What is autocomplete? 2. The expectation 3. The reality 4. The tale of the first few keystrokes 5. What is in the pipeline?

On most e-commerce sites, autocomplete is a user’s first point of contact, so user experience is important!

Be Responsive!
• Must serve a request per keystroke
• Should be as fast as a user types
• Must not serve suggestions unrelated to what the user has typed

Be Relevant!
• Must have a mechanism to rank suggestions according to business needs
• Good suggestion weights:
  • Improve search precision
  • Allow serving fewer suggestions without compromising quality

Be Forgiving!
• Should tolerate user typos
  • “San Fransico” should match “San Francisco”
• Should allow for business-relevant analysis of the user query
• Example:
  • Lower casing and stop word removal
    • “incredibles” should match “The Incredibles”
  • Synonyms
    • “usa” should match “america”
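The analysis examples above can be sketched in a few lines. This is only an illustration, not how Elasticsearch analyzers are implemented: the stop word and synonym tables, and the names `analyze` and `matches`, are assumptions made up for the sketch.

```python
# Hypothetical stop word and synonym tables, for illustration only.
STOP_WORDS = {"the", "a", "an", "of"}
SYNONYMS = {"usa": "america"}


def analyze(text):
    """Lowercase, drop stop words, and map synonyms to one canonical token."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(SYNONYMS.get(t, t) for t in kept)


def matches(query, suggestion):
    """True when the analyzed query is a prefix of the analyzed suggestion."""
    return analyze(suggestion).startswith(analyze(query))


if __name__ == "__main__":
    print(matches("incredibles", "The Incredibles"))
    print(matches("usa", "America"))
```

Applying the same analysis at index time and at query time is what lets both sides meet on a common normalized form.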

Agenda: 1. What is autocomplete? 2. The expectation 3. The reality 4. The tale of the first few keystrokes 5. What is in the pipeline?

Steps to autocomplete
1. Curate suggestions and assign weights
2. Index as completions using a proper analyzer
3. Test suggestion quality
4. Repeat until profit!
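As a rough sketch of steps 1 and 2, the request bodies below follow the shape of the completion suggester as documented for recent Elasticsearch releases; the API available at the time of this talk differed in some details. The field name `suggest`, the suggestion name `title-suggest`, the weight value and the helper names are all assumptions for illustration. Nothing here talks to a cluster; the script only builds and prints JSON.

```python
import json

# Mapping: a field of type "completion" (field name "suggest" is hypothetical).
MAPPING = {
    "mappings": {
        "properties": {
            "suggest": {"type": "completion", "analyzer": "simple"}
        }
    }
}


def completion_doc(inputs, weight):
    """A document carrying curated inputs and an index-time weight."""
    return {"suggest": {"input": list(inputs), "weight": weight}}


def suggest_request(prefix, size=5, fuzziness=None):
    """A suggest request body; fuzzy matching is added only when requested."""
    completion = {"field": "suggest", "size": size}
    if fuzziness is not None:
        completion["fuzzy"] = {"fuzziness": fuzziness}
    return {"suggest": {"title-suggest": {"prefix": prefix, "completion": completion}}}


if __name__ == "__main__":
    print(json.dumps(MAPPING, indent=2))
    print(json.dumps(completion_doc(["The Incredibles"], 10), indent=2))
    print(json.dumps(suggest_request("the inc", size=3, fuzziness=1), indent=2))
```

Steps 3 and 4 then amount to sending representative prefixes, inspecting what comes back, and adjusting inputs, weights or the analyzer.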

What can Elasticsearch do?
Be Responsive!
• Support high query rate for prefix queries
• Memory efficient index for heap residency
• Index ideal for concurrency
Be Relevant!
• Search algorithm supports sorting by index-time weight in one pass
• Support near-real-time search
• Support filtering and boosting suggestions
Be Forgiving!
• Support analyzers at index and query time
• Support typo-tolerant (fuzzy) suggestions

What can you do?
Be Responsive!
• Accommodate unique request pattern
• Minimize network latency
• Prefer single shard index
• Simplify query analysis
Be Relevant!
• Invest in suggestion weights
• Minimize number of suggestions served
• Update suggestions to reflect the latest and greatest
• Cleanse suggestion entries
Be Forgiving!
• Choose suitable index and query time analysis
• Use typo-tolerant suggester appropriately

Agenda: 1. What is autocomplete? 2. The expectation 3. The reality 4. The tale of the first few keystrokes 5. What is in the pipeline?

The Index - Weighted Finite State Transducer (wFST)
• Conceptually a SortedMap optimized for fast lookup on key prefixes
• Memory efficient data structure
  • 50% larger than gzip compressed [1]
• Supports high query rate
  • can be searched at a rate of 275,000 queries/sec [1]
• Implementation is optimized for concurrency
  • Write-once & read-only
  • In-memory (byte[]-serialized)
[1] - http://aaron.blog.archive.org/2013/05/29/worlds-biggest-fst/

Input    Output   Weight
apple    Apple    3
apricot  Apricot  2
banana   Banana   4
beets    Beets    3
• Shares key prefixes
• Encodes metadata in edges
• Pushdown weights

Input    Output   Weight
apple    Apple    3
apricot  Apricot  2
banana   Banana   4
beets    Beets    3
• Example query prefix: “ap”
• Prune search path for query prefix
• Minimum weights on the next edges guide an informed search, which ensures suggestions are collected in ranking order
• Terminate early once enough suggestions are collected
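The search described above can be approximated with a plain trie rather than a real FST. The sketch below is a simplification under stated assumptions, not Lucene's implementation: it shares only key prefixes, stores on each node the best weight reachable beneath it (a stand-in for pushed-down weights), and negates weights so that Python's min-heap pops the best candidate first. The names `Node`, `build` and `suggest` are made up for the example.

```python
import heapq
import itertools


class Node:
    def __init__(self):
        self.children = {}   # next character -> Node
        self.output = None   # surface form if a key ends at this node
        self.weight = None   # weight of the key ending at this node
        self.best = 0        # highest weight reachable at or below this node


def build(entries):
    """Build a trie from (input, output, weight) rows, pushing weights toward the root."""
    root = Node()
    for key, output, weight in entries:
        node = root
        node.best = max(node.best, weight)
        for ch in key:
            node = node.children.setdefault(ch, Node())
            node.best = max(node.best, weight)
        node.output, node.weight = output, weight
    return root


def suggest(root, prefix, k):
    """Return up to k (output, weight) pairs for prefix, highest weight first."""
    node = root
    for ch in prefix:                    # prune: only the prefix's subtree is searched
        node = node.children.get(ch)
        if node is None:
            return []
    tie = itertools.count()              # tie-breaker so the heap never compares Nodes
    heap = [(-node.best, next(tie), node, False)]
    results = []
    while heap and len(results) < k:     # early termination once k are collected
        neg, _, current, is_final = heapq.heappop(heap)
        if is_final:
            results.append((current.output, -neg))
            continue
        if current.output is not None:   # a key ends here: queue it as a finished hit
            heapq.heappush(heap, (-current.weight, next(tie), current, True))
        for child in current.children.values():
            heapq.heappush(heap, (-child.best, next(tie), child, False))
    return results


if __name__ == "__main__":
    rows = [("apple", "Apple", 3), ("apricot", "Apricot", 2),
            ("banana", "Banana", 4), ("beets", "Beets", 3)]
    print(suggest(build(rows), "ap", 2))
```

Because every queued node carries an upper bound on what lies beneath it, a finished hit popped from the heap cannot be beaten by anything still queued, so results come out in rank order without a separate sort.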

Notes
• Query prefix is represented as an automaton
  • Levenshtein automaton [1] used for typo-tolerance in query prefix
  • Can support regular expressions
• Implementation: LUCENE-3842 / ES completion suggester
[1] - http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata
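To make the typo-tolerance idea concrete without building an automaton, the sketch below swaps in ordinary dynamic programming: it computes the smallest Levenshtein distance between the query and any prefix of a candidate term. A Levenshtein automaton accepts the same strings far more efficiently when intersected with the index; this version only illustrates what "within n edits of a prefix" means. The function name is an assumption, and transpositions count as two edits here.

```python
def prefix_edit_distance(query, term):
    """Smallest Levenshtein distance between query and any prefix of term."""
    # row[j] = distance between the query consumed so far and term[:j]
    row = list(range(len(term) + 1))
    for i, qc in enumerate(query, start=1):
        prev_diag, row[0] = row[0], i
        for j, tc in enumerate(term, start=1):
            cost = 0 if qc == tc else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,         # drop qc from the query
                row[j - 1] + 1,     # add tc to the query
                prev_diag + cost,   # match or substitute
            )
    # Taking the minimum over all columns lets the term continue past the query.
    return min(row)


if __name__ == "__main__":
    print(prefix_edit_distance("san fran", "san francisco"))
    print(prefix_edit_distance("san fransico", "san francisco"))
```

A suggester would then keep candidates whose distance is within the allowed fuzziness.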

Suggestion filtering
• Suggestions are prefixed with a context value in the wFST
  • a suggestion of “star wars” with a context of “dvd” is indexed as “dvd_starwars”
• Query is prefixed with context value to filter out irrelevant suggestions
  • a user query of “st” with a context of “dvd” generates “dvd_st”
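A minimal sketch of the same trick, using a sorted list and binary search in place of the wFST. The "_" separator mirrors the slide's example; the helper names are hypothetical, and unlike the slide's “dvd_starwars” this sketch keeps the suggestion text unchanged rather than guessing at the normalization applied.

```python
import bisect

SEP = "_"  # separator as written on the slide; a hypothetical choice for this sketch


def index_key(context, text):
    """Prefix a suggestion or a query with its context value."""
    return f"{context}{SEP}{text}"


def build_index(entries):
    """entries: iterable of (context, suggestion); returns a sorted list of keys."""
    return sorted(index_key(context, suggestion) for context, suggestion in entries)


def lookup(index, context, prefix):
    """Return the suggestions stored under context that start with prefix."""
    key = index_key(context, prefix)
    start = bisect.bisect_left(index, key)   # jump straight to the first candidate
    found = []
    for entry in index[start:]:
        if not entry.startswith(key):        # left the matching range: stop scanning
            break
        found.append(entry[len(context) + len(SEP):])
    return found


if __name__ == "__main__":
    idx = build_index([("dvd", "star wars"), ("dvd", "stargate"),
                       ("book", "star maker"), ("dvd", "alien")])
    print(lookup(idx, "dvd", "st"))
```

Filtering costs nothing extra at query time: entries from other contexts live under a different key prefix and are never visited.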

Wait, why not just use a prefix query?

Query performance with increasing prefix length
[Chart: y-axis KQPS (0 to 40), x-axis prefix length (1 to 6)]

Agenda: 1. What is autocomplete? 2. The expectation 3. The reality 4. The tale of the first few keystrokes 5. What is in the pipeline?

Improvements
• Link suggestion entries to documents
  • wFST entries store an additional unique document id
  • facilitates near-real-time search
  • enables retrieving arbitrary document fields
  • one step closer to using wFST index in normal queries
• Suggestion boosting
• see: LUCENE-6339 / ES Completion Suggester post-2.x

Suggestion boosting
• Idea: boost suggestions based on context value by adjusting the edge weights
• Using geohash as context, the search can be biased towards entries whose geohash context is closer to that of the query
• Example: boosting scheme for a query geohash “gfm673rhb8”
Context geohash  Distance from query location  Boost factor
gfmvd5u3rv       75                            3
gfmj8qvdsb       167                           2
gfz0zxuu70       324                           1
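One way to read the boosting scheme above is as a per-context multiplier on the index-time weight. The sketch below assumes exactly that (the slides do not spell out how the boost and the weight are combined), uses the boost factors from the table, and ranks a few made-up entries whose names and weights are purely hypothetical.

```python
# Boost factors copied from the table above, keyed by context geohash.
BOOSTS = {"gfmvd5u3rv": 3, "gfmj8qvdsb": 2, "gfz0zxuu70": 1}


def rank(entries, boosts):
    """entries: (suggestion, context_geohash, weight) rows; best boosted weight first."""
    scored = [
        (weight * boosts[geohash], suggestion)   # assumption: boost multiplies weight
        for suggestion, geohash, weight in entries
        if geohash in boosts                     # contexts outside the scheme are dropped
    ]
    return [suggestion for _, suggestion in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    # Hypothetical entries; only the geohashes come from the slide.
    entries = [("cafe a", "gfmvd5u3rv", 2),
               ("cafe b", "gfz0zxuu70", 5),
               ("cafe c", "gfmj8qvdsb", 2)]
    print(rank(entries, BOOSTS))
```

In this toy data a nearby entry with a modest weight overtakes a distant entry with a higher one, which is the bias the slide describes.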

Conclusion
• Autocomplete systems must be responsive, serve relevant results and handle common user omissions and typos
• Good autocomplete solutions will guide users to the right content in a few keystrokes
• In practice, quality suggestions with proper weights are necessary for a great autocomplete solution
• Completion (wFST) indices are optimized for ranked prefix queries

Thanks for listening! Questions?
