
Implementing an autosuggest module



Transcript

  1. Implementing an autosuggest module

     Content:
     1. Why Autosuggest?
     2. Which Data?
     3. Clean Data
     4. Which Algorithm?
     5. Deploy and integrate

     1. Why Autosuggest?
     - ⏱ Save your users time: similar to autocomplete, it suggests the most probable keywords.
     - Avoid misspellings.
     - It induces a response-time constraint of ≈ 16 ms.
     - Provide your users with relevant keywords and/or the most frequent queries: it's very likely that someone else had the exact same problem before you.
     - Optimize an objective: an e-commerce website might want to suggest queries that maximize its revenue.
  2. Which Data?

     Ideally, logs of user queries. These are often not available when designing a new product. Alternatives:
     1. Find data that looks like user data: web scraping (search engines, forums, ...).
     2. Use data from your own collections (titles, noun chunks, keyword extraction, ...), as sketched below.

     You will need a collection of at least a few thousand queries.
     ⚠ Queries do not necessarily have to match documents in your database.
     Example: use forum post titles as pseudo-queries.
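     For alternative 2, a minimal sketch of turning your own collection into pseudo-queries with noun chunks, assuming spaCy and its small French model are installed (the titles list is a hypothetical placeholder for your collection):

     import spacy

     nlp = spacy.load("fr_core_news_sm")  # assumes the small French model is installed

     titles = ["Comment obtenir un acte de naissance ?",
               "Renouveler une carte d'identité perdue"]  # hypothetical collection titles

     pseudo_queries = []
     for doc in nlp.pipe(titles):
         # every noun chunk in a title becomes a candidate query
         pseudo_queries.extend(chunk.text.lower() for chunk in doc.noun_chunks)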
  3. Cleaning Data: example with service-public.fr logs

     Cleaning optionally involves:
     - lowercasing
     - removing some punctuation
     - removing noise
     - removing accents (ASCII folding)
     - removing long digit runs
     - removing dates, user data, ...
     - removing uninformative queries ("help", "jechangedecoordonnees")
     - filtering out queries with fewer than T occurrences
     - removing queries with a small edit distance to frequent queries ("carte d'identité" vs. "carte d identité")
     - removing misspelled queries (usually around 20%)

     Most of the time, the cleaning phase is dataset-dependent; a sketch follows the code below.

     In [144]:
     with open("./logs-sp.txt") as f:
         logs = f.read().splitlines()
     logs = [l.split(";") for l in logs]
     logs[:10]  # couples (query text, number of occurrences)

     Out[144]: [['filtre', '508627'], ['etatcivil', '381965'], ['filtre', '295580'], ['jechangedecoordonnees', '236041'], ['inscriptionelectorale', '179999'], ['acte de naissance', '79663'], ['fcb', '43800'], ['insregistrefr', '32524'], ['md', '31369'], ["carte d'identité", '21107']]

     In [155]:
     logs = logs[5:]  # remove the first 5 queries
     print(len(logs))                       # number of unique requests
     print(sum([int(l[1]) for l in logs]))  # number of aggregated requests

     199966
     2439050
  4. Which Algorithm?

     - Suggestion time should remain below 50 ms.
     - Keep complexity low: your query/document list will probably evolve and grow over time.
     - The algorithm has 2 steps:
       - an indexing step (doesn't need to be fast)
       - a search step (keep it as fast as possible)
     - It's easy to trade memory for speed at indexing time.

     Simple baseline:
  5. In [277]:
     # Simple baseline:
     # at search time: sort requests by frequency,
     # then filter them by prefix
     prefix = "comment"
     topn = 15
     [l for l in logs if l[0].startswith(prefix)][:topn]

     Out[277]: [['comment obtenir le récépissé', '332'], ['comment voter par procuration', '154'], ['comment consulter une convention collective', '78'], ['comment voter', '41'], ["comment voter pour le changement d'heure", '36'], ["comment votre pour le changement d'heures", '27'], ['comment le signaler', '25'], ['comment se faire recenser', '20'], ['comment calculer le fermage', '20'], ['comment obtenir le code de cession', '20'], ['comment consulter une convention collective', '19'], ["comment prouver sa qualité d'héritier", '16'], ['comment porter plainte', '15'], ['comment faire si', '15'], ['comment calculer le fermage', '14']]

     Simulate with ipywidgets (next slide).
  6. Complexity is O(N), with N the number of queries.

     In [185]:
     from ipywidgets import interact

     def search(prefix):
         return [l for l in logs if l[0].startswith(prefix)][:topn]

     inter = interact(search, prefix="")

     prefix: carte
     [['carte grise', '20512'], ['carte identité', '3591'], ['carte identite', '2878'], ['carte consulaire', '2453'], ["carte grise d'un véhicule d'occasion", '2388'], ['carte de séjour', '2344'], ['carte électorale', '2300'], ['carte vitale', '2173'], ['carte bancaire', '1120'], ["carte européenne d'assurance maladie", '926'], ["carte nationale d'identité d'un majeur", '848'], ["carte grise changement d'adresse", '769'], ['carte grise changement adresse', '611'], ['carte d identite', '599'], ["carte d'identite", '597']]

     In [168]:
     %timeit [l for l in logs if l[0].startswith(prefix)][:topn]

     37.6 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

     In [198]:
     # let's make the logs list 10 times bigger:
     big_logs = logs * 10
     big_logs = sorted(big_logs, key=lambda x: int(x[1]), reverse=True)
     len(big_logs)

     Out[198]: 1999660
  7. In [197]:
     %timeit [l for l in big_logs if l[0].startswith(prefix)][:topn]

     353 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

     Must read: a great blog post about autosuggest optimization and data structures (https://medium.com/related-works-inc/autosuggest-retrieval-data-structures-algorithms-3a902c74ffc8)

     Trade memory for CPU time

     In [232]:
     # Idea: build a hash table of queries keyed by their first character,
     # then search only within the matching bucket
     class Search():
         def __init__(self, logs):
             self.index(logs)

         def index(self, logs):
             self.first_char = set([l[0][0] for l in logs])  # find all starting characters
             # make an index keyed by first character
             self.hashtab = {}
             for char in self.first_char:
                 self.hashtab[char] = [l for l in logs if l[0].startswith(char)]

         def search(self, prefix):
             subset = self.hashtab.get(prefix[0], [])  # empty list for unseen first characters
             return [l for l in subset if l[0].startswith(prefix)]

     In [239]:
     s = Search(logs)  # build the hashtable

     In [241]:
     %timeit s.search(prefix)  # perform the actual search

     11.6 ms ± 323 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
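     Going further in the direction the linked post explores, a sorted list plus binary search brings the prefix lookup itself down from O(N) to O(log N + k). A minimal sketch; the matching range comes back in lexicographic order, so it is re-ranked by frequency before being cut to topn:

     import bisect

     # indexing step: sort the queries once
     sorted_logs = sorted(logs, key=lambda l: l[0])
     keys = [l[0] for l in sorted_logs]

     def prefix_search(prefix, topn=15):
         # search step: locate the first query >= prefix, then scan while it matches
         i = bisect.bisect_left(keys, prefix)
         hits = []
         while i < len(keys) and keys[i].startswith(prefix):
             hits.append(sorted_logs[i])
             i += 1
         # re-rank the matching range by frequency
         return sorted(hits, key=lambda l: int(l[1]), reverse=True)[:topn]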
  8. You want a fancier algorithm?
     - handle typos with fuzzy search
     - tree search
     - seasonality
     - include user preferences
     - learning to rank
     - NLP models (language models, clustering, neural query embeddings, etc.)

     Blog posts from Etsy:
     - part 1: Data Structures and Optimization (https://medium.com/related-works-inc/autosuggest-retrieval-data-structures-algorithms-3a902c74ffc8)
     - part 2: NLP and Fancy Algorithms (https://medium.com/related-works-inc/autosuggest-ranking-d8a3242c2837)

     Deployment and integration with Elastic:
     - Each search triggers ≈2 suggestion requests, so it's better to have a dedicated microservice: if your application is overloaded, the autosuggest will probably be the first thing to go down.
     - We deployed it as a Flask/Gunicorn API (https://github.com/SocialGouv/code-du-travail-numerique/tree/master/packages/code-du-travail-nlp/api): a simple GET route that takes a prefix and returns an array of suggestions, sketched below.
     - Wired into your front end with a few lines of JavaScript (https://github.com/SocialGouv/code-du-travail-numerique/blob/master/packages/code-du-travail-frontend/src/common/Suggester.js), using debouncing to avoid flooding the API.

     NOTE: Elastic has a built-in suggester (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html) based on indexed documents (search-as-you-type).
  9. In [275]:
     # example: clean up the logs by removing similar requests
     from Levenshtein import distance

     def find_similar_queries(query, dist=1):
         sim = [distance(q[0], query) for q in logs]
         small_sim = [l for l, s in zip(logs, sim) if s <= dist]
         return small_sim

     In [276]:
     find_similar_queries("carte d'identité")  # takes ≈286 ms

     Out[276]: [["carte d'identite", '597'], ['carte d identité', '438'], ['carte d’identité', '410'], ["carte d'identit", '408'], ["carte d'identité", '170'], ['carte didentité', '54'], ["carte d'indentité", '43'], ["carte d'identitée", '28'], ["carte d'identié", '20'], ["carte d'idendité", '17'], ["carte d'dentité", '15'], ["cartes d'identité", '13'], ["carte d'identit", '12'], ["carte d'identitié", '11'], ["carte d'ientité", '11'], ["carte d'identité", '8'], ["carte d'identités", '8'], ["carte d 'identité", '6'], ['carte d´identité', '6'], ...]

     In [254]:
     logs[22]

     Out[254]: ['cerfa', '6269']
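     As a hypothetical follow-up, a merge step built on the same distance function: fold each rare variant into its frequent form and sum the counts. min_count is an arbitrary cutoff, and the nested loop is quadratic, so at ≈286 ms per query this belongs in the offline indexing step, not the search path:

     def merge_variants(logs, min_count=100, dist=1):
         # frequent queries act as canonical forms
         merged = {q: int(n) for q, n in logs if int(n) >= min_count}
         for q, n in logs:
             if int(n) >= min_count:
                 continue
             for canon in merged:
                 if distance(q, canon) <= dist:
                     merged[canon] += int(n)  # fold the variant's count into its canonical form
                     break
         return sorted(merged.items(), key=lambda x: x[1], reverse=True)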