Implementing an autosuggest module

Transcript

  1. Implementing an autosuggest module

     Contents:
     1. Why Autosuggest?
     2. Which Data?
     3. Clean Data
     4. Which Algorithm?
     5. Deploy and integrate

     1. Why Autosuggest?
     - ⏱ Save your users time: similar to autocomplete, suggest the most probable keywords.
     - Avoid misspellings.
     - Induces a response-time constraint of ≈ 16 ms.
     - Provide your users with relevant keywords and/or the most frequent queries.
     - It's very likely that someone else had the exact same problem before you.
     - Optimize an objective: an e-commerce website might want to suggest queries that maximize its revenue.
  2. Which Data?

     Ideally, logs of user queries. These are often not available when designing a new product. Alternatives:
     1. Find data that looks like user data: web scraping (search engines, forums, ...).
     2. Use data from your own collections (titles, noun chunks, keyword extraction, ...).

     You will need a collection of at least a few thousand queries.
     ⚠ Queries do not need to match documents in your database exactly: forum post titles can serve as pseudo-queries (see the sketch below).
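     For instance, a minimal sketch of turning titles into pseudo-queries via noun chunks with spaCy (hypothetical: the slides don't prescribe a tool; it assumes a `titles` list and the French model `fr_core_news_sm` installed):

         import spacy

         # extract noun chunks from document titles as pseudo-queries
         nlp = spacy.load("fr_core_news_sm")  # assumes the French model is installed

         def pseudo_queries(titles):
             queries = []
             for doc in nlp.pipe(titles):
                 queries.extend(chunk.text.lower() for chunk in doc.noun_chunks)
             return queries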
  3. Cleaning Data: example with service-public.fr logs

     Cleaning optionally involves:
     - lowercasing
     - removing some punctuation
     - removing noise
     - removing accents (ASCII folding)
     - removing long digit runs
     - removing dates, ...
     - removing user data
     - removing uninformative queries ("help", "jechangedecoordonnees")
     - filtering out queries with an occurrence count under a threshold T
     - removing queries with a small edit distance to frequent queries ("carte d'identité" vs. "carte d identité")
     - removing misspelled queries (usually around 20%)

     Most of the time, the cleaning phase is dataset-dependent (a sketch of a few of these steps follows the code below).

     In [144]:
         with open("./logs-sp.txt") as f:
             logs = f.read().splitlines()
         logs = [l.split(";") for l in logs]
         logs[:10]  # pairs (query text, number of occurrences)

     Out[144]:
         [['filtre', '508627'], ['etatcivil', '381965'], ['filtre', '295580'],
          ['jechangedecoordonnees', '236041'], ['inscriptionelectorale', '179999'],
          ['acte de naissance', '79663'], ['fcb', '43800'], ['insregistrefr', '32524'],
          ['md', '31369'], ["carte d'identité", '21107']]

     In [155]:
         logs = logs[5:]  # remove the first 5 queries
         print(len(logs))  # number of unique queries
         print(sum([int(l[1]) for l in logs]))  # total number of aggregated queries

         199966
         2439050
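     A minimal sketch of a few of these steps (the threshold T and the stop-list are illustrative only; as noted above, the real pipeline is dataset-dependent):

         import re
         import unicodedata

         def clean_query(q):
             q = q.lower()
             q = unicodedata.normalize("NFKD", q).encode("ascii", "ignore").decode()  # strip accents
             q = re.sub(r"\d{4,}", " ", q)  # drop long digit runs (ids, phone numbers)
             return re.sub(r"\s+", " ", q).strip()

         T = 5  # illustrative minimum occurrence count
         stop = {"help", "jechangedecoordonnees"}
         cleaned = [(clean_query(q), int(n)) for q, n in logs]
         cleaned = [(q, n) for q, n in cleaned if n >= T and q not in stop]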
  4. Which Algorithm?

     - Suggestion time should remain below 50 ms.
     - Keep complexity low: your query/document list will probably evolve and grow over time.
     - The algorithm has 2 steps:
       - an indexing step (doesn't need to be fast)
       - a search step (keep it as fast as possible)
     - It's easy to trade memory for speed at indexing time.

     Simple baseline:
  5. In [277]:
         # Simple baseline:
         # keep the queries sorted by frequency,
         # then filter on the prefix at search time
         prefix = "comment"
         topn = 15
         [l for l in logs if l[0].startswith(prefix)][:topn]

     Out[277]:
         [['comment obtenir le récépissé', '332'],
          ['comment voter par procuration', '154'],
          ['comment consulter une convention collective', '78'],
          ['comment voter', '41'],
          ["comment voter pour le changement d'heure", '36'],
          ["comment votre pour le changement d'heures", '27'],
          ['comment le signaler', '25'],
          ['comment se faire recenser', '20'],
          ['comment calculer le fermage', '20'],
          ['comment obtenir le code de cession', '20'],
          ['comment consulter une convention collective', '19'],
          ["comment prouver sa qualité d'héritier", '16'],
          ['comment porter plainte', '15'],
          ['comment faire si', '15'],
          ['comment calculer le fermage', '14']]

     You can simulate this interactively with ipywidgets (slide 6).
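     Note that this baseline relies on `logs` already being sorted by descending frequency (the export above appears to be); if it weren't, a single sort at indexing time would restore that invariant:

         logs = sorted(logs, key=lambda l: int(l[1]), reverse=True)  # most frequent first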
  6. Complexity is O(N), with N the number of queries.

     In [185]:
         from ipywidgets import interact

         def search(prefix):
             return [l for l in logs if l[0].startswith(prefix)][:topn]

         inter = interact(search, prefix="")

     prefix: carte
         [['carte grise', '20512'], ['carte identité', '3591'], ['carte identite', '2878'],
          ['carte consulaire', '2453'], ["carte grise d'un véhicule d'occasion", '2388'],
          ['carte de séjour', '2344'], ['carte électorale', '2300'], ['carte vitale', '2173'],
          ['carte bancaire', '1120'], ["carte européenne d'assurance maladie", '926'],
          ["carte nationale d'identité d'un majeur", '848'], ["carte grise changement d'adresse", '769'],
          ['carte grise changement adresse', '611'], ['carte d identite', '599'], ["carte d'identite", '597']]

     In [168]:
         %timeit [l for l in logs if l[0].startswith(prefix)][:topn]

         37.6 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

     In [198]:
         # let's make the logs list 10 times bigger:
         big_logs = logs * 10
         big_logs = sorted(big_logs, key=lambda x: int(x[1]), reverse=True)
         len(big_logs)

     Out[198]: 1999660
  7. In [197]:
         %timeit [l for l in big_logs if l[0].startswith(prefix)][:topn]

         353 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

     Must read: a great blog post about autosuggest optimization and data structures
     (https://medium.com/related-works-inc/autosuggest-retrieval-data-structures-algorithms-3a902c74ffc8)

     Trade memory for CPU time:

     In [232]:
         # Idea: build a hashtable of queries keyed by their first character,
         # then search only within the matching bucket
         class Search():
             def __init__(self, logs):
                 self.index(logs)

             def index(self, logs):
                 self.first_char = set([l[0][0] for l in logs])  # find all starting characters
                 # build an index bucket for each first character
                 self.hashtab = {}
                 for char in self.first_char:
                     self.hashtab[char] = [l for l in logs if l[0].startswith(char)]

             def search(self, prefix):
                 subset = self.hashtab[prefix[0]]
                 return [l for l in subset if l[0].startswith(prefix)]

     In [239]:
         s = Search(logs)  # build the hashtable

     In [241]:
         %timeit s.search(prefix)  # perform the actual search

         11.6 ms ± 323 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
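     The same memory-for-speed trade can be pushed further by bucketing on the first k characters instead of one. A sketch (not from the slides; each entry is stored k times, so memory grows accordingly):

         from collections import defaultdict

         class PrefixSearch:
             def __init__(self, logs, k=3):
                 self.k = k
                 self.hashtab = defaultdict(list)
                 for l in logs:
                     key = l[0][:k]
                     # store the entry under every prefix of its key, so that
                     # shorter prefixes are also served straight from the table
                     for i in range(1, len(key) + 1):
                         self.hashtab[key[:i]].append(l)

             def search(self, prefix, topn=15):
                 bucket = self.hashtab.get(prefix[:self.k], [])
                 return [l for l in bucket if l[0].startswith(prefix)][:topn]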
  8. You want a fancier algorithm?
     - handle typos with fuzzy search
     - tree search
     - seasonality
     - include user preferences
     - learning to rank
     - NLP models (language models, clustering, neural query embeddings, etc.)

     Blog posts from Etsy:
     - part 1: Data Structures and Optimization (https://medium.com/related-works-inc/autosuggest-retrieval-data-structures-algorithms-3a902c74ffc8)
     - part 2: NLP and Fancy Algorithms (https://medium.com/related-works-inc/autosuggest-ranking-d8a3242c2837)

     Deployment and integration with Elastic:
     - Each search triggers ≈ 2 suggestion requests, so it's better to have a microservice: if your application is overloaded, the autosuggest will probably go down first.
     - We deployed it as a Flask/Gunicorn API (https://github.com/SocialGouv/code-du-travail-numerique/tree/master/packages/code-du-travail-nlp/api): a simple GET route that takes a prefix and returns an array of suggestions (sketched below).
     - Wrap it in your front end (a few lines of JavaScript (https://github.com/SocialGouv/code-du-travail-numerique/blob/master/packages/code-du-travail-frontend/src/common/Suggester.js) with debouncing to avoid flooding the API).

     NOTE: Elastic has a built-in suggester (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html) based on indexed documents (search as you type).
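     A minimal sketch of such a GET route, reusing the `Search` class from slide 7 (hypothetical route and parameter names; the real API lives in the linked repository):

         from flask import Flask, jsonify, request

         app = Flask(__name__)
         searcher = Search(logs)  # prefix index built once at startup

         @app.route("/suggest")
         def suggest():
             prefix = request.args.get("q", "").lower()
             if not prefix:
                 return jsonify([])
             # return query strings only, most frequent first
             return jsonify([l[0] for l in searcher.search(prefix)[:15]])

     And a sketch of Elastic's built-in completion suggester mentioned in the note (index and field names are illustrative): map a `completion` field, index your queries into it, then query it with a prefix. As Python dicts mirroring the JSON request bodies:

         # mapping for the suggestion index
         mapping = {"mappings": {"properties": {"suggest": {"type": "completion"}}}}

         # search-as-you-type request body
         body = {
             "suggest": {
                 "query-suggest": {"prefix": "carte", "completion": {"field": "suggest"}}
             }
         }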
  9. Example: clean up the logs by removing similar queries.

     In [275]:
         from Levenshtein import distance

         def find_similar_queries(query, dist=1):
             sim = [distance(q[0], query) for q in logs]
             small_sim = [l for l, s in zip(logs, sim) if s <= dist]
             return small_sim

     In [276]:
         find_similar_queries("carte d'identité")  # the scan takes ≈ 286 ms on these logs

     Out[276]:
         [["carte d'identite", '597'], ['carte d identité', '438'], ['carte d’identité', '410'],
          ["carte d'identit", '408'], ["carte d'identité", '170'], ['carte didentité', '54'],
          ["carte d'indentité", '43'], ["carte d'identitée", '28'], ["carte d'identié", '20'],
          ["carte d'idendité", '17'], ["carte d'dentité", '15'], ["cartes d'identité", '13'],
          ["carte d'identit", '12'], ["carte d'identitié", '11'], ["carte d'ientité", '11'],
          ["carte d'identité", '8'], ["carte d'identités", '8'], ["carte d 'identité", '6'],
          ['carte d´identité', '6'], ...]

     In [254]:
         logs[22]

     Out[254]: ['cerfa', '6269']
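     A hedged example of how this lookup could drive the cleanup itself: fold each group of near-duplicates into its most frequent spelling and sum the counts (illustrative only; the slides just show the lookup):

         def merge_similar(query):
             variants = find_similar_queries(query)
             canonical = max(variants, key=lambda l: int(l[1]))[0]  # keep the most frequent spelling
             total = sum(int(n) for _, n in variants)  # aggregate the counts
             return canonical, total

         merge_similar("carte d'identité")
         # -> ("carte d'identite", ...) on the logs above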