What is this "Search" that you speak of? by Honza Král

Pycon ZA
October 06, 2017

Full-text search is hard, or is it? In this talk we will go from the theory and background of search engines all the way to implementing your own search engine in Python.

This process should give everyone insight into how search engines work that can then be applied even when using production-ready systems like Elasticsearch.

Transcript

  1. What is this "search" that you speak of?? @honzakral

  2. None
  3. "unstructured"

  4. Looking for content

  5. grep -i -r 'web.*framework'

  6. WHERE text ILIKE '%python%'
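
Slides 5 and 6 both answer a query by scanning every document. As a rough illustration of what that costs, here is a minimal Python sketch of the same two lookups. The sample documents and the function names `grep_scan` and `ilike_scan` are made up for this example and are not from the talk.

```python
import re

# Hypothetical sample corpus, not taken from the talk.
DOCS = [
    "Django is a web framework for perfectionists",
    "Python is a programming language",
    "Flask: a micro WEB development framework",
]

def grep_scan(docs, pattern):
    """Ids of documents matching a regex, roughly like `grep -i`."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [i for i, doc in enumerate(docs) if regex.search(doc)]

def ilike_scan(docs, needle):
    """Ids of documents containing needle, roughly like ILIKE '%needle%'."""
    needle = needle.lower()
    return [i for i, doc in enumerate(docs) if needle in doc.lower()]
```

Here `grep_scan(DOCS, 'web.*framework')` returns `[0, 2]` and `ilike_scan(DOCS, 'python')` returns `[1]`. Both functions read every document on every query, which is the cost an index avoids.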

  7. long, long time ago...

  8. long, long time ago... Bible concordance, finished 1230

  9. 1230

  10. None
  11. Demo Time!

  12. {
        'description': {
          ...
          'programming': {1},
          'python': {0, 1},
          'quick': {0, 1},
          'reinvent': {0},
          ...
        },
        'title': { ... }
      }
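  13. from collections import defaultdict

      def index_docs(docs, *fields):
          # field name -> token -> set of ids of documents containing the token
          index = defaultdict(
              lambda: defaultdict(set))
          for id, doc in enumerate(docs):
              for field in fields:
                  for token in analyze(doc[field]):
                      index[field][token].add(id)
          return index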
  14. import re

      SPLIT_RE = re.compile(r'[^a-zA-Z0-9]')

      def tokenize(text):
          # skip the empty strings re.split returns between adjacent separators
          yield from (t for t in SPLIT_RE.split(text) if t)

      def lowercase(tokens):
          for t in tokens:
              yield t.lower()

      SYNONYMS = {
          'rapid': 'quick',
      }

      def synonyms(tokens):
          for t in tokens:
              yield SYNONYMS.get(t, t)

      def analyze(text):
          tokens = tokenize(text)
          for token_filter in (lowercase, synonyms):
              tokens = token_filter(tokens)
          yield from tokens
  15. COMBINE = {
          'OR': set.union,
          'AND': set.intersection,
      }

      def search_in_fields(index, query, fields):
          for t in analyze(query):
              yield COMBINE['OR'](*(index[f][t] for f in fields))

      def search(index, query, operator='AND', fields=None):
          fields = fields or index.keys()
          combine = COMBINE[operator]
          return combine(*search_in_fields(index, query, fields))
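
To see how slides 13 to 15 fit together, the sketch below condenses them into one runnable script and builds an index over two documents. The documents are hypothetical, chosen only so that the index reproduces the entries visible on slide 12. One deliberate difference from the slide code, marked in a comment, is that empty tokens are dropped.

```python
import re
from collections import defaultdict

SPLIT_RE = re.compile(r'[^a-zA-Z0-9]')
SYNONYMS = {'rapid': 'quick'}
COMBINE = {'OR': set.union, 'AND': set.intersection}

def analyze(text):
    """Tokenize, lowercase, map synonyms (slide 14, condensed)."""
    for token in SPLIT_RE.split(text):
        if token:  # assumption: drop empty strings between adjacent separators
            token = token.lower()
            yield SYNONYMS.get(token, token)

def index_docs(docs, *fields):
    """Build field -> token -> set of document ids (slide 13)."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in enumerate(docs):
        for field in fields:
            for token in analyze(doc[field]):
                index[field][token].add(doc_id)
    return index

def search(index, query, operator='AND', fields=None):
    """Match each token in any field, then combine the tokens (slide 15)."""
    fields = list(fields or index.keys())
    per_token = [
        COMBINE['OR'](*(index[f][t] for f in fields))
        for t in analyze(query)
    ]
    return COMBINE[operator](*per_token)

# Hypothetical documents, not from the talk.
DOCS = [
    {'title': 'Django',
     'description': 'A rapid way to build Python web apps, do not reinvent the wheel'},
    {'title': 'Python',
     'description': 'Python is a programming language for quick development'},
]

index = index_docs(DOCS, 'title', 'description')
```

With this index, `search(index, 'rapid python')` returns `{0, 1}`, because "rapid" is analyzed to "quick" at query time just as it was at index time. `search(index, 'programming reinvent')` returns an empty set under the default `AND`, while `operator='OR'` returns `{0, 1}`. Restricting the fields with `search(index, 'python', fields=['title'])` returns `{1}`.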
  16. Real world

  17. None
  18. Dictionary dict -> list

  19. Postings List set -> list

  20. Combine set union/intersect -> merge lists
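
Slides 18 to 20 only name the changes: sorted lists replace the dict and the sets, and combining becomes a merge. As a sketch of what "merge lists" could look like, here is the textbook two-pointer merge over sorted postings lists. It assumes each list holds strictly increasing document ids. The function names are made up, and this is not necessarily how the talk's code or any particular engine does it.

```python
def intersect_sorted(a, b):
    """AND: ids present in both sorted postings lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b any more
        else:
            j += 1  # b[j] cannot appear in a any more
    return out

def union_sorted(a, b):
    """OR: ids present in either sorted postings list, still sorted."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out
```

For example, `intersect_sorted([0, 1, 4, 7], [1, 2, 7, 9])` returns `[1, 7]` and `union_sorted` on the same inputs returns `[0, 1, 2, 4, 7, 9]`. Both walk each list once and need no hashing, which suits postings that are stored sequentially.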

  21. Complex Queries

  22. Prefix py*
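
One way a prefix query such as `py*` can use a sorted term list: all terms sharing a prefix sit next to each other, so a binary search finds the first one and a short scan collects the rest. The sketch below assumes a plain dict of postings plus a sorted list of its keys. The names `prefix_terms` and `prefix_search` and the sample data are hypothetical.

```python
from bisect import bisect_left

def prefix_terms(sorted_terms, prefix):
    """All terms starting with prefix, found via binary search."""
    matches = []
    # First position whose term is >= prefix; matching terms start here.
    pos = bisect_left(sorted_terms, prefix)
    while pos < len(sorted_terms) and sorted_terms[pos].startswith(prefix):
        matches.append(sorted_terms[pos])
        pos += 1
    return matches

def prefix_search(postings, sorted_terms, prefix):
    """Union of the postings of every term matching the prefix."""
    result = set()
    for term in prefix_terms(sorted_terms, prefix):
        result |= postings[term]
    return result

# Hypothetical postings: term -> set of document ids.
POSTINGS = {
    'programming': {1},
    'pycon': {2},
    'python': {0, 1},
    'quick': {0, 1},
}
TERMS = sorted(POSTINGS)
```

Here `prefix_terms(TERMS, 'py')` returns `['pycon', 'python']` and `prefix_search(POSTINGS, TERMS, 'py')` returns `{0, 1, 2}`.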

  23. Phrase "monty python"
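
A phrase query needs to know where each token occurs, not just in which document. A minimal sketch, assuming a positional index (token -> document id -> list of positions) and simple whitespace tokenization in place of the analyzer from slide 14, might look like this. The function names and sample documents are invented for illustration.

```python
from collections import defaultdict

def index_positions(docs):
    """Build token -> doc id -> list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Ids of documents where the phrase tokens appear consecutively."""
    tokens = phrase.lower().split()
    if not tokens or any(t not in index for t in tokens):
        return set()
    # Only documents containing every token can contain the phrase.
    candidates = set(index[tokens[0]])
    for t in tokens[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        following = [set(index[t][doc_id]) for t in tokens[1:]]
        for start in index[tokens[0]][doc_id]:
            # Token k+1 of the phrase must sit at position start + k + 1.
            if all(start + k + 1 in positions
                   for k, positions in enumerate(following)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical documents, not from the talk.
DOCS = [
    "monty python and the holy grail",
    "python by monty",
    "the python monty python sketch",
]
positional_index = index_positions(DOCS)
```

With these documents, `phrase_search(positional_index, "monty python")` returns `{0, 2}`. Document 1 contains both words but not next to each other in that order, so it is excluded.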

  24. http://bit.ly/searchpy

  25. Thank you! @honzakral http://bit.ly/searchpy