Pycon ZA
October 06, 2017

What is this "Search" that you speak of? by Honza Král

Fulltext search is hard, or is it? In this talk we will go through the theory and background of search engines all the way to implementing your own search engine in Python.

This process should give everyone insight into how search engines work that can then be applied even when using production-ready systems like Elasticsearch.

Transcript

  1. {
         'description': {
             ...
             'programming': {1},
             'python': {0, 1},
             'quick': {0, 1},
             'reinvent': {0},
             ...
         },
         'title': {
             ...
         }
     }
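  2. from collections import defaultdict

     def index_docs(docs, *fields):
         # Inverted index: field name -> token -> set of document ids.
         # analyze() is defined on the next slide.
         index = defaultdict(lambda: defaultdict(set))
         for doc_id, doc in enumerate(docs):
             for field in fields:
                 for token in analyze(doc[field]):
                     index[field][token].add(doc_id)
         return index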
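  3. import re

     # Split on runs of non-alphanumeric characters.
     SPLIT_RE = re.compile(r'[^a-zA-Z0-9]+')

     def tokenize(text):
         # Drop the empty strings re.split() returns at the edges of the text.
         yield from (t for t in SPLIT_RE.split(text) if t)

     def lowercase(tokens):
         for t in tokens:
             yield t.lower()

     SYNONYMS = {
         'rapid': 'quick',
     }

     def synonyms(tokens):
         for t in tokens:
             yield SYNONYMS.get(t, t)

     def analyze(text):
         # Chain the token filters lazily over the token stream.
         tokens = tokenize(text)
         for token_filter in (lowercase, synonyms):
             tokens = token_filter(tokens)
         yield from tokens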
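  4. COMBINE = {
         'OR': set.union,
         'AND': set.intersection,
     }

     def search_in_fields(index, query, fields):
         # For each query token, union its postings across all fields.
         # .get() avoids adding empty entries to the index during a search.
         for t in analyze(query):
             yield COMBINE['OR'](*(index[f].get(t, set()) for f in fields))

     def search(index, query, operator='AND', fields=None):
         fields = fields or index.keys()
         combine = COMBINE[operator]
         # An empty query yields no sets; return an empty result instead of
         # calling combine() with no arguments, which raises TypeError.
         results = list(search_in_fields(index, query, fields))
         return combine(*results) if results else set()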
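To see how the slides fit together, the following is a minimal end-to-end sketch rather than the speaker's own demo. It restates the functions from the slides in compact form so that it runs on its own, and the two sample documents are hypothetical, chosen only so that the 'description' index contains the entries shown in slide 1.

```python
import re
from collections import defaultdict

SPLIT_RE = re.compile(r'[^a-zA-Z0-9]+')
SYNONYMS = {'rapid': 'quick'}
COMBINE = {'OR': set.union, 'AND': set.intersection}


def analyze(text):
    # Tokenize, lowercase, then map synonyms, in the same order as slide 3.
    for token in SPLIT_RE.split(text):
        if token:  # re.split can return empty strings at the edges
            token = token.lower()
            yield SYNONYMS.get(token, token)


def index_docs(docs, *fields):
    # Inverted index: field name -> token -> set of document ids.
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in enumerate(docs):
        for field in fields:
            for token in analyze(doc[field]):
                index[field][token].add(doc_id)
    return index


def search(index, query, operator='AND', fields=None):
    fields = list(fields or index.keys())
    # One set of matching ids per query token, unioned across fields.
    per_token = [
        set().union(*(index[f].get(t, set()) for f in fields))
        for t in analyze(query)
    ]
    return COMBINE[operator](*per_token) if per_token else set()


# Hypothetical sample documents (not from the talk).
docs = [
    {'title': 'Search from scratch',
     'description': 'A rapid way to reinvent search in Python'},
    {'title': 'Python tips',
     'description': 'Quick programming tricks for Python'},
]

index = index_docs(docs, 'title', 'description')

print(sorted(search(index, 'rapid python')))                         # [0, 1]
print(sorted(search(index, 'python programming')))                   # [1]
print(sorted(search(index, 'reinvent programming', operator='OR')))  # [0, 1]
print(sorted(search(index, 'python', fields=['title'])))             # [1]
```

Because the query goes through the same analyze() pipeline as the documents, a search for 'rapid' matches documents that only contain 'quick', and matching is case-insensitive.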