
Building A Python-based Search Engine

PyCon 2012 talk - A talk introducing the fundamentals of search engines via a small pure Python library.

daniellindsley

March 11, 2012

Transcript

  1. Who Am I? • Daniel Lindsley • From Lawrence, KS • I run Toast Driven • Consulting & Open Source • Primary author of Haystack • Adds pluggable search to Django
  4. The Goal • Teach you how search works • Increase

    your comfort with other engines • NOT to develop yet another engine
  5. Why In-House Search? • Standard crawlers have to scrape HTML • You know the data model better than they do • Maybe it’s not a web app at all!
  7. Core Concepts • Document-based • NEVER just looking through a string • Inverted Index • Stemming • N-gram • Relevance
  9. Terminology • Engine • Document • Corpus • Stopword •

    Stemming • Position • Segments • Relevance • Faceting • Boost
  10. Stopword A short word that doesn’t contribute to relevance &

    is typically ignored. “and”, “a”, “the”, “but”, etc.
  11. Documents • NOT a row in the DB • Think blob of text + metadata • Text quality is THE most important thing! • Flat, NOT relational! • Denormalize, denormalize, denormalize!
  14. Tokenization • Using the text blob, you: • Split on whitespace • Lowercase • Filter out stopwords • Strip punctuation • Etc.
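The tokenization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's actual code; the function name and the tiny stopword list are made up for the example:

```python
import re

# Illustrative stopword list; real engines ship longer, language-specific lists.
STOPWORDS = {"and", "a", "the", "but", "of", "to", "is", "in"}

def tokenize(blob):
    # Split on whitespace, lowercase, strip punctuation, drop stopwords.
    tokens = blob.lower().split()
    tokens = [re.sub(r"[^a-z0-9]", "", token) for token in tokens]
    return [token for token in tokens if token and token not in STOPWORDS]
```

For example, `tokenize("Hello, world! The cat and the hat.")` yields `["hello", "world", "cat", "hat"]`.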
  17. The point is to Normalize the tokens. Consistent little atomic

    units we can assign meaning to & work with.
  18. Stemming • To avoid manually searching through the whole blob,

    you tokenize • More post-processing • THEN! you find the root word
  19. Stemming (cont.) • These become the terms in the inverted

    index • When you do the same to the query, you can match them up
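The root-word step could be sketched as a toy suffix stripper. This is only a stand-in to show the idea; real engines use the Porter or Snowball stemming algorithms, not anything this crude:

```python
def naive_stem(token):
    # Toy stemmer: strip a few common English suffixes, keeping at least
    # three characters of root. Real engines use Porter/Snowball stemmers.
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token
```

Run over both documents and queries, "searching" and "searches" collapse to the same term, "search", so they match in the index.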
  20. Stemming (cont.) • Cons: • Stemming only works well if you know the grammatical structure of the language • Most are specific to English, though other stemmers are available • Hard to make work cross-language
  23. N-grams • Solves some of the shortcomings of stemming with new tradeoffs • Passes a "window" over the tokenized data • These windows of data become the terms in the index
  25. N-grams (cont.) • Example (gram size of 3): • hello world • [‘hel’, ‘ell’, ‘llo’, ‘wor’, ‘orl’, ‘rld’]
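The sliding window above is a one-liner in Python (the function name is illustrative):

```python
def ngrams(token, gram_size=3):
    # Slide a fixed-size window across the token; each window becomes a term.
    return [token[i:i + gram_size] for i in range(len(token) - gram_size + 1)]
```

`ngrams("hello")` gives `['hel', 'ell', 'llo']` and `ngrams("world")` gives `['wor', 'orl', 'rld']`, matching the example above.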
  29. Edge N-grams • Typically used with multiple gram sizes • Example (gram sizes of 3 to 6): • hello world • [‘hel’, ‘hell’, ‘hello’, ‘wor’, ‘worl’, ‘world’]
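Edge n-grams anchor the window at the start of the token and grow it instead of sliding it. A minimal sketch (function name illustrative):

```python
def edge_ngrams(token, min_gram=3, max_gram=6):
    # Anchor the window at the start of the token and grow it,
    # producing one gram per size from min_gram up to max_gram.
    max_size = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, max_size + 1)]
```

`edge_ngrams("hello")` gives `['hel', 'hell', 'hello']`, which is why this variant works so well for matching partially typed words.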
  35. N-grams (cont.) • Pros: • Great for autocomplete (matches small fragments quickly) • Works across languages (even Asian languages!)
  36. N-grams (cont.) • Cons: • Lots more terms in the index • Initial quality can suffer a little
  37. Inverted Index • The heart of the engine • Like a dictionary • Keys matter (terms from all docs) • Stores position & document IDs
  39. Segments • Lots of different ways to do this •

    Many follow Lucene • We’re going to cheat & take a slightly simpler approach...
  40. Segments • Flat files • Hashed keys • Always sorted

    • Use JSON for the position/document data
  41. Query Parser • Parse out the structure • Process the

    elements the same way you prepared the document
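The talk doesn't spell out a query syntax; purely as an illustration, a parser for a made-up `+required`/`-excluded` mini-syntax could pull out the structure like this:

```python
def parse_query(query):
    # Pull the structure out of a tiny made-up syntax: "+term" is required,
    # "-term" is excluded, anything else is optional. Each term would then
    # go through the same tokenize/stem pipeline used on the documents.
    required, excluded, optional = [], [], []
    for raw in query.lower().split():
        if raw.startswith("+"):
            required.append(raw[1:])
        elif raw.startswith("-"):
            excluded.append(raw[1:])
        else:
            optional.append(raw)
    return {"required": required, "excluded": excluded, "optional": optional}
```

The key point is the second bullet: whatever analysis (lowercasing, stemming, n-grams) you applied at index time must be applied to the query terms too, or they will never match.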
  42. Index Reader • Per-term, hash the term to get the

    right file • Rip through & collect all the results of positions/documents
  43. Scoring • Reorder the collection of documents based on how well each fits the query • Lots of choices • BM25 • Phased • Google’s PageRank
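One common choice, BM25, weights each query term roughly as below. This is a sketch of the standard formula from the literature (the `k1`/`b` defaults are the usual textbook values), not the exact variant any particular engine ships:

```python
import math

def bm25_weight(term_freq, doc_length, avg_doc_length,
                total_docs, docs_with_term, k1=1.2, b=0.75):
    # One query term's BM25 contribution to one document's score:
    # rare terms score higher (idf), repeated terms score higher with
    # diminishing returns (tf), and long documents are penalized (b).
    idf = math.log(
        (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)
    tf = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_length / avg_doc_length))
    return idf * tf
```

Summing this weight over all query terms gives each document a score, and reordering by that score gives the result list.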
  46. Faceting • For a given field, collect all terms • Count the length of the unique document ids for each • Order by descending count
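The three faceting steps above translate almost directly. A minimal sketch, assuming the field's terms have already been collected into a term-to-document-ids mapping:

```python
def facet_counts(field_index):
    # field_index maps each term in a field to the set of document ids
    # containing it. Count unique documents per term, highest count first
    # (ties broken alphabetically for a stable order).
    counts = [(term, len(doc_ids)) for term, doc_ids in field_index.items()]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))
```

Given `{"python": {"d1", "d2", "d3"}, "search": {"d1"}, "django": {"d2", "d3"}}`, this returns `[("python", 3), ("django", 2), ("search", 1)]`.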
  48. Boost • During the scoring process • If a condition

    is met, alter the score accordingly
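As a tiny illustration of boosting, suppose boosts are expressed as field/value conditions with a multiplier (this representation is made up for the example):

```python
def apply_boosts(score, document, boosts):
    # boosts maps a field name to (matching value, multiplier); when the
    # condition is met during scoring, alter the score accordingly.
    for field, (value, factor) in boosts.items():
        if document.get(field) == value:
            score *= factor
    return score
```

So `apply_boosts(1.5, {"author": "daniel"}, {"author": ("daniel", 2.0)})` doubles the score to `3.0`, while a non-matching document keeps its original score.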
  49. More Like This • Collect all the terms for a given document • Sort based on how many times a document is seen in the set • This is a simplistic view • More complete solutions use NLP to increase quality
  52. Additional Resources
    • http://nlp.stanford.edu/IR-book/
    • http://shop.oreilly.com/product/9780596529321.do
    • http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
    • http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/fileformats.html
    • http://www.sqlite.org/wal.html
    • http://snowball.tartarus.org/
    • http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/