Who Am I?
• Daniel Lindsley
• From Lawrence, KS
• I run Toast Driven
• Consulting & Open Source
• Primary author of Haystack
• Adds pluggable search to Django
The Goal
• Teach you how search works
• Increase your comfort with other engines
• NOT to develop yet another engine
Why Care About Search?
“Doesn’t Google handle that?”
Why In-House Search?
• Standard crawlers have to scrape HTML
• You know the data model better than they do
• Maybe it’s not a web app at all!
Core Concepts
• Document-based
• NEVER just looking through a string
• Inverted Index
• Stemming
• N-gram
• Relevance

Engine
The black box you hand a query to & get results from.
Document
A text blob with optional metadata.
Corpus
The collection of all documents.
Stopword
A short word that doesn’t contribute to relevance & is typically ignored.
“and”, “a”, “the”, “but”, etc.
Stemming
Finding the root of a word.
Segments
Sharded data storing the inverted index.
Relevance
The algorithm(s) used to rank the results based on the query.
Faceting
Providing counts of documents within the results that match certain criteria.
“Drill-down”
Slide 62 text
Boost
Slide 63
Slide 63 text
No content
Slide 64
Slide 64 text
Boost
Artificially enhancing the relevance of certain
documents based on a condition.
Slide 65
Slide 65 text
This concludes the Funny Cat Pictures
portion of this presentation.
Please remain seated and calm.
Slide 66
Slide 66 text
No content
Slide 67
Slide 67 text
Indexing
• Four Main Components
• Receiving/Storing Documents
• Tokenization
• Generating Terms
• Indexing the Terms
Documents
• NOT a row in the DB
• Think blob of text + metadata
• Text quality is THE most important thing!
• Flat, NOT relational!
• Denormalize, denormalize, denormalize!
Tokenization
• Using the text blob, you:
• Split on whitespace
• Lowercase
• Filter out stopwords
• Strip punctuation
• Etc.

The point is to
Normalize the tokens.
Consistent little atomic units we can assign meaning to & work with.
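A minimal sketch of that pipeline (the stopword list here is illustrative, not exhaustive):

    import string

    STOPWORDS = {"and", "a", "an", "the", "but", "of", "to"}

    def tokenize(blob):
        # Split on whitespace, lowercase, strip punctuation, drop stopwords.
        tokens = []
        for chunk in blob.lower().split():
            token = chunk.strip(string.punctuation)
            if token and token not in STOPWORDS:
                tokens.append(token)
        return tokens

    # tokenize("The dog ate my homework!") -> ['dog', 'ate', 'my', 'homework']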
Stemming
• To avoid manually searching through the whole blob, you tokenize
• More post-processing
• THEN! you find the root word

Stemming (cont.)
• These become the terms in the inverted index
• When you do the same to the query, you can match them up
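A toy suffix-stripper to make the idea concrete (real engines use a proper algorithm such as Porter's; this one is purely illustrative):

    def naive_stem(token):
        # Strip a few common English suffixes, keeping at least 3 characters.
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[:-len(suffix)]
        return token

    # naive_stem("searching") -> 'search'
    # naive_stem("searches")  -> 'search'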
Stemming (cont.)
• Cons:
• Stemming only works well if you know the grammatical structure of the language
• Most are specific to English, though other stemmers are available
• Hard to make work cross-language

How Do We Solve This Shortcoming?
Let’s generate the terms from a different angle...
N-grams
• Solves some of the shortcomings of stemming with new tradeoffs
• Passes a "window" over the tokenized data
• These windows of data become the terms in the index
N-grams (cont.)
• Examples (gram size of 3):
• hello world
• [‘hel’, ‘ell’, ‘llo’, ‘wor’, ‘orl’, ‘rld’]
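A sketch of that windowing over a single token:

    def ngrams(token, size=3):
        # Slide a fixed-size window across the token.
        return [token[i:i + size] for i in range(len(token) - size + 1)]

    # ngrams("hello") -> ['hel', 'ell', 'llo']
    # ngrams("world") -> ['wor', 'orl', 'rld']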
Edge N-grams (cont.)
• Typically used with multiple gram sizes
• Examples (gram size of 3 to 6):
• hello world
• [‘hel’, ‘hell’, ‘hello’, ‘wor’, ‘worl’, ‘world’]
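A sketch of the edge variant, where every gram is anchored to the start of the token:

    def edge_ngrams(token, min_size=3, max_size=6):
        # Grow the window from the front ("edge") of the token.
        return [token[:size]
                for size in range(min_size, min(max_size, len(token)) + 1)]

    # edge_ngrams("hello") -> ['hel', 'hell', 'hello']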
N-grams (cont.)
• Pros:
• Great for autocomplete (matches small fragments quickly)
• Works across languages (even Asian!)
N-grams (cont.)
• Cons:
• Lots more terms in the index
• Initial quality can suffer a little
Inverted Index
• The heart of the engine
• Like a dictionary
• Keys matter (terms from all docs)
• Stores position & document IDs
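A sketch of that structure, reusing the tokenize & naive_stem helpers from earlier:

    from collections import defaultdict

    def build_index(docs):
        # Map each term to {doc_id: [positions]}.
        index = defaultdict(dict)
        for doc_id, blob in docs.items():
            for position, token in enumerate(tokenize(blob)):
                term = naive_stem(token)
                index[term].setdefault(doc_id, []).append(position)
        return index

    # build_index({"doc1": "hello searching world"})
    # -> {'hello': {'doc1': [0]}, 'search': {'doc1': [1]}, 'world': {'doc1': [2]}}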
Segments
• Lots of different ways to do this
• Many follow Lucene
• We’re going to cheat & take a slightly simpler approach...

Segments
• Flat files
• Hashed keys
• Always sorted
• Use JSON for the position/document data
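A sketch of that storage scheme (file naming & layout here are illustrative, not microsearch's exact format):

    import hashlib
    import json
    import os

    SEGMENT_DIR = "segments"

    def segment_path(term):
        # Hash the term; the first few hex digits pick the flat file it lives in.
        digest = hashlib.md5(term.encode("utf-8")).hexdigest()
        return os.path.join(SEGMENT_DIR, digest[:6] + ".index")

    def write_term(term, postings):
        # One tab-separated line per term, JSON postings, file kept sorted.
        os.makedirs(SEGMENT_DIR, exist_ok=True)
        path = segment_path(term)
        lines = []
        if os.path.exists(path):
            with open(path) as f:
                lines = [line for line in f if not line.startswith(term + "\t")]
        lines.append("%s\t%s\n" % (term, json.dumps(postings)))
        with open(path, "w") as f:
            f.writelines(sorted(lines))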
Searching
• Three Main Components
• Query Parser
• Index Reader
• Scoring (Relevance)
Query Parser
• Parse out the structure
• Process the elements the same way you prepared the document
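A minimal sketch, reusing the tokenize & naive_stem helpers from earlier (a real parser would also handle operators, quotes & fields):

    def parse_query(query):
        # Prepare the query exactly the way documents were prepared.
        return [naive_stem(token) for token in tokenize(query)]

    # parse_query("Searching the World!") -> ['search', 'world']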
Index Reader
• Per-term, hash the term to get the right file
• Rip through & collect all the results of positions/documents
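A matching reader, reusing segment_path from the segments sketch:

    def read_term(term):
        # Hash the term to find its segment file, then pull its postings line.
        path = segment_path(term)
        if not os.path.exists(path):
            return {}
        with open(path) as f:
            for line in f:
                stored_term, _, payload = line.partition("\t")
                if stored_term == term:
                    return json.loads(payload)
        return {}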
Scoring
• Reorder the collection of documents based on how well each fits the query
• Lots of choices
• BM25
• Phased
• Google’s PageRank
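For example, one common formulation of BM25 for a single term/document pair (sum it over the query terms; k1 & b are the usual tuning knobs):

    import math

    def bm25(term_freq, doc_len, avg_doc_len, doc_count, docs_with_term,
             k1=1.2, b=0.75):
        # IDF: rarer terms are worth more. The +1 keeps it non-negative.
        idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
        # TF: more occurrences help, with diminishing returns,
        # normalized by document length.
        tf = (term_freq * (k1 + 1)) / (
            term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
        return idf * tf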
Demo-time!
Advanced Topics
Here be dragons...

Faceting
• For a given field, collect all terms
• Count the unique document IDs for each
• Order by descending count
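A sketch against the {term: {doc_id: [positions]}} index shape used above (the example counts are made up):

    def facet(index, field_terms):
        # Unique documents per term, ordered by descending count.
        counts = {term: len(index.get(term, {})) for term in field_terms}
        return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

    # facet(index, ['python', 'django']) -> [('python', 42), ('django', 17)]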
Boost
• During the scoring process
• If a condition is met, alter the score accordingly
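A sketch; the condition fields & multipliers are illustrative:

    def boosted_score(score, doc, boosts):
        # e.g. boosts = {"is_sticky": 2.0} doubles the score of sticky docs.
        for field, multiplier in boosts.items():
            if doc.get(field):
                score *= multiplier
        return score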
More Like This
• Collect all the terms for a given document
• Sort based on how many times a document is seen in the set
• This is a simplistic view
• More complete solutions use NLP to increase quality
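A sketch of the simplistic version, again using the {term: {doc_id: [positions]}} index shape:

    from collections import Counter

    def more_like_this(index, doc_terms, exclude_id):
        # Count how often other documents share this document's terms.
        seen = Counter()
        for term in doc_terms:
            for doc_id in index.get(term, {}):
                if doc_id != exclude_id:
                    seen[doc_id] += 1
        return [doc_id for doc_id, _ in seen.most_common()]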
microsearch
https://github.com/toastdriven/microsearch
A complete version of everything presented here.