The Ghost in the Search Machine

Doug Turnbull, Search Relevance & Personalization Consultant
http://opensourceconnections.com
@softwaredoug
The Ghost in the Search Machine

I’m now available in book form!
https://www.manning.com/books/relevant-search
Discount code: turnbullmu (38% off)

Why Search
•  What does a dedicated search engine do
   o  that a database doesn’t?
•  Why not [MySQL | MongoDB | Cassandra | etc.]?

Strawman: Why Not MySQL?
•  Our mission: Find all the “Darth Vader” in SciFi StackExchange Posts!

P  U  C  V  Body
0  1  2  1  What exactly did Obiwan know about Anakin and Darth Vader before a New Hope started?   (Found!)
1  2  2  5  Been meaning to read the Foundation Series, what should I read first?   (Missing!)

OpenSource Connections

Why not MySQL – SQL LIKE?
•  SQL “LIKE” operator
   – scan all rows for a specific wildcard match

    SELECT * FROM posts WHERE body LIKE "%darth vader%"

Match? Match? Match? Match?
Performs a table scan: approx. 300 ms to search a measly 20K docs! (What if we had 20 million?)
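The LIKE behavior above can be tried without a MySQL server. This is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in for MySQL; the `posts` table layout and the `search_like` helper are made up for illustration, and the two rows are the posts from the previous slide.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (id, body) VALUES (?, ?)",
    [
        (1, "What exactly did Obiwan know about Anakin and Darth Vader "
            "before a New Hope started?"),
        (2, "Been meaning to read the Foundation Series, "
            "what should I read first?"),
    ],
)


def search_like(phrase):
    """Return ids of posts whose body contains the phrase, via a LIKE wildcard."""
    pattern = f"%{phrase}%"
    rows = conn.execute(
        "SELECT id FROM posts WHERE body LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


print(search_like("darth vader"))  # finds post 1
print(search_like("vader darth"))  # finds nothing: the literal must match as written
```

The second call returns no rows even though both words occur in post 1, which is the "not fuzzy" problem in miniature.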

SQL LIKE | CTRL+F | grep is
1.  Extremely slow -- scan! yuck!
2.  Not fuzzy -- needs exact literal matches, no fuzziness!
3.  Unranked -- simply says y/n whether there is a match. But what's relevant?

Elasticsearch
We've all done Elasticsearch 101:

    curl -XPUT http://localhost:9200/stackexchange
    curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "Darth Vader dined with Luke", "Title": "..."}'

But what is being built?
The answer can be found in your textbook…

Book index:
•  Topics -> page no.
•  Very efficient tool – compare to scanning the whole book!

Lucene uses an index:
•  Tokens => document ids:
      laser => [2, 4]
      light => [2, 5]
      lightsaber => [0, 1, 5, 7]
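To make the token => document ids idea concrete, here is a small sketch that uses the three postings lists from the slide as a plain Python dict. The helper names `docs_with_any` and `docs_with_all` are invented for this example and are not Lucene APIs.

```python
# Postings from the slide: token -> sorted list of document ids.
postings = {
    "laser": [2, 4],
    "light": [2, 5],
    "lightsaber": [0, 1, 5, 7],
}


def docs_with_any(tokens):
    """Document ids containing at least one of the tokens (OR)."""
    result = set()
    for token in tokens:
        result.update(postings.get(token, []))
    return sorted(result)


def docs_with_all(tokens):
    """Document ids containing every token (AND)."""
    id_sets = [set(postings.get(token, [])) for token in tokens]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


print(docs_with_any(["laser", "light"]))       # [2, 4, 5]
print(docs_with_all(["laser", "light"]))       # [2]
print(docs_with_all(["laser", "lightsaber"]))  # []
```

Each lookup touches only the lists for the query tokens, never the documents themselves, which is why it beats scanning.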

Computers == Dumb
•  Humans are smart
   o  I see “cat” or “cats” in the back of a book, no duh – jump to page 9
•  Computers are dumb
   o  “CAT” != “cat”
   o  “cat” != “cats”
   o  no match returned

Normalization aka Text Analysis

    curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball' -d 'Darth Vader dined with Luke'

    {
      "tokens": [
        { "end_offset": 5,  "position": 1, "start_offset": 0,  "token": "darth", "type": "<ALPHANUM>" },
        { "end_offset": 11, "position": 2, "start_offset": 6,  "token": "vader", "type": "<ALPHANUM>" },
        { "end_offset": 17, "position": 3, "start_offset": 12, "token": "dine",  "type": "<ALPHANUM>" },
        { "end_offset": 27, "position": 5, "start_offset": 23, "token": "luke",  "type": "<ALPHANUM>" }
      ]
    }
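The shape of that output can be imitated with a toy analyzer. This sketch is not the Snowball analyzer: the stopword set and the hand-written `STEMS` table are assumptions chosen only to reproduce this one example, and a real stemmer applies suffix rules instead of a lookup.

```python
import re

# Tiny stand-ins for a real stopword list and stemmer (assumptions for this sketch).
STOPWORDS = {"with", "the", "a", "and"}
STEMS = {"dined": "dine", "cats": "cat"}


def analyze(text):
    """Lowercase, drop stopwords, stem, and record offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        word = match.group().lower()
        if word in STOPWORDS:
            continue  # the position counter still advances, leaving a gap
        tokens.append({
            "token": STEMS.get(word, word),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for tok in analyze("Darth Vader dined with Luke"):
    print(tok)
```

The dropped stopword "with" is why position 4 is missing from the output, and the same normalization makes "CAT" and "cats" land on one token.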

What is being built?

    curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "Darth Vader dined with Luke", "Title": "..."}'
    curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{ "Body": "We love Darth", "Title": "..."}'

field Body
   term darth
      doc 1 <metadata>
      doc 2 <metadata>
   term vader
      doc 1 <metadata>
   term dine
      doc 1 <metadata>
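A rough sketch of how the two Body values could end up in that term => docs structure. The `analyze` function here is a simplified stand-in for real text analysis (lowercasing, one stopword, a one-entry stem table), so the resulting index also holds terms the slide leaves out, such as "luke" and "love".

```python
from collections import defaultdict

# Simplified analysis: assumptions for this sketch, not Elasticsearch's analyzer.
STOPWORDS = {"with"}
STEMS = {"dined": "dine"}


def analyze(text):
    """Lowercase, split on whitespace, drop stopwords, apply the toy stem table."""
    words = [word.lower() for word in text.split()]
    return [STEMS.get(word, word) for word in words if word not in STOPWORDS]


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(analyze(docs[doc_id]))):
            index[term].append(doc_id)
    return dict(index)


docs = {
    1: "Darth Vader dined with Luke",
    2: "We love Darth",
}
index = build_index(docs)
print(index["darth"])  # [1, 2]
print(index["vader"])  # [1]
print(index["dine"])   # [1]
```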

Ranking

    curl -XPUT http://localhost:9200/stackexchange/post/1 -d '{ "Body": "Darth Vader dined with Luke", "Title": "..."}'
    curl -XPUT http://localhost:9200/stackexchange/post/2 -d '{ "Body": "We love Darth", "Title": "..."}'

field Body
   term darth
      doc 1 <metadata>
      doc 2 <metadata>
   term vader
      doc 1 <metadata>
   term dine
      doc 1 <metadata>

Can we store anything here to help decide how relevant this term is for this doc? Yes!
-  Term frequency
   -  How much “darth” is in this doc?
-  Position within document
   -  Helps when we search for the phrase “darth vader”
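One way to picture that per-document metadata is to store the list of positions for each term in each doc: the length of the list is the term frequency, and adjacent positions answer phrase queries. This is a sketch with invented helper names and whitespace-only analysis; Lucene's actual on-disk layout is different.

```python
from collections import defaultdict


def analyze(text):
    """Whitespace tokenization plus lowercasing (deliberately minimal)."""
    return [word.lower() for word in text.split()]


def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, body in docs.items():
        for position, term in enumerate(analyze(body)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def term_frequency(index, term, doc_id):
    """How many times the term occurs in the document."""
    return len(index.get(term, {}).get(doc_id, []))


def phrase_match(index, first, second):
    """Ids of docs where `second` appears immediately after `first`."""
    hits = []
    second_postings = index.get(second, {})
    for doc_id, positions in index.get(first, {}).items():
        following = set(second_postings.get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.append(doc_id)
    return sorted(hits)


docs = {1: "Darth Vader dined with Luke", 2: "We love Darth"}
index = build_positional_index(docs)
print(term_frequency(index, "darth", 1))      # 1
print(phrase_match(index, "darth", "vader"))  # [1]
print(phrase_match(index, "vader", "darth"))  # []
```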

Query Documents
•  When did Darth Vader and Luke have dinner?

    curl -X POST "http://localhost:9200/stackexchange/_search?pretty=true" -d '
    { "query": { "match": { "Body": "luke darth dinner" } } }'

User query: "luke darth dinner"

What happens when we query?

luke darth dinner
   Analysis: [luke] [darth] [dine]
   How to consult index for matches? [darth] [dine] ...
   Score for [darth] docs (1 and 2)
   Score for [dine] docs (1)
   Return sorted docs to client

field Body
   term darth
      doc 1 <metadata>
      doc 2 <metadata>
   term vader
      doc 1 <metadata>
   term dine
      doc 1 <metadata>
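The whole round trip (analyze the query, look up each term, score, sort) fits in a short sketch. Two loud assumptions: the toy `STEMS` table maps "dinner" to "dine" only because the slide does, and the score is a plain sum of term frequencies, which is far simpler than Lucene's real scoring.

```python
from collections import Counter, defaultdict

# Toy analysis settings (assumptions for this sketch).
STOPWORDS = {"with"}
STEMS = {"dined": "dine", "dinner": "dine"}


def analyze(text):
    """Lowercase, drop stopwords, apply the toy stem table."""
    words = [word.lower() for word in text.split()]
    return [STEMS.get(word, word) for word in words if word not in STOPWORDS]


def build_index(docs):
    """Map term -> Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, body in docs.items():
        for term in analyze(body):
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Return (doc_id, score) pairs, best first; score = summed term frequency."""
    scores = Counter()
    for term in analyze(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {1: "Darth Vader dined with Luke", 2: "We love Darth"}
index = build_index(docs)
print(search(index, "luke darth dinner"))  # [(1, 3), (2, 1)]
```

Doc 1 matches all three query terms and doc 2 matches only "darth", so doc 1 sorts first.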

Time to be an idiot!!
STUPID SEARCH TRICKS!!!
