Demystifying Search

WTF is Search? @allizad
Hi, I’m Alli from One More Cloud (bonsai.io for hosted Elasticsearch, and websolr.com for hosted Solr). We’re the first “search in the cloud” hosting company, local to Austin and self-funded. This talk covers how search works on a fundamental level.

It’s helpful to think about the end result for search: there’s a text box for querying, some term a user submits, and “hits” that come back (maybe with some clever highlighting, or faceting, which is common in ecommerce).

context first, let’s zoom out and look at some history of search engines

where search is today
• We’re standing on the shoulders of giants
• The search libraries we still use today have been around since 1999
  • Apache Lucene! ❤
• There are two major open-source search engines
  • Solr
  • Elasticsearch
• Plus, a bunch of proprietary options…because, capitalism

Powerful search has been building for a long time. Most of what you use today is built on top of an open-source Java search library called Apache Lucene. This is important to note: because Lucene is still being hacked on today, its changes coincide with changes to the popular engines built on it, Solr and Elasticsearch.

where it came from
• 1999: Apache Lucene
  • Java search library
• 2004: Solr
  • Java
  • XML
  • New releases (v6 out)
• 2004: Compass
  • Java
  • Transition project that informed the build of Elasticsearch
• 2010: Elasticsearch
  • Java
  • JSON
  • New releases (v7 coming soon)

There’s a longish history here, and many of the open-source solutions available today came out of many iterations on the same ideas as Lucene. The basics: there’s Solr, which is the older iteration and XML-based, and Elasticsearch, which is newer and has gained a lot of popularity over the last few years.

types of search engines
1. Simple client-side (JavaScript type-ahead find)
2. Full-text search engines that speak HTTP
  • Open-source
    • Elasticsearch
    • Solr
  • Proprietary

Sure, you _could_ build a search engine on the client side if you have small data sets, but the bigger your data gets, the more power you need. It’s kind of like static sites vs. server-backed applications. Serving a simple site with plain old HTML and CSS files, maybe some JavaScript? Stick to S3. Need to hook it up to a database, add in lots of records, and do it with redundancy? You need more firepower.

search architecture

app architecture sans search
There’s an application, which runs on a box, and a db, which is a server in and of itself: it runs separately and talks to your app over the network (with credentials).

app architecture with search
Likewise, a search engine runs as its own server: it speaks HTTP and exchanges calls and responses with your app through a client. There’s a plethora of client libraries: https://www.elastic.co/guide/en/elasticsearch/client/index.html

two main parts to using search engines
1. Indexing: writing data to a search engine
2. Querying: reading from the engine, i.e. sending a query to the engine and getting back results
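To make the two parts concrete, here is a minimal sketch of what each one looks like as an HTTP request against Elasticsearch. It only builds the requests with Ruby's standard library and does not send them. The index name `foods` and the helper names `index_request` and `search_request` are assumptions for illustration; in a real app a client library would do this for you.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical index name, chosen to match the food examples in this talk.
INDEX_NAME = 'foods'

# Indexing: write one document to the engine (PUT /<index>/_doc/<id>).
def index_request(id, document)
  request = Net::HTTP::Put.new("/#{INDEX_NAME}/_doc/#{id}",
                               'Content-Type' => 'application/json')
  request.body = JSON.generate(document)
  request
end

# Querying: read from the engine (GET /<index>/_search?q=<term>).
def search_request(query)
  Net::HTTP::Get.new("/#{INDEX_NAME}/_search?#{URI.encode_www_form(q: query)}")
end

write = index_request(4, { name: 'Chicken fried chicken' })
read  = search_request('chicken')

puts "#{write.method} #{write.path} #{write.body}"
puts "#{read.method} #{read.path}"
```

To actually send these you would pass them to `Net::HTTP.start(host, port) { |http| http.request(write) }`, where the host and port depend on where your cluster runs.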

how search engines work

hint: storyofsearch.com

the magical inverted index
• A search engine isn’t a database, but it’s kind of like one.
• On a basic level, search engines work the same way as a book index.
• A sorted, inverted index is used to look up terms via binary search.
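The last bullet can be sketched in a few lines of Ruby. This is a toy illustration rather than how Lucene stores its term dictionary: the dictionary is a plain sorted array of `[term, postings]` pairs, using the food data that appears in this talk, and `lookup` is a hypothetical helper name.

```ruby
# A sorted term dictionary: each entry is [term, postings],
# and each posting is [doc_id, term_frequency].
TERM_DICTIONARY = [
  ['chicken',  [[1, 1], [2, 1], [4, 2]]],
  ['fried',    [[4, 1], [5, 1]]],
  ['pie',      [[1, 1]]],
  ['soup',     [[2, 1], [3, 1]]],
  ['tacos',    [[5, 1]]],
  ['tortilla', [[3, 1]]]
].freeze

# Binary search over the sorted terms (Array#bsearch in find-any mode).
# The block returns 0 on a match, a positive number when the wanted term
# sorts after the probed one, and a negative number when it sorts before.
def lookup(term)
  entry = TERM_DICTIONARY.bsearch { |(candidate, _postings)| term <=> candidate }
  entry ? entry.last : []
end

p lookup('chicken')
p lookup('beef')
```

Because the terms are sorted, each lookup inspects about log2(n) entries instead of scanning all of them, which is the same reason a book index is alphabetized.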

indexing: a transformation of the database

Database retrieval
“Give me the thing I know about”
appname.com/foods/1

id  name                   field1  field2
1   Chicken pie            …       …
2   Chicken soup           …       …
3   Tortilla soup          …       …
4   Chicken fried chicken  …       …
5   Fried tacos            …       …

Food.where(name: "Chicken fried chicken")
#=> [ { id: 4, name: "Chicken fried chicken", field1: "foo", field2: "baz" } ]

Search retrieval
“Give me the thing I don’t know about.”

rake environment elasticsearch:import:model CLASS='Food'

Indexing Process

Database

id  name                   field1  field2
1   Chicken pie            …       …
2   Chicken soup           …       …
3   Tortilla soup          …       …
4   Chicken fried chicken  …       …
5   Fried tacos            …       …

Fields we want to search    Terms
Chicken pie                 “chicken” “pie”
Chicken soup                “chicken” “soup”
Tortilla soup               “tortilla” “soup”
Chicken fried chicken       “chicken” “chicken” “fried”
Fried tacos                 “fried” “tacos”
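The step from field values to terms can be sketched as a tiny tokenizer. This is a simplification under stated assumptions: it only lowercases and splits on non-alphanumeric characters, while real analyzers in Lucene-based engines can also do things like stemming and stop-word removal. The name `tokenize` is hypothetical.

```ruby
# Lowercase the text and pull out runs of letters and digits as terms.
def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

FOOD_NAMES = [
  'Chicken pie',
  'Chicken soup',
  'Tortilla soup',
  'Chicken fried chicken',
  'Fried tacos'
].freeze

FOOD_NAMES.each do |name|
  puts "#{name}: #{tokenize(name).inspect}"
end
```

Note that this sketch keeps terms in the order they appear in the text, so “Chicken fried chicken” yields chicken, fried, chicken.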

Inverted Index

Sorted term dictionary   postings
chicken                  <1, 1>, <2, 1>, <4, 2>
fried                    <4, 1>, <5, 1>
pie                      <1, 1>
soup                     <2, 1>, <3, 1>
tacos                    <5, 1>
tortilla                 <3, 1>
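Here is a minimal sketch of how an index like the one above could be built from the five food names. It is an in-memory toy, not Lucene's on-disk format, and `build_inverted_index` is a hypothetical helper. Each posting is `[doc_id, term_frequency]`, matching the `<id, count>` pairs in the table.

```ruby
FOODS = {
  1 => 'Chicken pie',
  2 => 'Chicken soup',
  3 => 'Tortilla soup',
  4 => 'Chicken fried chicken',
  5 => 'Fried tacos'
}.freeze

def build_inverted_index(docs)
  index = Hash.new { |hash, term| hash[term] = [] }

  docs.each do |id, text|
    # Count how often each term occurs in this one document.
    counts = text.downcase.scan(/[[:alnum:]]+/)
                 .each_with_object(Hash.new(0)) { |term, acc| acc[term] += 1 }

    # Append one posting per distinct term: [doc_id, term_frequency].
    counts.each { |term, count| index[term] << [id, count] }
  end

  # Sort by term so the dictionary can be binary searched.
  index.sort.to_h
end

build_inverted_index(FOODS).each do |term, postings|
  puts "#{term.ljust(10)} #{postings.inspect}"
end
```

Documents are visited in id order, so each postings list comes out sorted by doc id without an extra sort.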

databases and search engines are designed for inherently different operations

Database lookup
appname.com/recipes/1

id  name                   field1  field2
1   Chicken pie            …       …
2   Chicken soup           …       …
3   Tortilla soup          …       …
4   Chicken fried chicken  …       …
5   Fried tacos            …       …

Search engine/index lookup
Q = “chicken”
A = <1,1>, <2,1>, <4,2>

Dictionary & postings

Sorted term dictionary   postings
chicken                  <1, 1>, <2, 1>, <4, 2>
fried                    <4, 1>, <5, 1>
pie                      <1, 1>
soup                     <2, 1>, <3, 1>
tacos                    <5, 1>
tortilla                 <3, 1>

The index can contain doc ids and the count of that term in each doc, plus fancier stuff as well, such as payloads in Lucene.

<id, term-frequency, doc-frequency>

tf/idf scoring: term frequency–inverse document frequency

Elasticsearch ships with BM25 as the standard scoring mechanism, but you can set your own scoring algorithms, or do things like boosting.
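To show why term frequency and document frequency are worth storing, here is a sketch of textbook tf-idf over the postings above. It is an assumption-laden simplification: the score is `tf * ln(N / df)`, which is neither Lucene's exact classic formula nor the BM25 scoring that Elasticsearch uses by default. `tf_idf_scores` is a hypothetical helper.

```ruby
# Postings from the food example: term => [[doc_id, term_frequency], ...]
POSTINGS = {
  'chicken'  => [[1, 1], [2, 1], [4, 2]],
  'fried'    => [[4, 1], [5, 1]],
  'pie'      => [[1, 1]],
  'soup'     => [[2, 1], [3, 1]],
  'tacos'    => [[5, 1]],
  'tortilla' => [[3, 1]]
}.freeze

TOTAL_DOCS = 5

# Returns [[doc_id, score], ...] sorted best first (ties broken by doc id).
def tf_idf_scores(term)
  postings = POSTINGS.fetch(term, [])
  return [] if postings.empty?

  # Document frequency is the number of docs containing the term.
  idf = Math.log(TOTAL_DOCS.to_f / postings.size)

  postings
    .map { |doc_id, tf| [doc_id, tf * idf] }
    .sort_by { |doc_id, score| [-score, doc_id] }
end

tf_idf_scores('chicken').each do |doc_id, score|
  puts "doc #{doc_id}: #{score.round(3)}"
end
```

With this scoring, doc 4 (“Chicken fried chicken”) ranks first for “chicken” because the term occurs twice there, while docs 1 and 2 tie.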

Q: “What are all the foods with chicken in them?”
A: <1,1>, <2,1>, <4,2>

Term dictionary   postings
chicken           <1, 1>, <2, 1>, <4, 2>
fried             <4, 1>, <5, 1>
pie               <1, 1>
soup              <2, 1>, <3, 1>
tacos             <5, 1>
tortilla          <3, 1>

require 'elasticsearch'

client = Elasticsearch::Client.new log: true
client.search q: 'chicken'
#=> hits: [
#     { id: 4, name: "Chicken fried chicken", field1: "foo", field2: "baz" },
#     { id: 1, name: "Chicken pie", field1: "bar", field2: "blitz" },
#     { id: 2, name: "Chicken soup", field1: "blah", field2: "fuzz" }
#     ...
#   ]

intermission no 2
Clone this repo, get coffee, stretch your legs: https://github.com/allizad/react-search-example

git clone https://github.com/allizad/react-search-example
cd react-search-example
npm install (or yarn install)
npm start

Make sure you’re on master.
Run git checkout add-search to see the full example.

bonsai.io websolr.com storyofsearch.com @allizad [email protected]

Allison Zadrozny

fin
