ElasticSearch — Your Data, Your Search (EURUKO 2011)

Your Data, Your Search
Karel Minařík and Florian Hanke
http://karmi.cz

Search is the primary interface for getting information today.

http://www.apple.com/macosx/what-is-macosx/spotlight.html

# https://github.com/rubygems/gemcutter/blob/master/app/models/rubygem.rb#L29-33
def self.search(query)
  where("versions.indexed and (upper(name) like upper(:query) or upper(versions.description) like upper(:query))", {:query => "%#{query.strip}%"}).
    includes(:versions).
    order("rubygems.downloads desc")
end

Search (mostly) sucks. Why?

WHY SEARCH SUCKS?
How do you implement search?

class MyModel
  include Whatever::Search
end

MyModel.search "whatever"

MAGIC

def search
  @results = MyModel.search params[:q]
  respond_with @results
end

Diagram labels: Query, Results, Result, MAGIC

A personal story...

Compare your search library with your ORM library

MyModel.search "(this OR that) AND NOT whatever"

Arel::Table.new(:articles).
  where(articles[:title].eq('On Search')).
  where(["published_on => ?", Time.now]).
  join(comments).
  on(articles[:id].eq(comments[:article_id])).
  take(5).
  skip(4).
  to_sql

Your data, your search.

HOW DOES SEARCH WORK?
A collection of documents

file_1.txt  The ruby is a pink to blood-red colored gemstone ...
file_2.txt  Ruby is a dynamic, reflective, general-purpose object-oriented programming language ...
file_3.txt  "Ruby" is a song by English rock band Kaiser Chiefs ...

How do you search documents?

File.read('file1.txt').include?('ruby')

The inverted index
http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices

TOKENS        POSTINGS
ruby          file_1.txt  file_2.txt  file_3.txt
pink          file_1.txt
gemstone      file_1.txt
dynamic       file_2.txt
reflective    file_2.txt
programming   file_2.txt
song          file_3.txt
english       file_3.txt
rock          file_3.txt

Looking up a token in the inverted index:

MySearchLib.search "ruby"   =>  file_1.txt  file_2.txt  file_3.txt
MySearchLib.search "song"   =>  file_3.txt

A naïve Ruby implementation

module SimpleSearch

  def index document, content
    tokens = analyze content
    store document, tokens
    puts "Indexed document #{document} with tokens:", tokens.inspect, "\n"
  end

  def analyze content
    # >>> Split content by words into "tokens"
    content.split(/\W/).
    # >>> Downcase every word
    map    { |word| word.downcase }.
    # >>> Reject stop words, digits and whitespace
    reject { |word| STOPWORDS.include?(word) || word =~ /^\d+/ || word == '' }
  end

  def store document_id, tokens
    tokens.each do |token|
      # >>> Save the "posting"
      ( (INDEX[token] ||= []) << document_id ).uniq!
    end
  end

  def search token
    puts "Results for token '#{token}':"
    # >>> Print documents stored in index for this token
    INDEX[token].each { |document| puts "  * #{document}" }
  end

  INDEX = {}
  # The rest of this list is cut off on the slide
  STOPWORDS = %w|a an and are as at but by for if in is it no not of on or that the then there|

  extend self

end

Indexing documents

SimpleSearch.index "file1", "Ruby is a language. Java is also a language."
SimpleSearch.index "file2", "Ruby is a song."
SimpleSearch.index "file3", "Ruby is a stone."
SimpleSearch.index "file4", "Java is a language."

Indexed document file1 with tokens:
["ruby", "language", "java", "also", "language"]
Indexed document file2 with tokens:
["ruby", "song"]
Indexed document file3 with tokens:
["ruby", "stone"]
Indexed document file4 with tokens:
["java", "language"]

Words downcased, stopwords removed.

The index

puts "What's in our index?"
p SimpleSearch::INDEX

{
  "ruby"     => ["file1", "file2", "file3"],
  "language" => ["file1", "file4"],
  "java"     => ["file1", "file4"],
  "also"     => ["file1"],
  "stone"    => ["file3"],
  "song"     => ["file2"]
}

Search the index

SimpleSearch.search "ruby"

Results for token 'ruby':
  * file1
  * file2
  * file3


It is very practical to know how search works. For instance, now you know that the analysis step is very important. Most of the time, it's more important than the search step.
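To make that point concrete, here is a minimal sketch in plain Ruby (it is not ElasticSearch code). It indexes the same two sentences with two hypothetical analyzers, `raw_analyze` and `normalizing_analyze`, and then looks up the token "ruby" in each index. The method names, the tiny stop word list and the sample documents are assumptions made for this illustration.

```ruby
# Hypothetical analyzer 1: splits on whitespace only,
# so case and punctuation stay in the tokens.
def raw_analyze(content)
  content.split(/\s+/)
end

# Hypothetical analyzer 2: splits on non-word characters, downcases,
# and drops blanks and a (made-up, very short) list of stop words.
def normalizing_analyze(content)
  stop_words = %w[a an is the]
  content.split(/\W+/).
    map    { |word| word.downcase }.
    reject { |word| word.empty? || stop_words.include?(word) }
end

# Builds a token => [document ids] hash with the given analyzer method.
def build_index(documents, analyzer)
  index = Hash.new { |hash, token| hash[token] = [] }
  documents.each do |id, content|
    send(analyzer, content).uniq.each { |token| index[token] << id }
  end
  index
end

documents = {
  "file2" => "Ruby is a song.",
  "file3" => "The ruby is a stone."
}

raw_index        = build_index(documents, :raw_analyze)
normalized_index = build_index(documents, :normalizing_analyze)

puts "Raw index, 'ruby':        #{raw_index.fetch('ruby', []).inspect}"
puts "Normalized index, 'ruby': #{normalized_index.fetch('ruby', []).inspect}"
```

The lookup itself is identical in both cases; only the analysis differs, and it alone decides whether "Ruby" in file2 can be found by searching for "ruby".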


The Search Engine Textbook

Search Engines: Information Retrieval in Practice
Bruce Croft, Donald Metzler and Trevor Strohman
Addison Wesley, 2009
http://search-engines-book.com

SEARCH IMPLEMENTATIONS
The Baseline Information Retrieval Implementation

Lucene in Action
Michael McCandless, Erik Hatcher and Otis Gospodnetic
July, 2010
http://manning.com/hatcher3

http://elasticsearch.org

ElasticSearch features: HTTP, JSON, Schema-free, Index as Resource, Distributed, Queries, Facets, Mapping, Ruby

ELASTICSEARCH FEATURES
HTTP JSON / Schema-free / Index as Resource / Distributed / Queries / Facets / Mapping / Ruby

# Add document (URL path: INDEX / TYPE / ID)
curl -X POST "http://localhost:9200/articles/article/1" -d '{ "title" : "One" }'

# Query
curl -X GET  "http://localhost:9200/articles/_search?q=One"
curl -X POST "http://localhost:9200/articles/_search" -d '{
  "query" : { "terms" : { "tags" : ["ruby", "python"], "minimum_match" : 2 } }
}'

# Delete index
curl -X DELETE "http://localhost:9200/articles"

# Create index with settings and mapping
curl -X PUT "http://localhost:9200/articles" -d '
{
  "settings" : { "index" : { "number_of_shards" : 3, "number_of_replicas" : 2 } },
  "mappings" : {
    "document" : {
      "properties" : {
        "body" : { "type" : "string", "analyzer" : "snowball" }
      }
    }
  }
}'


http {
  server {
    listen       8080;
    server_name  search.example.com;

    error_log   elasticsearch-errors.log;
    access_log  elasticsearch.log;

    location / {
      # Deny access to Cluster API
      if ($request_filename ~ "_cluster") {
        return 403;
        break;
      }

      # Pass requests to ElasticSearch
      proxy_pass http://localhost:9200;
      proxy_redirect off;
      proxy_set_header  X-Real-IP        $remote_addr;
      proxy_set_header  X-Forwarded-For  $proxy_add_x_forwarded_for;
      proxy_set_header  Host             $http_host;

      # Authorize access
      auth_basic           "ElasticSearch";
      auth_basic_user_file passwords;

      # Route all requests to authorized user's own index
      rewrite  ^(.*)$  /$remote_user$1  break;
      rewrite_log on;

      return 403;
    }
  }
}

GET http://user:password@localhost:8080/_search?q=*
 => http://localhost:9200/user/_search?q=*

https://gist.github.com/986390

JSON

{
  "id"    : "abc123",
  "title" : "ElasticSearch Understands JSON!",
  "body"  : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",
  "published_on" : "2011/05/27 10:00:00",
  "tags"  : ["search", "json"],
  "author" : {
    "first_name" : "Clara",
    "last_name"  : "Rice",
    "email"      : "[email protected]"
  }
}

curl -X DELETE "http://localhost:9200/articles"; sleep 1
curl -X POST   "http://localhost:9200/articles/article" -d '
{
  "id"    : "abc123",
  "title" : "ElasticSearch Understands JSON!",
  "body"  : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",
  "published_on" : "2011/05/27 10:00:00",
  "tags"  : ["search", "json"],
  "author" : {
    "first_name" : "Clara",
    "last_name"  : "Rice",
    "email"      : "[email protected]"
  }
}'
curl -X POST   "http://localhost:9200/articles/_refresh"

curl -X GET \
  "http://localhost:9200/articles/article/_search?q=author.first_name:clara"

curl -X GET "http://localhost:9200/articles/_mapping?pretty=true"

{
  "articles" : {
    "article" : {
      "properties" : {
        "title" : {
          "type" : "string"
        },
        // ...
        "author" : {
          "dynamic" : "true",
          "properties" : {
            "first_name" : {
              "type" : "string"
            },
            // ...
          }
        },
        "published_on" : {
          "format" : "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd",
          "type" : "date"
        }
      }
    }
  }
}

curl -X POST "http://localhost:9200/articles/comment" -d '
{
  "body" : "Wow! Really nice JSON support.",
  "published_on" : "2011/05/27 10:05:00",
  "author" : {
    "first_name" : "John",
    "last_name"  : "Pear",
    "email"      : "[email protected]"
  }
}'
curl -X POST "http://localhost:9200/articles/_refresh"

curl -X GET \
  "http://localhost:9200/articles/comment/_search?q=author.first_name:john"

curl -X GET \
  "http://localhost:9200/articles/comment/_search?q=body:json"

curl -X GET \
  "http://localhost:9200/articles/_search?q=body:json"

curl -X GET \
  "http://localhost:9200/articles,users/_search?q=body:json"

curl -X GET \
  "http://localhost:9200/_search?q=body:json"

curl -X DELETE "http://localhost:9200/articles"; sleep 1
curl -X POST   "http://localhost:9200/articles/article/1" -d '
{
  "id"    : "1",
  "title" : "ElasticSearch Understands JSON!",
  "body"  : "ElasticSearch not only “works” with JSON, it understands it! Let’s first ...",
  "published_on" : "2011/05/27 10:00:00",
  "tags"  : ["search", "json"],
  "author" : {
    "first_name" : "Clara",
    "last_name"  : "Rice",
    "email"      : "[email protected]"
  }
}'
curl -X POST   "http://localhost:9200/articles/_refresh"

curl -X GET    "http://localhost:9200/articles/article/1"

The Index Is Your Database.

{"_index":"articles","_type":"article","_id":"1","_version":1, "_source" :
{
  "id"    : "1",
  "title" : "ElasticSearch Understands JSON!",
  "body"  : "ElasticSearch not only “works” with JSON, it understands it! Let’s ...",
  "published_on" : "2011/05/27 10:00:00",
  "tags"  : ["search", "json"],
  "author" : {
    "first_name" : "Clara",
    "last_name"  : "Rice",
    "email"      : "[email protected]"
  }
}}

Index Aliases

Diagram: my_alias -> index_A, index_B

curl -X POST 'http://localhost:9200/_aliases' -d '
{
  "actions" : [
    { "add" : { "index" : "index_1", "alias" : "myalias" } },
    { "add" : { "index" : "index_2", "alias" : "myalias" } }
  ]
}'

http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

The “Sliding Window” problem

“We can really store only three months worth of data.”

Diagram: logs -> logs_2010_02, logs_2010_03, logs_2010_04

curl -X DELETE http://localhost:9200/logs_2010_01
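The bookkeeping behind such a sliding window can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: one index per month named `logs_YYYY_MM` as on the slide, and two hypothetical helpers, `indices_to_keep` and `indices_to_delete`. The sketch only computes index names; deleting the old index and updating the alias would still be done with requests like the ones shown above.

```ruby
require 'date'

# Hypothetical helper: names of the monthly indices inside the window,
# oldest first, ending with the month of `today`.
def indices_to_keep(today, months, prefix = "logs")
  (0...months).map { |i| (today << i).strftime("#{prefix}_%Y_%m") }.reverse
end

# Hypothetical helper: existing indices that have fallen out of the window.
def indices_to_delete(existing, today, months, prefix = "logs")
  existing - indices_to_keep(today, months, prefix)
end

today    = Date.new(2010, 4, 15)
existing = %w[logs_2010_01 logs_2010_02 logs_2010_03 logs_2010_04]

keep   = indices_to_keep(today, 3)
delete = indices_to_delete(existing, today, 3)

puts "Keep:   #{keep.join(', ')}"
puts "Delete: #{delete.join(', ')}"
```

`Date#<<` steps back whole months, so the window also works across a year boundary.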

Index Templates

curl -X PUT localhost:9200/_template/bookmarks_template -d '
{
  "template" : "users_*",

  "settings" : {
    "index" : {
      "number_of_shards"   : 1,
      "number_of_replicas" : 3
    }
  },

  "mappings": {
    "url": {
      "properties": {
        "url":   { "type": "string", "analyzer": "simple",   "boost": 10 },
        "title": { "type": "string", "analyzer": "snowball", "boost": 5 }
        // ...
      }
    }
  }
}
'

Apply this configuration for every matching index being created
http://www.elasticsearch.org/guide/reference/api/admin-indices-templates.html

Automatic Discovery Protocol

Diagram: Node 1, Node 2, Node 3, Node 4, MASTER

$ cat elasticsearch.yml
cluster:
  name: <YOUR APPLICATION>

http://www.elasticsearch.org/guide/reference/modules/discovery/

curl -XPUT 'http://localhost:9200/A/' -d '{
  "settings" : {
    "index" : {
      "number_of_shards"   : 3,
      "number_of_replicas" : 2
    }
  }
}'

Shards:    A1    A2    A3
Replicas:  A1'   A2'   A3'
           A1''  A2''  A3''

Index is split into 3 shards, and duplicated in 2 replicas.
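The arithmetic behind that diagram, as a small Ruby sketch: `number_of_replicas` counts additional copies of each shard, so 3 shards with 2 replicas give 9 shard copies in total. The helper `shard_copies` and its naming scheme (one prime mark per replica generation, as in the diagram) are made up for this illustration.

```ruby
# Hypothetical helper: lists every shard copy of an index.
# Copy 0 is the primary shard; copies 1..number_of_replicas are replicas.
def shard_copies(index_name, number_of_shards, number_of_replicas)
  (0..number_of_replicas).flat_map do |copy|
    (1..number_of_shards).map do |shard|
      index_name + shard.to_s + "'" * copy
    end
  end
end

copies = shard_copies("A", 3, 2)

puts copies.join(" ")
puts "Total shard copies: #{copies.size}"
```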

SHARDS: Improve indexing performance
REPLICAS: Improve search performance

Terms       apple
            apple iphone
Phrases     "apple iphone"
Proximity   "apple safari"~5
Fuzzy       apple~0.8
Wildcards   app*
            *pp*
Boosting    apple^10 safari
Range       [2011/05/01 TO 2011/05/31]
            [java TO json]
Boolean     apple AND NOT iphone
            +apple -iphone
            (apple OR iphone) AND NOT review
Fields      title:iphone^15 OR body:iphone
            published_on:[2011/05/01 TO "2011/05/27 10:00:00"]

http://lucene.apache.org/java/3_1_0/queryparsersyntax.html

$ curl -X GET "http://localhost:9200/_search?q=<YOUR QUERY>"
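Queries like these contain spaces, quotes, brackets and other characters that need URL-encoding before they can go into the `q` parameter. A minimal Ruby sketch, using only the standard library: the helper name `search_url` is hypothetical, and the base URL is the local default used in the curl examples.

```ruby
require 'uri'

# Hypothetical helper: builds a "_search" URL for a Lucene query string.
# `index` is optional; without it the search runs across all indices.
def search_url(query, index = nil, base = "http://localhost:9200")
  path = [base, index, "_search"].compact.join("/")
  "#{path}?#{URI.encode_www_form(:q => query)}"
end

proximity_url = search_url('"apple safari"~5')
fields_url    = search_url('title:iphone^15 OR body:iphone', 'articles')

puts proximity_url
puts fields_url
```

The resulting URLs can be passed to curl or to any HTTP client as they are.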

Query DSL

curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '
{
  "query" : {
    "terms" : {
      "tags" : [ "ruby", "python" ],
      "minimum_match" : 2
    }
  }
}'

http://www.elasticsearch.org/guide/reference/query-dsl/
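What the query above asks for can be imitated in plain Ruby: keep the documents whose `tags` contain at least `minimum_match` of the listed terms. This is an in-memory illustration only, with made-up sample documents and a hypothetical helper `terms_query`; it ignores analysis and scoring, which the real query also involves.

```ruby
# Hypothetical in-memory stand-in for the "terms" query with "minimum_match":
# keeps documents whose field contains at least `minimum_match` of `terms`.
def terms_query(documents, field, terms, minimum_match = 1)
  documents.select do |doc|
    (Array(doc[field]) & terms).size >= minimum_match
  end
end

# Made-up sample documents, for illustration only.
articles = [
  { "title" => "One",   "tags" => ["ruby"] },
  { "title" => "Two",   "tags" => ["ruby", "python"] },
  { "title" => "Three", "tags" => ["java"] }
]

both   = terms_query(articles, "tags", ["ruby", "python"], 2)
either = terms_query(articles, "tags", ["ruby", "python"], 1)

puts "minimum_match 2: #{both.map   { |doc| doc['title'] }.inspect}"
puts "minimum_match 1: #{either.map { |doc| doc['title'] }.inspect}"
```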

Geo Search

curl -X POST "http://localhost:9200/venues/venue" -d '
{
  "name": "Pizzeria",
  "pin": {
    "location": {
      "lat": 50.071712,
      "lon": 14.386832
    }
  }
}'

curl -X POST "http://localhost:9200/venues/_search?pretty=true" -d '
{
  "query" : {
    "filtered" : {
      "query" : { "query_string" : { "query" : "pizzeria" } },
      "filter" : {
        "geo_distance" : {
          "distance" : "0.5km",
          "pin.location" : { "lat" : 50.071481, "lon" : 14.387284 }
        }
      }
    }
  }
}'

Accepted formats for Geo:
  [lon, lat]      # Array
  "lat,lon"       # String
  drm3btev3e86    # Geohash

http://www.elasticsearch.org/guide/reference/query-dsl/geo-distance-filter.html

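To check by hand that the pizzeria really falls inside the 0.5km radius of the geo search above, the distance between the two coordinate pairs can be estimated in Ruby. This sketch uses the haversine formula on a spherical Earth with a mean radius of 6371 km; both are assumptions of the sketch, and ElasticSearch's own distance calculation may differ in detail.

```ruby
# Great-circle distance in kilometres (haversine formula).
# Assumption: spherical Earth, mean radius 6371 km.
def haversine_km(lat1, lon1, lat2, lon2)
  earth_radius_km = 6371.0
  to_rad = Math::PI / 180

  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad

  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2

  2 * earth_radius_km * Math.asin(Math.sqrt(a))
end

# Coordinates taken from the indexed venue and from the geo_distance filter.
pizzeria = { "lat" => 50.071712, "lon" => 14.386832 }
origin   = { "lat" => 50.071481, "lon" => 14.387284 }

distance = haversine_km(origin["lat"], origin["lon"],
                        pizzeria["lat"], pizzeria["lon"])

puts format("Distance: %.3f km, within 0.5km: %s", distance, distance <= 0.5)
```

The two points are only a few dozen metres apart, so the venue passes the filter comfortably.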
http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/ (diagram label: Query)

Facets

curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '
{
  "query" : {
    "query_string" : { "query" : "title:T*"}
  },
  "filter" : {
    "terms" : { "tags" : ["ruby"] }
  },
  "facets" : {
    "tags" : {
      "terms" : {
        "field" : "tags",
        "size" : 10
      }
    }
  }
}'

# "facets" : {
#   "tags" : {
#     "terms" : [ {
#       "term" : "ruby",
#       "count" : 2
#     }, {
#       "term" : "python",
#       "count" : 1
#     }, {
#       "term" : "java",
#       "count" : 1
#     } ]
#   }
# }

Slide labels: User query, “Checkboxes”, Facets

http://www.elasticsearch.org/guide/reference/api/search/facets/index.html

curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '
{
  "facets" : {
    "published_on" : {
      "date_histogram" : {
        "field"    : "published_on",
        "interval" : "day"
      }
    }
  }
}'

Geo Distance Facets

curl -X POST "http://localhost:9200/venues/_search?pretty=true" -d '
{
  "query" : { "query_string" : { "query" : "pizzeria" } },
  "facets" : {
    "distance_count" : {
      "geo_distance" : {
        "pin.location" : {
          "lat" : 50.071712,
          "lon" : 14.386832
        },
        "ranges" : [
          { "to" : 1 },
          { "from" : 1, "to" : 5 },
          { "from" : 5, "to" : 10 }
        ]
      }
    }
  }
}'

http://www.elasticsearch.org/guide/reference/api/search/facets/geo-distance-facet.html

Remember?

def analyze content
  # >>> Split content by words into "tokens"
  content.split(/\W/).
  # >>> Downcase every word
  map { |word| word.downcase }.
  # ...
end

curl -X DELETE "http://localhost:9200/articles"
curl -X PUT    "http://localhost:9200/articles" -d '
{
  "mappings": {
    "article": {
      "properties": {
        "tags":    { "type": "string", "analyzer": "keyword" },
        "content": { "type": "string", "analyzer": "snowball" },
        "title":   { "type": "string", "analyzer": "snowball", "boost": 10.0 }
      }
    }
  }
}'
curl -X GET 'http://localhost:9200/articles/_mapping?pretty=true'

http://www.elasticsearch.org/guide/reference/api/admin-indices-create-index.html

curl -X DELETE "http://localhost:9200/urls"
curl -X PUT    "http://localhost:9200/urls" -d '
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "url_analyzer" : {
            "type" : "custom",
            "tokenizer" : "lowercase",
            "filter" : ["stop", "url_stop", "url_ngram"]
          }
        },
        "filter" : {
          "url_stop" : {
            "type" : "stop",
            "stopwords" : ["http", "https", "www"]
          },
          "url_ngram" : {
            "type" : "nGram",
            "min_gram" : 3,
            "max_gram" : 5
          }
        }
      }
    }
  }
}'

https://gist.github.com/988923

curl -X PUT localhost:9200/urls/url/_mapping -d '
{
  "url": {
    "properties": { "url": { "type": "string", "analyzer": "url_analyzer" } }
  }
}'

curl -X POST localhost:9200/urls/url -d '{ "url" : "http://urlaubinkroatien.de" }'
curl -X POST localhost:9200/urls/url -d '{ "url" : "http://besteurlaubinkroatien.de" }'
curl -X POST localhost:9200/urls/url -d '{ "url" : "http://kroatien.de" }'
curl -X POST localhost:9200/urls/_refresh

curl "http://localhost:9200/urls/_search?pretty=true&q=url:kroatien"
curl "http://localhost:9200/urls/_search?pretty=true&q=url:urlaub"
curl "http://localhost:9200/urls/_search?pretty=true&q=url:(urlaub AND kroatien)"

https://gist.github.com/988923

Trigrams

K R O A T I E N
K R O
  R O A
    O A T
      A T I
        T I E
          I E N
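The same decomposition can be reproduced with a few lines of Ruby. The helper `ngrams` below is a hypothetical illustration that mirrors the `min_gram` and `max_gram` settings of the `url_ngram` filter; it is not how ElasticSearch implements the filter, and the order in which the real filter emits its grams may differ.

```ruby
# Hypothetical helper mirroring the "url_ngram" settings (min_gram 3,
# max_gram 5): every substring of each allowed length, shortest first.
def ngrams(token, min_gram = 3, max_gram = 5)
  chars = token.chars.to_a
  (min_gram..max_gram).flat_map do |size|
    chars.each_cons(size).map { |gram| gram.join }
  end
end

trigrams = ngrams("kroatien", 3, 3)
puts trigrams.inspect

# A query term can match a longer token when their n-grams overlap.
shared = ngrams("kroatien") & ngrams("besteurlaubinkroatien")
puts "Shared n-grams: #{shared.size}"
```

Because both the indexed URL and the query term are broken into the same kind of fragments, a search for "kroatien" can find "besteurlaubinkroatien" even though no whole word matches.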

Tire.index 'articles' do
  delete
  create

  store :title => 'One',   :tags => ['ruby'],           :published_on => '2011-01-01'
  store :title => 'Two',   :tags => ['ruby', 'python'], :published_on => '2011-01-02'
  store :title => 'Three', :tags => ['java'],           :published_on => '2011-01-02'
  store :title => 'Four',  :tags => ['ruby', 'php'],    :published_on => '2011-01-03'

  refresh
end

s = Tire.search 'articles' do
  query { string 'title:T*' }

  filter :terms, :tags => ['ruby']

  sort { title 'desc' }

  facet('global-tags')  { terms :tags, :global => true }
  facet('current-tags') { terms :tags }
end

http://github.com/karmi/tire
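The difference between the 'global-tags' and 'current-tags' facets can be imitated in plain Ruby, without ElasticSearch or Tire. In this sketch the global facet counts tags across all four articles, while the other one counts only across the articles matching the query `title:T*`. The helper `terms_facet` is hypothetical, and the prefix match is a simplified stand-in for the real query.

```ruby
# Hypothetical stand-in for a "terms" facet: counts the values of `field`.
def terms_facet(documents, field)
  counts = Hash.new(0)
  documents.each do |doc|
    Array(doc[field]).each { |term| counts[term] += 1 }
  end
  counts
end

# The same four articles as in the Tire example.
articles = [
  { :title => 'One',   :tags => ['ruby'] },
  { :title => 'Two',   :tags => ['ruby', 'python'] },
  { :title => 'Three', :tags => ['java'] },
  { :title => 'Four',  :tags => ['ruby', 'php'] }
]

# Simplified stand-in for the query 'title:T*'.
matching = articles.select { |doc| doc[:title].start_with?('T') }

global_tags  = terms_facet(articles, :tags)
current_tags = terms_facet(matching, :tags)

puts "global-tags:  #{global_tags.inspect}"
puts "current-tags: #{current_tags.inspect}"
```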

class Article < ActiveRecord::Base
  include Tire::Model::Search
  include Tire::Model::Callbacks
end

$ rake environment tire:import CLASS='Article'

Article.search do
  query { string 'love' }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort  { published_on 'desc' }
end

http://github.com/karmi/tire

class Article
  include Whatever::ORM

  include Tire::Model::Search
  include Tire::Model::Callbacks
end

$ rake environment tire:import CLASS='Article'

Article.search do
  query { string 'love' }
  facet('timeline') { date :published_on, :interval => 'month' }
  sort  { published_on 'desc' }
end

http://github.com/karmi/tire

Try ElasticSearch and Tire with a one-line command.

$ rails new tired -m "https://gist.github.com/raw/951343/tired.rb"

A “batteries included” installation. Downloads and launches ElasticSearch. Sets up a Rails application and launches it. When you're tired of it, just delete the folder.

Thanks!
