An Introduction to Elastic Search

AN INTRODUCTION TO Elastic Search
Ankit Bahuguna
TERADATA eCircle DE, Munich
{Ankit.Bahuguna [AT] Teradata.com}
03.03.2014

Agenda
• What is Elastic Search?
• Overview
• INPUT to the database: curl -XPUT
• _search endpoint
• Simple Query Format
• Fields attribute: specifying which field to search within
• Filtered Queries
• Mappings
• Example Queries
• What more can be done?

What is Elastic Search?
• Elastic Search is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.
• Elastic Search is developed in Java and is released as open source under the terms of the Apache License.
• Elastic Search can be used to search all kinds of documents. It provides scalable, near real-time search and supports multitenancy.

Overview
• "Elastic Search is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas.
• Each node hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically."
• It uses Lucene and tries to make all of its features available through the JSON and Java API. It supports faceting and percolating, which can be useful for notifying when new documents match registered queries.
• Another feature, called 'Gateway', handles the long-term persistence of the index, i.e. an index can be recovered from the Gateway in case of a server crash.
• Elastic Search supports real-time GET requests, which makes it suitable as a NoSQL solution, but it lacks distributed transactions.
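The routing mentioned above can be pictured with a small sketch. It assumes the commonly documented scheme in which a routing key (by default the document id) is hashed and taken modulo the number of primary shards. The function name `pick_shard` is hypothetical, and `zlib.crc32` is only a stand-in hash for illustration, not the hash Elastic Search actually uses.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document id) to a primary shard number."""
    # zlib.crc32 is a stand-in for illustration; the real hash function differs.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# The same id always lands on the same shard, so a coordinating node can
# find a document again without asking every shard.
for doc_id in ["1", "2", "3", "4", "5", "6"]:
    print("document", doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, a scheme like this only stays stable as long as that number does not change.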

INPUT - Populating our database

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972,
  "genres": ["Crime", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/2" -d'
{
  "title": "Lawrence of Arabia",
  "director": "David Lean",
  "year": 1962,
  "genres": ["Adventure", "Biography", "Drama"]
}'

INPUT (Continued)

curl -XPUT "http://localhost:9200/movies/movie/3" -d'
{
  "title": "To Kill a Mockingbird",
  "director": "Robert Mulligan",
  "year": 1962,
  "genres": ["Crime", "Drama", "Mystery"]
}'

curl -XPUT "http://localhost:9200/movies/movie/4" -d'
{
  "title": "Apocalypse Now",
  "director": "Francis Ford Coppola",
  "year": 1979,
  "genres": ["Drama", "War"]
}'

curl -XPUT "http://localhost:9200/movies/movie/5" -d'
{
  "title": "Kill Bill: Vol. 1",
  "director": "Quentin Tarantino",
  "year": 2003,
  "genres": ["Action", "Crime", "Thriller"]
}'

curl -XPUT "http://localhost:9200/movies/movie/6" -d'
{
  "title": "The Assassination of Jesse James by the Coward Robert Ford",
  "director": "Andrew Dominik",
  "year": 2007,
  "genres": ["Biography", "Crime", "Drama"]
}'
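The same indexing requests can be prepared from a script. The sketch below uses only the Python standard library and builds the PUT request without sending it, so it runs without a server. The helper name `build_index_request` is hypothetical, and the base URL is the local node the slides assume.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # local node, as in the curl examples


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    # The Content-Type header is added for safety; the curl examples omit it.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, method="PUT", headers=headers)


movie = {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"],
}
request = build_index_request("movies", "movie", 1, movie)
print(request.get_method(), request.full_url)
# To actually send it (needs a running node): urllib.request.urlopen(request)
```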

_search endpoint

We make requests to a URL following this pattern: <index>/<type>/_search, where index and type are both optional. In other words, in order to search for our movies we can make POST requests to any of the following URLs:
• http://localhost:9200/_search - Search across all indexes and all types.
• http://localhost:9200/movies/_search - Search across all types in the movies index.
• http://localhost:9200/movies/movie/_search - Search explicitly for documents of type movie within the movies index.
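The URL pattern above can be expressed as a small helper. This is a sketch only: `search_url` is a hypothetical name, and it covers just the three forms listed on the slide, so it rejects a type given without an index.

```python
def search_url(index=None, doc_type=None, base="http://localhost:9200"):
    """Build a _search URL following the <index>/<type>/_search pattern."""
    if doc_type is not None and index is None:
        # Only the three forms shown on the slide are covered by this sketch.
        raise ValueError("a type can only be given together with an index")
    parts = [base] + [p for p in (index, doc_type) if p is not None] + ["_search"]
    return "/".join(parts)


print(search_url())                    # all indexes, all types
print(search_url("movies"))            # all types in the movies index
print(search_url("movies", "movie"))   # only type movie in the movies index
```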

Simple Query Format

This query searches the entire indexed database for the word "kill", which is present in the title of two of our movies:

curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "query_string": {
      "query": "kill"
    }
  }
}'

"fields" Attribute - Specifies Fields to search in

This query searches for "ford" in a particular field ("title") using the attribute "fields":

curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "query_string": {
      "query": "ford",
      "fields": ["title"]
    }
  }
}'
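Both query_string requests share the same shape, which a short helper makes explicit. This is a sketch that only builds the JSON body; `query_string_body` is a hypothetical name, and sending the body to the _search endpoint is left out.

```python
import json


def query_string_body(text, fields=None):
    """Build a query_string search body, optionally restricted to some fields."""
    inner = {"query": text}
    if fields:
        # Without "fields" the search is not restricted to particular fields.
        inner["fields"] = list(fields)
    return {"query": {"query_string": inner}}


print(json.dumps(query_string_body("kill"), indent=2))
print(json.dumps(query_string_body("ford", ["title"]), indent=2))
```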

Filtered Query

A filtered query has two properties: query and filter. When executed, it filters the result of the query using the filter. For this simple case, where a certain field should match a specific value, a term filter will work well.

curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "drama"
        }
      },
      "filter": {
        "term": { "year": 1962 }
      }
    }
  }
}'
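To see what the filtered query is expected to return, the following sketch applies the same two steps to the six movies in plain Python. It is an in-memory illustration, not how Elastic Search evaluates the request: the word matching in `matches_query` is a crude stand-in for full-text search, and all function names are hypothetical.

```python
import re

# The six movies indexed earlier: (title, director, year, genres).
MOVIES = [
    ("The Godfather", "Francis Ford Coppola", 1972, ["Crime", "Drama"]),
    ("Lawrence of Arabia", "David Lean", 1962, ["Adventure", "Biography", "Drama"]),
    ("To Kill a Mockingbird", "Robert Mulligan", 1962, ["Crime", "Drama", "Mystery"]),
    ("Apocalypse Now", "Francis Ford Coppola", 1979, ["Drama", "War"]),
    ("Kill Bill: Vol. 1", "Quentin Tarantino", 2003, ["Action", "Crime", "Thriller"]),
    ("The Assassination of Jesse James by the Coward Robert Ford",
     "Andrew Dominik", 2007, ["Biography", "Crime", "Drama"]),
]


def matches_query(movie, word):
    """Crude full-text match: is the word present in any text value of the movie?"""
    title, director, _year, genres = movie
    text = " ".join([title, director] + genres).lower()
    return word.lower() in re.findall(r"[a-z0-9]+", text)


def filtered(movies, word, year):
    """Keep movies that match the query word and pass the exact-value year filter."""
    return [m for m in movies if m[2] == year and matches_query(m, word)]


hits = filtered(MOVIES, "drama", 1962)
print([m[0] for m in hits])
```

With the data from the slides, the two movies from 1962 that carry the "Drama" genre remain.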

Filtering without a query [1]

Say we just want to apply the filter, i.e. we want all movies matching a certain criterion.

Solution 1: Replace the query_string query in the filtered query with a "match_all" query, which is a query that simply matches everything.

curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "filtered": {
      "query": {
        "match_all": { }
      },
      "filter": {
        "term": { "year": 1962 }
      }
    }
  }
}'

Filtering without a query [2]

Solution 2: Use a "constant_score" query.

curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "year": 1962 }
      }
    }
  }
}'
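The two filter-only forms wrap the same filter clause in different ways. The sketch below builds both bodies side by side; the function names are hypothetical and nothing is sent to a server.

```python
import json


def filter_only_filtered(filter_clause):
    """Solution 1: a filtered query whose query part matches everything."""
    return {"query": {"filtered": {"query": {"match_all": {}}, "filter": filter_clause}}}


def filter_only_constant_score(filter_clause):
    """Solution 2: a constant_score query that wraps just the filter."""
    return {"query": {"constant_score": {"filter": filter_clause}}}


year_1962 = {"term": {"year": 1962}}
print(json.dumps(filter_only_filtered(year_1962), indent=2))
print(json.dumps(filter_only_constant_score(year_1962), indent=2))
```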

Mappings

This is a problematic query:

curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "director": "Francis Ford Coppola" }
      }
    }
  }
}'

What is the problem? The query returns zero total hits, even though we have indexed two movies with "Francis Ford Coppola" as director.

Reason: INDEX is different from the returned _source data
• Elastic Search has a JSON object with our data, which it returns in search results as the _source property, but that is not what it has in its index.
• When we index a document with Elastic Search, it does two things:
  1. Stores the original data untouched for later retrieval in the form of _source.
  2. Indexes each JSON property into one or more fields in a Lucene index.
• During indexing it processes each field according to how the field is mapped. If a field isn't mapped, a default mapping depending on the field's type (string, number etc.) is used.
• As we haven't supplied any mappings for our index, Elastic Search uses the default mapping for strings for the director field. This means that in the index the director field's value isn't "Francis Ford Coppola". Instead it's something more like ["francis", "ford", "coppola"], since the default analysis splits the text into words and lowercases them.
• This can be verified by modifying our filter to instead match "francis" (or "ford" or "coppola"): we get two hits!
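The mismatch can be reproduced in a few lines. This sketch assumes that default string analysis splits text into words and lowercases them; `analyze` is a rough approximation of that behaviour, not the real analyzer, and both function names are hypothetical.

```python
import re


def analyze(text):
    """Rough approximation of default string analysis: lowercase, split into words."""
    return re.findall(r"\w+", text.lower())


def term_matches(indexed_terms, term):
    """A term filter compares its value to the indexed terms as-is, without analysis."""
    return term in indexed_terms


indexed = analyze("Francis Ford Coppola")
print(indexed)
print(term_matches(indexed, "Francis Ford Coppola"))  # the full name is not a term
print(term_matches(indexed, "francis"))               # a single lowercased word is
```

The full name never appears as a single term in the index, which is why the term filter on the director field finds nothing.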

Solution [1]

We modify how the field is mapped. There are a number of ways to add mappings to Elastic Search: through a configuration file, as part of an HTTP request that creates an index, and by calling the _mapping endpoint. Here we add a mapping for the "director" field instructing Elastic Search not to analyze (tokenize etc.) the field at all when indexing it, like this:

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
  "movie": {
    "properties": {
      "director": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}'

Problems with Solution [1]
1. It won't work, as there already is a mapping for the field: the request fails with an error.
2. In many cases it's not possible to modify existing mappings. Workaround: create a new index with the desired mappings and re-index all of the data into the new index.
3. Even if we could add it, we would have limited our ability to search in the director field. That is, while a search for the exact value in the field would match, we wouldn't be able to search for single words in the field.

SIMPLE SOLUTION [2]: Upgrading the field to a multi-field.

We'll map the field multiple times for indexing. Given that one of the ways we map it matches the existing mapping both by name and settings, this will work fine and we won't have to create a new index.

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
  "movie": {
    "properties": {
      "director": {
        "type": "multi_field",
        "fields": {
          "director": { "type": "string" },
          "original": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}'

WHAT DID WE DO?

We told Elastic Search that whenever it sees a property named "director" in a movie document that is about to be indexed in the movies index, it should index it multiple times: once into a field with the same name, "director", and once into a field named "director.original". The latter field should not be analyzed, so it keeps the original value and allows us to filter by the exact director name.
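The effect of indexing the property twice can be sketched in memory. The analysis step is again a rough approximation (lowercase and split into words), and `index_director` is a hypothetical helper, not part of any Elastic Search API.

```python
import re


def index_director(value):
    """Index one director value twice: analyzed, and untouched as a single term."""
    return {
        "director": re.findall(r"\w+", value.lower()),  # analyzed: individual words
        "director.original": [value],                   # not analyzed: exact value
    }


fields = index_director("Francis Ford Coppola")
print(fields)
print("ford" in fields["director"])                            # word search still works
print("Francis Ford Coppola" in fields["director.original"])   # exact match now works
```

Both kinds of lookup are available at once, which is the point of the multi-field mapping.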

Final Execution

With our shiny new mapping in place we can re-index one or both of the movies directed by Francis Ford Coppola (copy from the list of initial indexing requests) and try the search request that filtered by director again. Only this time we don't filter on the "director" field (which is indexed the same way as before) but on the "director.original" field:

curl -XPOST "http://localhost:9200/_search" -d'
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "director.original": "Francis Ford Coppola" }
      }
    }
  }
}'

Ex 1. Filtered Query with Range

GET _search
{
  "filter": {
    "and": [
      {
        "range": {
          "income": { "from": "1000", "to": "2000" }
        }
      },
      {
        "prefix": { "name": "ma" }
      }
    ]
  }
}
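The combination of a range filter and a prefix filter can be illustrated on a few records. The people below are made-up sample data, not part of the deck, and the sketch assumes that both range bounds are inclusive and that names are stored lowercased.

```python
# Hypothetical sample records for illustration only.
PEOPLE = [
    {"name": "matthias", "income": 1500},
    {"name": "maria", "income": 2500},
    {"name": "patrick", "income": 1200},
    {"name": "max", "income": 2000},
]


def in_range(value, low, high):
    """Range check with both bounds inclusive (an assumption of this sketch)."""
    return low <= value <= high


def and_filter(docs, low, high, prefix):
    """Keep documents that pass both the income range and the name prefix."""
    return [
        d for d in docs
        if in_range(d["income"], low, high) and d["name"].startswith(prefix)
    ]


hits = and_filter(PEOPLE, 1000, 2000, "ma")
print([d["name"] for d in hits])
```

A record must pass every clause of the "and" filter, so "maria" is dropped for income and "patrick" for the prefix.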

Ex 2. Boolean NOT

GET _search
{
  "query": {
    "query_string": {
      "query": "NOT Patrick"
    }
  }
}

Ex 3. Complex Boolean Query

GET _search
{
  "query": {
    "query_string": {
      "query": "name:patrick OR name:Thomas OR name:matthias AND (IDE:Eclipse OR IDE:NetBeans)",
      "use_dis_max": true
    }
  }
}

More Stuff that we can do with Elastic Search
• We can create search requests where we specify how many hits we want, use highlighting, get spelling suggestions and much more.
• The query DSL contains many interesting queries and filters that we can use.
• There is a whole range of facets that we can use to extract statistics from our data or build navigation.
• We can go far beyond the simple mapping example we've seen here to accomplish wonderful and interesting things.
• Performance optimizations and considerations.
• Functionality to find similar content.

Thank You! Questions?
