An Introduction to Elastic Search

Slide 1

Slide 1 text

AN INTRODUCTION TO Elastic Search
Ankit Bahuguna
TERADATA eCircle DE, Munich
{Ankit.Bahuguna [AT] Teradata.com}
03.03.2014

Slide 2

Slide 2 text

Agenda
• What is Elastic Search?
• Overview
• INPUT to the database: curl -XPUT
• _search endpoint
• Simple Query Format
• Fields attribute: specifying which field to search within
• Filtered Queries
• Mappings
• Example Queries
• What more can be done?

Slide 3

Slide 3 text

What is Elastic Search?
• Elastic Search is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.
• Elastic Search is developed in Java and is released as open source under the terms of the Apache License.
• Elastic Search can be used to search all kinds of documents. It provides scalable, near real-time search and supports multitenancy.

Slide 4

Slide 4 text

Overview
• "Elastic Search is distributed, which means that indices can be divided into shards and each shard can have zero or more replicas.
• Each node hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically."
• It uses Lucene and tries to make all of Lucene's features available through its JSON and Java APIs. It supports faceting and percolating; percolation can be used to send notifications when new documents match registered queries.
• Another feature, called 'Gateway', handles the long-term persistence of the index: an index can be recovered from the Gateway after a server crash.
• Elastic Search supports real-time GET requests, which makes it suitable as a NoSQL data store, but it lacks distributed transactions.

Slide 5

Slide 5 text

INPUT - Populating our database

    curl -XPUT "http://localhost:9200/movies/movie/1" -d'
    {
        "title": "The Godfather",
        "director": "Francis Ford Coppola",
        "year": 1972,
        "genres": ["Crime", "Drama"]
    }'

    curl -XPUT "http://localhost:9200/movies/movie/2" -d'
    {
        "title": "Lawrence of Arabia",
        "director": "David Lean",
        "year": 1962,
        "genres": ["Adventure", "Biography", "Drama"]
    }'

Slide 6

Slide 6 text

INPUT (Continued)

    curl -XPUT "http://localhost:9200/movies/movie/3" -d'
    {
        "title": "To Kill a Mockingbird",
        "director": "Robert Mulligan",
        "year": 1962,
        "genres": ["Crime", "Drama", "Mystery"]
    }'

    curl -XPUT "http://localhost:9200/movies/movie/4" -d'
    {
        "title": "Apocalypse Now",
        "director": "Francis Ford Coppola",
        "year": 1979,
        "genres": ["Drama", "War"]
    }'

    curl -XPUT "http://localhost:9200/movies/movie/5" -d'
    {
        "title": "Kill Bill: Vol. 1",
        "director": "Quentin Tarantino",
        "year": 2003,
        "genres": ["Action", "Crime", "Thriller"]
    }'

    curl -XPUT "http://localhost:9200/movies/movie/6" -d'
    {
        "title": "The Assassination of Jesse James by the Coward Robert Ford",
        "director": "Andrew Dominik",
        "year": 2007,
        "genres": ["Biography", "Crime", "Drama"]
    }'
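The indexing requests above can also be issued from a script. The following is a minimal sketch using only the Python standard library; the helper name build_index_request is hypothetical, and the base URL assumes the same local node used in the curl commands. It only builds the request, so it runs without a server.

```python
import json
import urllib.request

# Assumption: a local node on the default port, as in the curl examples.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document.

    Hypothetical helper for illustration; it mirrors
    curl -XPUT "<base>/<index>/<type>/<id>" -d '<json>'.
    """
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


movie = {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"],
}
request = build_index_request("movies", "movie", 1, movie)
print(request.get_method(), request.full_url)
```

To actually send the request against a running node, pass it to urllib.request.urlopen(request).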

Slide 7

Slide 7 text

_search endpoint
We make requests to a URL following this pattern: <index>/<type>/_search, where index and type are both optional. In other words, to search for our movies we can make POST requests to any of the following URLs:
• http://localhost:9200/_search - Search across all indexes and all types.
• http://localhost:9200/movies/_search - Search across all types in the movies index.
• http://localhost:9200/movies/movie/_search - Search explicitly for documents of type movie within the movies index.
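The URL pattern can be sketched as a small function. The name search_url is hypothetical, and as a simplification this sketch requires an index whenever a type is given, which covers the three URLs listed above.

```python
def search_url(index=None, doc_type=None, base="http://localhost:9200"):
    """Return the _search URL for an optional index and an optional type.

    Hypothetical helper: both path segments are optional, but in this
    sketch a type may only be given together with an index.
    """
    if doc_type is not None and index is None:
        raise ValueError("a type can only be given together with an index")
    segments = [s for s in (index, doc_type) if s is not None]
    return "/".join([base] + segments + ["_search"])


print(search_url())                  # all indexes, all types
print(search_url("movies"))          # all types in the movies index
print(search_url("movies", "movie")) # only type movie in the movies index
```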

Slide 8

Slide 8 text

Simple Query Format
Let's try a search for the word "kill", which is present in the title of two of our movies. This query searches for "kill" across the entire indexed database:

    curl -XPOST "http://localhost:9200/_search" -d'
    {
        "query": {
            "query_string": {
                "query": "kill"
            }
        }
    }'

Slide 9

Slide 9 text

"fields" Attribute - Specifies Fields to search in This query searches for "ford" in a particular category or column ("title") using the attribute "fields“: curl -XPOST "http://localhost:9200/_search" -d' { "query": { "query_string": { "query": "ford", "fields": ["title"] } } }'

Slide 10

Slide 10 text

Filtered Query
A filtered query has two properties: query and filter. When executed, it filters the results of the query using the filter. For this simple case, where a certain field should match a specific value, a term filter works well.

    curl -XPOST "http://localhost:9200/_search" -d'
    {
        "query": {
            "filtered": {
                "query": {
                    "query_string": {
                        "query": "drama"
                    }
                },
                "filter": {
                    "term": { "year": 1962 }
                }
            }
        }
    }'

Slide 11

Slide 11 text

Filtering without a query [1]
Suppose we just want to apply the filter, i.e. we want all movies matching certain criteria.
Solution 1: Replace the query_string query in the filtered query with a "match_all" query, which is a query that simply matches everything.

    curl -XPOST "http://localhost:9200/_search" -d'
    {
        "query": {
            "filtered": {
                "query": {
                    "match_all": { }
                },
                "filter": {
                    "term": { "year": 1962 }
                }
            }
        }
    }'

Slide 12

Slide 12 text

Filtering without a query [2]
Solution 2: "constant_score" query.

    curl -XPOST "http://localhost:9200/_search" -d'
    {
        "query": {
            "constant_score": {
                "filter": {
                    "term": { "year": 1962 }
                }
            }
        }
    }'
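A note for readers on newer versions: the "filtered" query used in these slides was deprecated in Elasticsearch 2.0 and removed in 5.0, where a "bool" query with a "filter" clause is the usual replacement. The sketch below builds both forms side by side; the function names are hypothetical and nothing is sent to a server.

```python
import json


def filtered_query(query, flt):
    """Slide-era form: a 'filtered' query wrapping a query and a filter."""
    return {"query": {"filtered": {"query": query, "filter": flt}}}


def bool_filter_query(query, flt):
    """Later form: a 'bool' query whose 'filter' clause does not affect scoring."""
    return {"query": {"bool": {"must": query, "filter": flt}}}


text_query = {"query_string": {"query": "drama"}}
year_filter = {"term": {"year": 1962}}

old_body = filtered_query(text_query, year_filter)
new_body = bool_filter_query(text_query, year_filter)
print(json.dumps(new_body, indent=2))
```

The "constant_score" form shown on this slide kept the same shape in later versions, so it needs no rewrite.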

Slide 13

Slide 13 text

Mappings
This is a problematic query:

    curl -XPOST "http://localhost:9200/_search" -d'
    {
        "query": {
            "constant_score": {
                "filter": {
                    "term": { "director": "Francis Ford Coppola" }
                }
            }
        }
    }'

What is the problem? The query returns zero total hits, even though we have indexed two movies with "Francis Ford Coppola" as director.

Slide 14

Slide 14 text

Reason: the INDEX is different from the returned _source data
• Elastic Search holds a JSON object with our data, which it returns in search results as the _source property, but that is not what it has in its index.
• When we index a document with Elastic Search, it does two things:
  1. Stores the original data untouched, for later retrieval, in the form of _source.
  2. Indexes each JSON property into one or more fields in a Lucene index.
• During indexing it processes each field according to how the field is mapped. If a field isn't mapped, a default mapping based on the field's type (string, number etc.) is used.
• As we haven't supplied any mappings for our index, Elastic Search uses the default mapping for strings for the director field, which analyzes the value: it is split into tokens and lowercased. This means that in the index the director field's value isn't "Francis Ford Coppola". Instead it's something more like ["francis", "ford", "coppola"].
• This can be verified by modifying our filter to instead match "francis" (or "ford" or "coppola"): we get two hits!
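Why the term filter misses can be illustrated with a toy model. The sketch below is not Lucene's analyzer; it assumes only that the default string analysis splits a value on word boundaries and lowercases each token, and that a term filter compares its value against the indexed tokens exactly, without analyzing it. The names toy_analyze and term_matches are hypothetical.

```python
import re


def toy_analyze(text):
    """Very rough stand-in for default string analysis: split and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def term_matches(indexed_terms, term):
    """A term filter looks for an exact, unanalyzed match among indexed terms."""
    return term in indexed_terms


director_terms = toy_analyze("Francis Ford Coppola")
print(director_terms)                                        # ['francis', 'ford', 'coppola']
print(term_matches(director_terms, "Francis Ford Coppola"))  # False
print(term_matches(director_terms, "ford"))                  # True
```

The full name never appears as a single term in the analyzed field, so the exact-match filter finds nothing, while any single lowercased token matches.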

Slide 15

Slide 15 text

Solution [1]
We modify how the field is mapped. There are a number of ways to add mappings to Elastic Search: through a configuration file, as part of an HTTP request that creates an index, and by calling the _mapping endpoint. Here we add a mapping for the "director" field, instructing Elastic Search not to analyze (tokenize etc.) the field at all when indexing it, like this:

    curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
    {
        "movie": {
            "properties": {
                "director": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }'

Slide 16

Slide 16 text

Problems with Solution [1]
1. It won't work, as there already is a mapping for the field: the request fails with an error.
2. In many cases it's not possible to modify existing mappings. Workaround: create a new index with the desired mappings and re-index all of the data into it.
3. Even if we could add it, we would have limited our ability to search in the director field. That is, while a search for the exact value in the field would match, we wouldn't be able to search for single words in the field.

Slide 17

Slide 17 text

SIMPLE SOLUTION [2]: Upgrading the field to a multi-field.
We'll map the field multiple times for indexing. As long as one of the ways we map it matches the existing mapping, both by name and by settings, the request will work and we won't have to create a new index.

    curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
    {
        "movie": {
            "properties": {
                "director": {
                    "type": "multi_field",
                    "fields": {
                        "director": { "type": "string" },
                        "original": { "type": "string", "index": "not_analyzed" }
                    }
                }
            }
        }
    }'

Slide 18

Slide 18 text

WHAT DID WE DO?
We told Elastic Search that whenever it sees a property named "director" in a movie document that is about to be indexed in the movies index, it should index it twice: once into a field with the same name, "director", and once into a field named "director.original". The latter field is not analyzed, so it keeps the original value, allowing us to filter by the exact director name.
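The effect of the multi-field mapping can be pictured with a toy model, again assuming that analysis amounts to splitting on word boundaries and lowercasing. The function index_director is hypothetical and only mimics which terms would end up under each field name.

```python
import re


def index_director(value):
    """Toy model of the multi-field: one analyzed copy, one untouched copy."""
    return {
        # Analyzed: split into lowercased tokens, good for single-word search.
        "director": [token.lower() for token in re.findall(r"\w+", value)],
        # Not analyzed: the whole value is kept as one term, good for exact filters.
        "director.original": [value],
    }


fields = index_director("Francis Ford Coppola")
print("ford" in fields["director"])                           # True
print("Francis Ford Coppola" in fields["director.original"])  # True
print("Francis Ford Coppola" in fields["director"])           # False
```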

Slide 19

Slide 19 text

Final Execution
With our shiny new mapping in place we can re-index one or both of the movies directed by Francis Ford Coppola (copy from the list of initial indexing requests) and try the search request that filtered by director again. Only this time we don't filter on the "director" field (which is indexed the same way as before) but on the "director.original" field:

    curl -XPOST "http://localhost:9200/_search" -d'
    {
        "query": {
            "constant_score": {
                "filter": {
                    "term": { "director.original": "Francis Ford Coppola" }
                }
            }
        }
    }'

Slide 20

Slide 20 text

Ex 1. Filtered Query with Range

    GET _search
    {
        "filter": {
            "and": [
                { "range": { "income": { "from": "1000", "to": "2000" } } },
                { "prefix": { "name": "ma" } }
            ]
        }
    }
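What the "and" filter selects can be sketched on in-memory data. The three records below are made up for illustration, income is compared as a number, both range bounds are assumed to be inclusive, and the prefix is checked against the raw name rather than against indexed terms.

```python
def in_range(value, low, high):
    """Range check; assumes both bounds are inclusive."""
    return low <= value <= high


def has_prefix(value, prefix):
    """Prefix check on the raw string value."""
    return value.startswith(prefix)


def and_filter(doc):
    """Both conditions must hold, as in the 'and' filter of Ex 1."""
    return in_range(doc["income"], 1000, 2000) and has_prefix(doc["name"], "ma")


# Hypothetical sample documents, not taken from the slides.
people = [
    {"name": "matthias", "income": 1500},
    {"name": "patrick", "income": 1500},
    {"name": "martin", "income": 2500},
]
matches = [person["name"] for person in people if and_filter(person)]
print(matches)  # ['matthias']
```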

Slide 21

Slide 21 text

Ex 2. Boolean NOT

    GET _search
    {
        "query": {
            "query_string": {
                "query": "NOT Patrick"
            }
        }
    }

Slide 22

Slide 22 text

Ex 3. Complex Boolean Query

    GET _search
    {
        "query": {
            "query_string": {
                "query": "name:patrick OR name:Thomas OR name:matthias AND (IDE:Eclipse OR IDE:NetBeans)",
                "use_dis_max": true
            }
        }
    }

Slide 23

Slide 23 text

More Stuff that we can do with Elastic Search
• We can create search requests where we specify how many hits we want, use highlighting, get spelling suggestions and much more.
• The query DSL contains many interesting queries and filters that we can use.
• There is a whole range of facets that we can use to extract statistics from our data or to build navigation.
• We can go far beyond the simple mapping example we've seen here to accomplish wonderful and interesting things.
• Performance optimizations and considerations.
• Functionality to find similar content.

Slide 24

Slide 24 text

Thank You! Questions?