Unraveling Elasticsearch queries - How to create an "intelligent" search


Who am I? @guilhermeguitte
• Leroy Merlin Brasil.
• Co-organizer of the Laravel Meetup in São Paulo.
• Software Developer.
• Scrum Master.
http://www.guitte.org

Before digging into Elasticsearch...

What is "elasticsearch"?

• Real-Time Data
• Real-Time Advanced Analytics
• Massively Distributed
• High Availability
• Multitenancy
• Full-Text Search
• Document-Oriented
• Schema-Free
• Developer-Friendly, RESTful API
• Per-Operation Persistence
• Built on top of Apache Lucene™

What is an "index"?

It's like a database in a traditional relational database system.

GET http://localhost:9200/web/orders/_search (the index is "web")

What is a "type"?

GET http://localhost:9200/web/orders/_search (the type is "orders")

What is an "inverted index"?

It's like...
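An inverted index maps each term to the documents that contain it. A minimal sketch in plain Python, assuming a naive lowercase-and-split tokenization and hypothetical sample data (this is not how Lucene stores it internally):

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical sample data, not taken from a real index.
docs = {
    1: "John Upton",
    2: "John Borer",
    3: "Dolly Upton",
}
index = build_inverted_index(docs)

print(sorted(index["john"]))   # [1, 2]
print(sorted(index["upton"]))  # [1, 3]
```

Looking up a term is then a single dictionary access instead of a scan over every document.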

What have you learned?
• The basic jargon of Elasticsearch.
• What an index is.
• What a type is.
• What Elasticsearch is.

Now, with the basic jargon of Elasticsearch...

Be ready!

Queries

Basic structure

GET http://localhost:9200/web/orders/_search
{ "query": {} }
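A possible way to build that request from Python using only the standard library. The helper name build_search_request and the "match_all" body are assumptions for illustration; the request is only constructed here, not sent:

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:9200/web/orders/_search"


def build_search_request(query):
    """Wrap a query clause in the {"query": ...} envelope of the search API."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


# "match_all" is used as a placeholder query that matches every document.
request = build_search_request({"match_all": {}})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending it needs a running Elasticsearch node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```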

Structured search

"Finding documents that exactly match the query"

The result will be "YES" or "NO".

SELECT * FROM orders WHERE status = "received"

GET http://localhost:9200/web/orders/_search
{
  "query": {
    "term": { "status": "received" }
  }
}

Response:
{
  "took": 24,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] }
}

GET http://localhost:9200/web/orders/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "status": "received" }
      }
    }
  }
}

Response:
{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }
}

That's a simple start, but the real world is not that simple.

Get used to "bool" queries

GET http://localhost:9200/web/orders/_search
{
  "query": {
    "bool": {
      "must": [],
      "should": [],
      "must_not": [],
      "filter": []
    }
  }
}

GET http://localhost:9200/web/orders/_search
{
  "query": {
    "bool": {
      "must": { "term": { "status": "received" } }
    }
  }
}

Response:
{
  "took": 24,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] }
}

GET http://localhost:9200/web/orders/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": { "term": { "status": "received" } }
        }
      }
    }
  }
}

Response:
{
  "took": 12,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }
}

"Bool" structure is very flexible

GET http://localhost:9200/web/orders/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "term": { "status": "received" } },
            { "bool": { "must": { "term": { "customer": "Prof. Shaylee Greenholt" } } } }
          ]
        }
      }
    }
  }
}

Response:
{
  "took": 12,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }
}
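Because "bool" clauses nest freely, it can help to compose them from small functions. A sketch that rebuilds the nested query above (bool_query and term are hypothetical helper names; only non-empty clauses are emitted):

```python
def term(field, value):
    """Exact-match clause on a single field."""
    return {"term": {field: value}}


def bool_query(must=None, should=None, must_not=None, filter_=None):
    """Build a "bool" clause, leaving out any section that was not given."""
    sections = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,
    }
    return {"bool": {name: value for name, value in sections.items() if value}}


nested = {
    "query": {
        "constant_score": {
            "filter": bool_query(
                should=[
                    term("status", "received"),
                    bool_query(must=term("customer", "Prof. Shaylee Greenholt")),
                ]
            )
        }
    }
}
print(nested)
```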

What types of queries does Elasticsearch have?

• Term: { "term": { "status": "received" }}
• Terms: { "terms": { "status": ["received", "delivering"] }}
• Range: { "range": { "total": {"gte": 100.5, "lte": 140.5 }}}
• Exists: { "exists": { "field": "region" }}
• Missing: { "missing": { "field": "region" }}
• Prefix: { "prefix": { "customer": "Dolly" }}
• Wildcard: { "wildcard": { "customer": "Doll*" }}
• Regexp: { "regexp": { "customer": { "value": "Doll.*" }}}
• Fuzzy: { "fuzzy": { "customer": { "value": "Dolly", "fuzziness": 2 }}}
• Type: { "type": { "value": "orders"}}
• Ids: { "ids": { "type": "orders", "values": ["1", "2"]}}
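To get a feel for the "YES" or "NO" nature of these queries, here is a toy in-memory evaluator for a few of them. It is only an approximation under loud assumptions: documents are plain dicts, fields hold exact (not analyzed) values, and fnmatch stands in for the wildcard syntax. Elasticsearch itself evaluates these against its inverted index.

```python
from fnmatch import fnmatchcase


def matches(doc, query):
    """Return True if the dict `doc` satisfies a single term-level query."""
    (kind, body), = query.items()
    if kind == "exists":
        return doc.get(body["field"]) is not None
    (field, spec), = body.items()
    value = doc.get(field)
    if value is None:
        return False
    if kind == "term":
        return value == spec
    if kind == "terms":
        return value in spec
    if kind == "range":
        # A missing bound defaults to the value itself, so it always passes.
        return spec.get("gte", value) <= value <= spec.get("lte", value)
    if kind == "prefix":
        return value.startswith(spec)
    if kind == "wildcard":
        return fnmatchcase(value, spec)
    raise ValueError(f"unsupported query type: {kind}")


# Hypothetical order document.
order = {"status": "received", "total": 120.0, "customer": "Dolly Upton"}

print(matches(order, {"term": {"status": "received"}}))                 # True
print(matches(order, {"range": {"total": {"gte": 100.5, "lte": 140.5}}}))  # True
print(matches(order, {"exists": {"field": "region"}}))                  # False
print(matches(order, {"wildcard": {"customer": "Doll*"}}))              # True
```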

Any of these queries can contribute to the score if you want, but
when they are all wrapped in "constant_score", no score is calculated.

Structured queries are good for:
• Filtering documents before running the queries that should score them.
• Speed: Elasticsearch can cache filters and reuse them over time.

What have we learned?
• Structured search.
• Boolean (yes/no) matching against documents.
• How to use bool queries.
• Filter documents before running a full-text search.

Full-text search

Two different things about full-text search

Relevance

"How well each document matches the query"

TF/IDF (Term Frequency / Inverse Document Frequency)
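A rough sketch of the idea for a single term, loosely following Lucene's classic similarity (square-root term frequency, logarithmic inverse document frequency, and a length norm that favors short fields). Real scoring has more factors, such as query normalization and boosts, and all numbers below are hypothetical:

```python
import math


def tf_idf_score(term, field_tokens, doc_freq, num_docs):
    """Simplified single-term relevance score for one field of one document."""
    freq = field_tokens.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)                             # more occurrences, higher score
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # rarer terms weigh more
    field_norm = 1.0 / math.sqrt(len(field_tokens))  # shorter fields weigh more
    return tf * idf * field_norm


# Assumed statistics: 100 documents, 6 of them contain "john".
short_name = tf_idf_score("john", ["john", "upton"], doc_freq=6, num_docs=100)
long_name = tf_idf_score("john", ["mr.", "john", "cartwright", "iii"], doc_freq=6, num_docs=100)

print(short_name > long_name)  # True: the same term in a shorter field scores higher
```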

• Proximity to a geolocation
• Fuzzy similarity
• ...

_score max_score

The simplest query

GET http://localhost:9200/web/orders/_search
{ "query": { "match": { "customer": "John" } } }

Hits:
{ "_score": 4.6189003, "customer": "John Upton" },
{ "_score": 4.6189003, "customer": "John Borer" },
{ "_score": 4.6189003, "customer": "John Emard" },
{ "_score": 4.06103, "customer": "John Runolfsdottir IV" },
{ "_score": 3.8275056, "customer": "Mr. John Cartwright III" },
{ "_score": 3.8275056, "customer": "John Hodkiewicz DDS" }

GET http://localhost:9200/web/orders/_search
{ "query": { "match": { "customer": "Joh" } } }

Response:
{
  "took": 15,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 0, "max_score": null, "hits": [] }
}

Elasticsearch persists data differently from what you
are used to.

To understand full-text search in Elasticsearch, you first need to
understand how Elasticsearch persists your data.

It's called "Analysis"

It is a pipeline with the following steps:
• Create the mapping ("schema") for the web index, if it does not exist yet.
• Receive the document from the Index API.
• Iterate over each field and check whether it is analyzed.
• Run the analyzer for that field.
• Persist the data.

The document
{
  "customer": "Dr. Emiliano, the Mitchell Sr.",
  "items": [ { "id": 8510629, "value": 769 } ],
  "total": 2874.81,
  "created_at": "15/11/2015 17:00:57",
  "status": "delivered",
  "region": "sao_paulo",
  "shipping_fees": 719
}

The document: "customer": "Dr. Emiliano, the Mitchell Sr."

The document: "Dr. Emiliano, the Mitchell Sr."
Analyzer:
• Char filter: html_strip
• Tokenizer: whitespace
• Token filter: lowercase

Char filter (html_strip)
Input: "Dr. Emiliano, the Mitchell Sr."
Output (string): "Dr. Emiliano, the Mitchell Sr."

Tokenizer (whitespace)
Input: "Dr. Emiliano, the Mitchell Sr."
Output (tokens): "Dr." "Emiliano," "the" "Mitchell" "Sr."

Token filter (lowercase)
Input: "Dr." "Emiliano," "the" "Mitchell" "Sr."
Output: "dr." "emiliano," "the" "mitchell" "sr."
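The three stages can be imitated in a few lines of Python. This is only an approximation of the analyzer shown in the slides: the regex-based html_strip below is a crude stand-in for the real char filter, and all function names are made up for the example.

```python
import html
import re


def html_strip(text):
    """Char filter: drop HTML tags and decode entities (rough approximation)."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def whitespace_tokenizer(text):
    """Tokenizer: split on whitespace only, so punctuation stays attached."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run char filter, tokenizer and token filter in that order."""
    return lowercase(whitespace_tokenizer(html_strip(text)))


print(analyze("Dr. Emiliano, the Mitchell Sr."))
# ['dr.', 'emiliano,', 'the', 'mitchell', 'sr.']
```

Note that the comma stays glued to "emiliano," because the whitespace tokenizer does not touch punctuation.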

Persist it

Token         Doc ID
"dr."         1
"emiliano,"   1
"the"         1
"mitchell"    1
"sr."         1

TF/IDF (Term Frequency / Inverse Document Frequency) is what generates the score.

GET http://localhost:9200/web/orders/_search
{ "query": { "match": { "customer": "Mitchell" } } }

The term: "Mitchell"
Analyzer (the same one used at index time):
• Char filter: html_strip
• Tokenizer: whitespace
• Token filter: lowercase
Result: "mitchell"

Lookup: the analyzed term "mitchell" is found among the persisted tokens ("mitchell", "sr.", ...), so the document is a hit.
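Putting index time and query time together explains why a partial term such as the earlier "Joh" finds nothing. A toy sketch, assuming the same whitespace + lowercase analysis on both sides (the char filter is left out, and index_document and match are hypothetical names):

```python
from collections import defaultdict


def analyze(text):
    """Whitespace tokenizer followed by a lowercase token filter."""
    return [token.lower() for token in text.split()]


inverted_index = defaultdict(set)


def index_document(doc_id, customer):
    """Index time: analyze the field and persist token -> doc ID."""
    for token in analyze(customer):
        inverted_index[token].add(doc_id)


def match(text):
    """Query time: analyze the input and collect docs containing any token."""
    hits = set()
    for token in analyze(text):
        hits |= inverted_index.get(token, set())
    return hits


index_document(1, "Dr. Emiliano, the Mitchell Sr.")

print(match("Mitchell"))  # {1}    exact token after lowercasing
print(match("Mitch"))     # set()  a partial term is not a token
print(match("Emiliano"))  # set()  the indexed token is "emiliano," with the comma
```

The last case is a hint that the choice of tokenizer matters: with this analyzer, a user typing "Emiliano" would not find the order.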

What types of queries does Elasticsearch have?

• Match: { "match": { "name": "John" }}
• multi_match: { "multi_match": { "query": "John", "fields": ["name.raw", "name.autocomplete"]}}
• ...

What have we learned?
• Understanding the analysis process is a must to understand how to search.
• Analyzers are composed of char filters, tokenizers and token filters.
• You need to understand how users will search in order to write the right query for Elasticsearch.

References
• https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
• https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
• https://www.youtube.com/playlist?list=PLZ4puV97Zwm2fEmTLrPsP7QgLsjnnQggX
• Official PHP elasticsearch package: https://github.com/elastic/elasticsearch-php
• https://github.com/sleimanx2/plastic

Thanks! @guilhermeguitte http://www.guitte.org
