Unraveling Elasticsearch queries - How to create an "intelligent" search

Guilherme Guitte
September 14, 2016

Who am I?
@guilhermeguitte
Leroy Merlin Brasil.
Co-organizer of the Laravel Meetup in São Paulo.
Software Developer.
Scrum Master.
http://www.guitte.org

Before digging into Elasticsearch...

What is "Elasticsearch"?

Real-Time Data
Real-Time Advanced Analytics
Massively Distributed
High Availability
Multitenancy
Full-Text Search
Document-Oriented
Schema-Free
Developer-Friendly, RESTful API
Per-Operation Persistence
Built on top of Apache Lucene™

What is an "index"?

It's like a database in a traditional relational database system.

GET http://localhost:9200/web/orders/_search
index: web

What is a "type"?

GET http://localhost:9200/web/orders/_search
type: orders
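
The two path segments can be made explicit with a small helper. This is only a sketch: `search_url` is a hypothetical function name, and the host and port are the ones shown on the slides.

```python
from urllib.parse import quote


def search_url(index, doc_type, host="http://localhost:9200"):
    """Build the _search endpoint for an index and a type (hypothetical helper)."""
    # Path layout used on the slides: /<index>/<type>/_search
    return f"{host}/{quote(index)}/{quote(doc_type)}/_search"


url = search_url("web", "orders")
print(url)  # http://localhost:9200/web/orders/_search
```

Mapping types exist in the Elasticsearch versions this deck targets. Later versions removed them, so this path layout is tied to that era.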

What is an "inverted index"?

It's like...
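
One way to picture it is the index at the back of a book: each word points to the places where it appears. A minimal sketch in Python, assuming lowercased, whitespace-separated tokens. The function name is hypothetical and the third sample document is made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {1: "John Upton", 2: "John Borer", 3: "Dolly Borer"}
index = build_inverted_index(docs)
print(index["john"])   # [1, 2]
print(index["borer"])  # [2, 3]
```

Looking up a token is a dictionary access, which is why finding every document that contains a word stays cheap as the number of documents grows.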

What have you learned?
Basic jargon of Elasticsearch.
What an index is.
What a type is.
What Elasticsearch is.

Now, with the basic jargon of Elasticsearch...

Be ready!

Queries

Basic structure

{ "query": {}}

GET http://localhost:9200/web/orders/_search

Structured search

"Finding documents that exactly match the query"

The result will be "YES" or "NO".

SELECT *
FROM orders
WHERE status = 'received'

{ "query": { "term" : { "status": "received" } }}

GET http://localhost:9200/web/orders/_search
{ "took": 24, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] }}

{ "query": { "constant_score" : { "filter": { "term" : { "status": "received" } } } }}

GET http://localhost:9200/web/orders/_search
{ "took": 7, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }}
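
The two request bodies above differ only in the wrapper around the same `term` clause. A small sketch that builds both as Python dictionaries; `constant_score` is a hypothetical helper name, and nothing is sent to a server here.

```python
import json


def constant_score(filter_query):
    """Wrap a query so it runs as a non-scoring filter (hypothetical helper)."""
    return {"query": {"constant_score": {"filter": filter_query}}}


term = {"term": {"status": "received"}}

scored_body = {"query": term}          # scored: max_score was 2.4266458 on the slide
filtered_body = constant_score(term)   # not scored: max_score was 1 on the slide

print(json.dumps(filtered_body))
```

Both bodies return the same 3041 hits on the slides. Only the scoring differs.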

That is a simple start, but the real world is not that simple.

Get used to "bool" queries

{ "query": { "bool" : { "must" : [], "should" : [], "must_not" : [], "filter": [] } }}

GET http://localhost:9200/web/orders/_search

{ "query": { "bool" : { "must": { "term": { "status": "received" } } } }}

GET http://localhost:9200/web/orders/_search
{ "took": 24, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] }}

{ "query": { "constant_score": { "filter": { "bool" : { "must": { "term": { "status": "received" } } } } } }}

GET http://localhost:9200/web/orders/_search
{ "took": 12, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }}

The "bool" structure is very flexible

{ "query": { "constant_score": { "filter": { "bool" : { "should": [ { "term": { "status": "received" } }, { "bool": { "must": { "term": { "customer": "Prof. Shaylee Greenholt" } } } } ] } } } }}

GET http://localhost:9200/web/orders/_search
{ "took": 12, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }}
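
To make the nesting concrete, here is a toy evaluator for a tiny subset of the query DSL (`term` and `bool` only), run against plain dictionaries. It is a sketch of the boolean logic, not of how Elasticsearch executes queries. It assumes fields are compared as exact, not analyzed, values, and that a `bool` with no `must` or `filter` clause needs at least one `should` clause to match. The first and third sample orders are made up.

```python
def matches(doc, query):
    """Evaluate a tiny subset of the query DSL (term and bool) against a dict."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        field, value = next(iter(body.items()))
        return doc.get(field) == value  # assumption: exact, not analyzed, values
    if kind == "bool":
        def clauses(name):
            found = body.get(name, [])
            # The DSL accepts either a single clause or a list of clauses.
            return found if isinstance(found, list) else [found]

        required = clauses("must") + clauses("filter")
        optional = clauses("should")
        if not all(matches(doc, q) for q in required):
            return False
        if any(matches(doc, q) for q in clauses("must_not")):
            return False
        # Assumption: with no required clause, at least one "should" must match.
        if optional and not required:
            return any(matches(doc, q) for q in optional)
        return True
    raise ValueError(f"unsupported query type: {kind}")


# The bool query from the slide above, without the constant_score wrapper.
query = {"bool": {"should": [
    {"term": {"status": "received"}},
    {"bool": {"must": {"term": {"customer": "Prof. Shaylee Greenholt"}}}},
]}}

orders = [
    {"status": "received", "customer": "John Upton"},
    {"status": "delivered", "customer": "Prof. Shaylee Greenholt"},
    {"status": "delivered", "customer": "John Borer"},
]
results = [matches(order, query) for order in orders]
print(results)  # [True, True, False]
```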

What types of queries does Elasticsearch have?

Term
Terms
Range
Exists
Missing
Prefix
Wildcard
Regexp
Fuzzy
Type
Ids
{ "term": { "status": "received" }}
{ "terms": { "status": ["received", "delivering"] }}
{ "range": { "total": {"gte": 100.5, "lte": 140.5 }}}
{ "exists": { "field": "region" }}
{ "missing": { "field": "region" }}
{ "prefix": { "customer": "Dolly" }}
{ "wildcard": { "customer": "Doll*" }}
{ "regexp": { "customer": { "value": "Doll.*" }}}
{ "fuzzy": { "customer": { "value": "Dolly", "fuzziness": 2 }}}
{ "type": { "value": "orders"}}
{ "ids": { "type": "orders", "values": ["1", "2"]}}
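
The pattern-based queries are easy to mix up, so here is a rough local imitation of four of them against a single indexed term. It is a sketch under assumptions: the term is taken to be stored as-is (not analyzed), `fnmatchcase` stands in for wildcard matching, Python's `re.fullmatch` stands in for the regexp query (which also has to match the whole term), and plain Levenshtein distance stands in for fuzzy matching.

```python
import re
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


term = "Dolly"  # one indexed term, assumed not analyzed

checks = {
    "prefix Doll": term.startswith("Doll"),
    "wildcard Doll*": fnmatchcase(term, "Doll*"),
    "regexp Doll*": re.fullmatch("Doll*", term) is not None,
    "regexp Doll.*": re.fullmatch("Doll.*", term) is not None,
    "fuzzy Doly (fuzziness 2)": edit_distance("Doly", term) <= 2,
}
for name, matched in checks.items():
    print(name, matched)
```

In a regular expression `*` repeats the previous character, so `Doll*` does not match `Dolly`, while `Doll.*` does. In a wildcard pattern `*` alone already means "any characters".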

Any of these queries can be boosted to influence the score, but if they are all wrapped inside "constant_score", no score is calculated.

Structured queries are good for:
Filtering documents before running the queries that should score your documents.
Speed: Elasticsearch can cache them and reuse them over time.
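
A sketch of "filter first, then score", written as the request body in Python. It uses the `filter` clause of `bool` shown earlier for the yes/no part and a `match` clause (covered in the full-text section) for the scored part. The combination is an illustration and does not appear on the slides.

```python
import json

body = {
    "query": {
        "bool": {
            # Scored: contributes to _score.
            "must": {"match": {"customer": "John"}},
            # Not scored: only narrows down the candidate documents.
            "filter": {"term": {"status": "received"}},
        }
    }
}

print(json.dumps(body, indent=2))
```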

What have we learned?
Structured search.
Boolean (yes/no) matching against documents.
Use bool queries.
Filter documents before running a full-text search.

Full-text search

Two different things about full-text search

Relevance

"How well each document matches the query"

TF/IDF
(Term Frequency / Inverse Document Frequency)
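
A simplified sketch of the idea, following the shape of Lucene's classic TF/IDF similarity: a term counts for more the more often it appears in a field, the rarer it is across the index, and the shorter the field is. Real scores also involve query normalization, a coordination factor, boosts and a lossy encoding of the field norm, so the numbers below will not reproduce the scores on the slides. The document counts are made up.

```python
import math


def tf(freq):
    """Term frequency: more occurrences in the field, higher weight."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, doc_freq, num_docs, num_terms):
    # Simplified: leaves out queryNorm, coord and boosts.
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * field_norm(num_terms)


# Hypothetical numbers: the term appears once, in 10 of 1000 documents.
two_token_field = score(freq=1, doc_freq=10, num_docs=1000, num_terms=2)
three_token_field = score(freq=1, doc_freq=10, num_docs=1000, num_terms=3)
print(two_token_field > three_token_field)  # True
```

The field-length norm is the reason a short field containing the term tends to outrank a longer field containing the same term once.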

Proximity to a geolocation
Fuzzy similarity
...

_score
max_score

The simplest query

{ "query": { "match": { "customer": "John" } }}

{ "_score": 4.6189003, "customer": "John Upton" }, { "_score": 4.6189003, "customer": "John Borer" }, { "_score": 4.6189003, "customer": "John Emard" }, { "_score": 4.06103, "customer": "John Runolfsdottir IV" }, { "_score": 3.8275056, "customer": "Mr. John Cartwright III" }, { "_score": 3.8275056, "customer": "John Hodkiewicz DDS" }

GET http://localhost:9200/web/orders/_search

{ "query": { "match": { "customer": "Joh" } }}

{ "took": 15, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 0, "max_score": null, "hits": [] }}

GET http://localhost:9200/web/orders/_search

Elasticsearch persists data differently from what you are accustomed to.

To understand full-text search in Elasticsearch, you first need to understand how Elasticsearch persists your data.

It's called "Analysis"

It is a pipeline with the following steps:
Create the mapping ("schema") for the web index, if it does not exist yet.
Receive the document from the Index API.
Iterate over each field and check whether the field is analyzed.
If it is, run the analyzer for that field.
Persist the data.
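
The first step, creating the mapping, is where an analyzer gets attached to a field. Below is a sketch of an index-creation body written as a Python dictionary, using the char filter, tokenizer and token filter from the walkthrough that follows. The analyzer name `customer_analyzer` is hypothetical, and the `string` field type and the per-type mapping assume an Elasticsearch version from the era of this deck (newer versions use `text` and have no types).

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "customer_analyzer": {          # hypothetical name
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "orders": {                             # the type used on the slides
            "properties": {
                "customer": {"type": "string", "analyzer": "customer_analyzer"}
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

A body like this would be sent when creating the `web` index, before any document is indexed.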

The document
{ "customer": "<b>Dr.</b> Emiliano, the Mitchell Sr.", "items": [ { "id": 8510629, "value": 769 } ], "total": 2874.81, "created_at": "15/11/2015 17:00:57", "status": "delivered", "region": "sao_paulo", "shipping_fees": 719}

The field
"customer": "<b>Dr.</b> Emiliano, the Mitchell Sr."

Analyzer:
Char filter: html_strip
Tokenizer: whitespace
Token filter: lowercase

Char filter (html_strip)
"<b>Dr.</b> Emiliano, the Mitchell Sr."
becomes the string
"Dr. Emiliano, the Mitchell Sr."

Tokenizer (whitespace)
"Dr. Emiliano, the Mitchell Sr."
becomes the tokens
"Dr."
"Emiliano,"
"the"
"Mitchell"
"Sr."

Token filter (lowercase)
"Dr." -> "dr."
"Emiliano," -> "emiliano,"
"the" -> "the"
"Mitchell" -> "mitchell"
"Sr." -> "sr."

Persist it
Token        Doc ID
"dr."        1
"emiliano,"  1
"the"        1
"mitchell"   1
"sr."        1

TF/IDF (Term Frequency / Inverse Document Frequency) is what generates the score.
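
The analysis steps walked through above (char filter, tokenizer, token filter, persist) can be imitated in a few lines of Python. This is a rough stand-in, not the real implementation: the regular expression only approximates `html_strip`, and `analyze` and `persist` are hypothetical names.

```python
import re


def html_strip(text):
    """Rough stand-in for the html_strip char filter: drop anything tag-like."""
    return re.sub(r"<[^>]+>", "", text)


def analyze(text):
    stripped = html_strip(text)               # char filter
    tokens = stripped.split()                 # whitespace tokenizer
    return [token.lower() for token in tokens]  # lowercase token filter


def persist(index, doc_id, text):
    """Record every analyzed token together with the id of its document."""
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)


index = {}
persist(index, 1, "<b>Dr.</b> Emiliano, the Mitchell Sr.")

print(analyze("<b>Dr.</b> Emiliano, the Mitchell Sr."))
# ['dr.', 'emiliano,', 'the', 'mitchell', 'sr.']
print(sorted(index))
# ['dr.', 'emiliano,', 'mitchell', 'sr.', 'the']
```

Note that the whitespace tokenizer keeps punctuation, so the stored tokens are `dr.` and `emiliano,` rather than `dr` and `emiliano`.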

{ "query": { "match": { "customer": "Mitchell" } }}

GET http://localhost:9200/web/orders/_search

The term
"Mitchell"

Analyzer:
Char filter: html_strip
Tokenizer: whitespace
Token filter: lowercase

The analyzed term
"mitchell"

The analyzed term "mitchell" is looked up among the persisted tokens:
Token        Doc ID
"dr."        1
"emiliano,"  1
"the"        1
"mitchell"   1    <- matches "mitchell"
"sr."        1

What types of queries does Elasticsearch have?

Match
multi_match
...
{ "match": { "name": "John" }}
{ "multi_match": { "query": "John", "fields": ["name.raw", "name.autocomplete"]}}
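
A `match` query analyzes its text and then looks each resulting token up among the persisted tokens. Here is a sketch with the five tokens from the walkthrough hard-coded. The function name is hypothetical, and lowercasing plus whitespace splitting stands in for the analyzer.

```python
# Persisted tokens for document 1, as shown in the walkthrough.
index = {"dr.": [1], "emiliano,": [1], "the": [1], "mitchell": [1], "sr.": [1]}


def match(index, text):
    """Return the ids of documents containing any analyzed token of the text."""
    doc_ids = set()
    for token in text.lower().split():
        doc_ids.update(index.get(token, []))
    return sorted(doc_ids)


print(match(index, "Mitchell"))  # [1]
print(match(index, "Mitch"))     # []
print(match(index, "Emiliano"))  # []
```

"Mitch" finds nothing for the same reason "Joh" did earlier: tokens are compared whole, not by prefix. "Emiliano" finds nothing because the stored token is `emiliano,` with the comma attached, which shows how much the choice of tokenizer matters.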

What have we learned?
Understanding the analysis process is a must to understand how to search.
Analyzers are composed of char filters, tokenizers and token filters.
You need to understand how users will search in order to write the right query for Elasticsearch.

References
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
https://www.youtube.com/playlist?list=PLZ4puV97Zwm2fEmTLrPsP7QgLsjnnQggX
Official PHP elasticsearch package: https://github.com/elastic/elasticsearch-php
https://github.com/sleimanx2/plastic

Thanks!
@guilhermeguitte
http://www.guitte.org
