Slide 1

Slide 1 text

Unraveling Elasticsearch queries: How to create an "intelligent" search

Slide 2

Slide 2 text

Who am I?

@guilhermeguitte

● Leroy Merlin Brasil.
● Co-organizer of the Laravel Meetup in São Paulo.
● Software Developer.
● Scrum Master.

http://www.guitte.org

Slide 3

Slide 3 text

Before digging into Elasticsearch...

Slide 4

Slide 4 text

What is "elasticsearch"?

Slide 5

Slide 5 text

● Real-Time Data
● Real-Time Advanced Analytics
● Massively Distributed
● High Availability
● Multitenancy
● Full-Text Search
● Document-Oriented
● Schema-Free
● Developer-Friendly, RESTful API
● Per-Operation Persistence
● Built on top of Apache Lucene™

Slide 6

Slide 6 text

What is an "index"?

Slide 7

Slide 7 text

It's like a database in a traditional relational database system.

Slide 8

Slide 8 text

GET http://localhost:9200/web/orders/_search

index: "web" (the first path segment)

Slide 9

Slide 9 text

What is a "type"?

Slide 10

Slide 10 text

GET http://localhost:9200/web/orders/_search

type: "orders" (the second path segment)
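To make the URL anatomy concrete, here is a small Python sketch that splits a search URL into its index and type. The function name `split_search_url` is made up for illustration, and it only handles the `/{index}/{type}/_search` shape used on these slides.

```python
from urllib.parse import urlparse


def split_search_url(url):
    """Return (index, type) from a URL shaped like /{index}/{type}/_search."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) != 3 or parts[2] != "_search":
        raise ValueError("expected a path like /{index}/{type}/_search")
    return parts[0], parts[1]


index_name, type_name = split_search_url("http://localhost:9200/web/orders/_search")
print("index:", index_name)
print("type:", type_name)
```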

Slide 11

Slide 11 text

What is an "inverted index"?

Slide 12

Slide 12 text

It's like...

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

What did you learn?

● The basic jargon of Elasticsearch.
● What an index is.
● What a type is.
● What Elasticsearch is.

Slide 16

Slide 16 text

Now, with the basic jargon of Elasticsearch...

Slide 17

Slide 17 text

Be ready!

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Queries

Slide 20

Slide 20 text

Basic structure

Slide 21

Slide 21 text

GET http://localhost:9200/web/orders/_search

{
  "query": {}
}
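As a rough sketch of how this envelope is sent from code, the Python below prepares (but does not send) the HTTP request using only the standard library. The helper name `build_search_request` and the `match_all` query are assumptions added for illustration; the slide itself leaves the query empty.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:9200/web/orders/_search"


def build_search_request(query):
    """Wrap a query in the {"query": ...} envelope and prepare the HTTP request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="GET",
    )


request = build_search_request({"match_all": {}})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending it requires a running Elasticsearch node, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```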

Slide 22

Slide 22 text

Structured search

Slide 23

Slide 23 text

"Finding documents that exactly match the query"

Slide 24

Slide 24 text

The result will be "YES" or "NO".

Slide 25

Slide 25 text

SELECT * FROM orders WHERE status = 'received'

Slide 26

Slide 26 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "term": { "status": "received" }
  }
}

Response:

{
  "took": 24,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3041,
    "max_score": 2.4266458,
    "hits": [ ... ]
  }
}

Slide 27

Slide 27 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "status": "received" }
      }
    }
  }
}

Response:

{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3041,
    "max_score": 1,
    "hits": [ ... ]
  }
}

Slide 28

Slide 28 text

That is a simple start, but the real world is not that simple.

Slide 29

Slide 29 text

Get used to "bool" queries

Slide 30

Slide 30 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "bool": {
      "must": [],
      "should": [],
      "must_not": [],
      "filter": []
    }
  }
}

Slide 31

Slide 31 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "bool": {
      "must": {
        "term": { "status": "received" }
      }
    }
  }
}

Response:

{
  "took": 24,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3041,
    "max_score": 2.4266458,
    "hits": [ ... ]
  }
}

Slide 32

Slide 32 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": {
            "term": { "status": "received" }
          }
        }
      }
    }
  }
}

Response:

{
  "took": 12,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3041,
    "max_score": 1,
    "hits": [ ... ]
  }
}

Slide 33

Slide 33 text

"Bool" structure is very flexible

Slide 34

Slide 34 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "term": { "status": "received" } },
            {
              "bool": {
                "must": {
                  "term": { "customer": "Prof. Shaylee Greenholt" }
                }
              }
            }
          ]
        }
      }
    }
  }
}

Response:

{
  "took": 12,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3041,
    "max_score": 1,
    "hits": [ ... ]
  }
}
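To show how the nested clauses combine, here is a toy Python evaluator for a tiny subset of the query DSL ("term" and "bool" only), run against in-memory dictionaries. It is a sketch under simplifying assumptions, not how Elasticsearch evaluates queries: "term" is plain equality on the raw field value, "should" clauses are required only when there is no "must" or "filter" clause, and "minimum_should_match" is not modeled. The sample orders other than the customer name from the slide are invented.

```python
def as_list(value):
    """Accept either a single clause (a dict) or a list of clauses."""
    return value if isinstance(value, list) else [value]


def matches(query, doc):
    """Return True if `doc` satisfies a query made of 'term' and 'bool' clauses."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        field, value = next(iter(body.items()))
        return doc.get(field) == value
    if kind == "bool":
        must = as_list(body.get("must", [])) + as_list(body.get("filter", []))
        should = as_list(body.get("should", []))
        must_not = as_list(body.get("must_not", []))
        if not all(matches(clause, doc) for clause in must):
            return False
        if any(matches(clause, doc) for clause in must_not):
            return False
        # Assumption: with no must/filter clauses, one "should" clause must match.
        if should and not must:
            return any(matches(clause, doc) for clause in should)
        return True
    raise ValueError("unsupported query type: " + kind)


# Hypothetical sample data for illustration only.
orders = [
    {"id": 1, "status": "received", "customer": "Dolly Upton"},
    {"id": 2, "status": "delivered", "customer": "Prof. Shaylee Greenholt"},
    {"id": 3, "status": "delivered", "customer": "John Borer"},
]

query = {
    "bool": {
        "should": [
            {"term": {"status": "received"}},
            {"bool": {"must": {"term": {"customer": "Prof. Shaylee Greenholt"}}}},
        ]
    }
}

matching_ids = [order["id"] for order in orders if matches(query, order)]
print(matching_ids)  # [1, 2]
```

Order 1 matches through the first "should" clause and order 2 through the nested "bool", which is the "received OR this customer" logic of the slide.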

Slide 35

Slide 35 text

What types of queries does Elasticsearch have?

Slide 36

Slide 36 text

● Term: { "term": { "status": "received" }}
● Terms: { "terms": { "status": ["received", "delivering"] }}
● Range: { "range": { "total": { "gte": 100.5, "lte": 140.5 }}}
● Exists: { "exists": { "field": "region" }}
● Missing: { "missing": { "field": "region" }}
● Prefix: { "prefix": { "customer": "Dolly" }}
● Wildcard: { "wildcard": { "customer": "Doll*" }}
● Regexp: { "regexp": { "customer": { "value": "Doll.*" }}}
● Fuzzy: { "fuzzy": { "customer": { "value": "Doll", "fuzziness": 2 }}}
● Type: { "type": { "value": "orders" }}
● Ids: { "ids": { "type": "orders", "values": ["1", "2"] }}
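The prefix, wildcard and regexp queries are easy to confuse, so here is a small Python sketch that mimics them over a hypothetical list of indexed terms. It uses Python's `fnmatch` and `re` modules as stand-ins, which is an approximation: Lucene's regular expression syntax differs from Python's in places, although both anchor the pattern to the whole term for simple patterns like these. Note that in a regular expression `*` repeats the previous character, so `Doll*` and `Doll.*` match different terms.

```python
import fnmatch
import re

# Hypothetical terms, as they might appear in a non-analyzed "customer" field.
terms = ["Dol", "Doll", "Dolly", "Dolores"]


def prefix(value):
    """Terms that start with `value`."""
    return [term for term in terms if term.startswith(value)]


def wildcard(pattern):
    """Terms matching a shell-style pattern, where * means any characters."""
    return [term for term in terms if fnmatch.fnmatchcase(term, pattern)]


def regexp(pattern):
    """Terms matched in full by a regular expression."""
    return [term for term in terms if re.fullmatch(pattern, term)]


print(prefix("Doll"))     # ['Doll', 'Dolly']
print(wildcard("Doll*"))  # ['Doll', 'Dolly']
print(regexp("Doll*"))    # ['Dol', 'Doll']
print(regexp("Doll.*"))   # ['Doll', 'Dolly']
```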

Slide 37

Slide 37 text

Any of these queries can contribute to the score if you want, but when they are wrapped in "constant_score", relevance is not calculated and every hit gets the same constant score.

Slide 38

Slide 38 text

Structured queries are good for:

● Filtering documents before running the queries that score your documents.
● Speed: Elasticsearch can cache them and reuse them over time.
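One way to combine the two ideas is a "bool" query whose "filter" clause narrows the documents without scoring and whose "must" clause does the scoring. The Python sketch below only builds the request body; the helper name `filtered_full_text` is hypothetical, and the "match" query it uses is a full-text query covered later in this deck.

```python
import json


def filtered_full_text(field, text, filters):
    """Build a body that filters first (no scoring) and scores with a match query."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],  # contributes to _score
                "filter": filters,                   # yes/no only, cacheable
            }
        }
    }


body = filtered_full_text("customer", "John", [{"term": {"status": "received"}}])
print(json.dumps(body, indent=2))
```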

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

What did we learn?

● Structured search.
● Documents either match or they do not (boolean match).
● Use bool queries.
● Filter documents before running a full-text search.

Slide 41

Slide 41 text

Full-text search

Slide 42

Slide 42 text

Two different things about full-text search

Slide 43

Slide 43 text

Relevance

Slide 44

Slide 44 text

"How well each document matches the query"

Slide 45

Slide 45 text

TF/IDF (Term Frequency / Inverse Document Frequency)
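For intuition, here is a simplified Python sketch of a TF/IDF score for a single term, loosely following the formulas of Lucene's classic similarity: tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), and a field-length norm of 1 / sqrt(number of terms). It leaves out query normalization, the coordination factor, boosts and the lossy encoding of norms, so the numbers will not match a real `_score`. The sample documents are hypothetical.

```python
import math


def tf(term_freq):
    """Term frequency: more occurrences in the field raise the score."""
    return math.sqrt(term_freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: a match in a shorter field weighs more."""
    return 1.0 / math.sqrt(num_terms)


def score(term, doc_tokens, all_docs):
    """Simplified single-term score: tf * idf^2 * norm."""
    term_freq = doc_tokens.count(term)
    if term_freq == 0:
        return 0.0
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    return tf(term_freq) * idf(len(all_docs), doc_freq) ** 2 * field_norm(len(doc_tokens))


# Hypothetical, already-analyzed "customer" fields.
docs = [
    ["john", "upton"],
    ["john", "runolfsdottir", "iv"],
    ["dolly", "borer"],
]

for tokens in docs:
    print(" ".join(tokens), round(score("john", tokens, docs), 4))
```

In this sketch the two-token name scores higher than the three-token name because of the field-length norm, and the document without the term scores zero.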

Slide 46

Slide 46 text

Proximity to a geolocation
Fuzzy similarity
...

Slide 47

Slide 47 text

_score max_score

Slide 48

Slide 48 text

The simplest query

Slide 49

Slide 49 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "match": { "customer": "John" }
  }
}

Hits:

{ "_score": 4.6189003, "customer": "John Upton" },
{ "_score": 4.6189003, "customer": "John Borer" },
{ "_score": 4.6189003, "customer": "John Emard" },
{ "_score": 4.06103, "customer": "John Runolfsdottir IV" },
{ "_score": 3.8275056, "customer": "Mr. John Cartwright III" },
{ "_score": 3.8275056, "customer": "John Hodkiewicz DDS" }

Slide 50

Slide 50 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "match": { "customer": "Joh" }
  }
}

Response:

{
  "took": 15,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

Elasticsearch persists data in a different way from what you are accustomed to.

Slide 53

Slide 53 text

To understand full-text search in Elasticsearch, you first need to understand how Elasticsearch persists your data.

Slide 54

Slide 54 text

It's called "Analysis"

Slide 55

Slide 55 text

It is a pipeline that begins with:

Slide 56

Slide 56 text

It is a pipeline that begins with:

● Create the mapping ("schema") for the web index, if it does not exist yet.
● Receive the document from the Index API.
● Iterate over each field and check whether it is analyzed.
● Run the analyzer for each analyzed field.
● Persist the data.
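As a sketch of the first step, the Python below builds a body that could be sent with `PUT http://localhost:9200/web` to create the index with a custom analyzer and a mapping for the orders type. The analyzer name `customer_analyzer` is made up, and its components (whitespace tokenizer, lowercase token filter, html_strip char filter) are the ones used on the following slides. The mapping assumes the typed, `"string"`-field syntax of the Elasticsearch 2.x era that these slides appear to target; later versions use `"text"` and drop mapping types.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical analyzer name, chosen for this example.
                "customer_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "orders": {
            "properties": {
                "customer": {
                    "type": "string",  # assumption: 2.x syntax; "text" in 5.0+
                    "analyzer": "customer_analyzer",
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```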

Slide 57

Slide 57 text

The document

{
  "customer": "Dr. Emiliano, the Mitchell Sr.",
  "items": [
    { "id": 8510629, "value": 769 }
  ],
  "total": 2874.81,
  "created_at": "15/11/2015 17:00:57",
  "status": "delivered",
  "region": "sao_paulo",
  "shipping_fees": 719
}

Slide 58

Slide 58 text

The document

"customer": "Dr. Emiliano, the Mitchell Sr."

Slide 59

Slide 59 text

The document

"Dr. Emiliano, the Mitchell Sr."

Analyzer:
  Tokenizer: whitespace
  Token filter: lowercase
  Char filter: html_strip

Slide 60

Slide 60 text

Step: Char filter

Analyzer:
  Tokenizer: whitespace
  Token filter: lowercase
  Char filter: html_strip

Input: "Dr. Emiliano, the Mitchell Sr."
Output (string): "Dr. Emiliano, the Mitchell Sr."

Slide 61

Slide 61 text

Step: Tokenizer

Analyzer:
  Tokenizer: whitespace
  Token filter: lowercase
  Char filter: html_strip

Input: "Dr. Emiliano, the Mitchell Sr."
Output (tokens): "Dr." "Emiliano," "the" "Mitchell" "Sr."

Slide 62

Slide 62 text

Step: Token filter

Analyzer:
  Tokenizer: whitespace
  Token filter: lowercase
  Char filter: html_strip

Input: "Dr." "Emiliano," "the" "Mitchell" "Sr."
Output: "dr." "emiliano," "the" "mitchell" "sr."
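The three steps above can be approximated in a few lines of Python. This is a toy model, not the real implementation: the HTML stripping is a rough regular-expression stand-in for the html_strip char filter, and all function names are invented for this example.

```python
import html
import re


def char_filter_html_strip(text):
    """Rough stand-in for html_strip: drop tags and decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]*>", "", text))


def tokenize_whitespace(text):
    """Whitespace tokenizer: split on whitespace, keep punctuation attached."""
    return text.split()


def token_filter_lowercase(tokens):
    """Lowercase token filter."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the pipeline in order: char filter, tokenizer, token filter."""
    return token_filter_lowercase(tokenize_whitespace(char_filter_html_strip(text)))


tokens = analyze("Dr. Emiliano, the Mitchell Sr.")
print(tokens)  # ['dr.', 'emiliano,', 'the', 'mitchell', 'sr.']
```

Note that the whitespace tokenizer leaves punctuation attached, so the stored token is "emiliano," with the comma.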

Slide 63

Slide 63 text

Persist it

Token         Doc ID
"dr."         1
"emiliano,"   1
"the"         1
"mitchell"    1
"sr."         1

Slide 64

Slide 64 text

Persist it

Token         Doc ID
"dr."         1
"emiliano,"   1
"the"         1
"mitchell"    1
"sr."         1

TF/IDF (Term Frequency / Inverse Document Frequency) is what generates the score.

Slide 65

Slide 65 text

GET http://localhost:9200/web/orders/_search

{
  "query": {
    "match": { "customer": "Mitchell" }
  }
}

Slide 66

Slide 66 text

The term

"Mitchell"

Analyzer:
  Tokenizer: whitespace
  Token filter: lowercase
  Char filter: html_strip

Slide 69

Slide 69 text

The term

"Mitchell"

Analyzer:
  Tokenizer: whitespace
  Token filter: lowercase
  Char filter: html_strip

Result: "mitchell"

Slide 70

Slide 70 text

Persist it

Token         Doc ID
"dr."         1
"emiliano,"   1
"the"         1
"mitchell"    1
"sr."         1

Lookup: "mitchell"
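The lookup can be sketched as a toy inverted index in Python. It is a simplified illustration, not Elasticsearch's storage format: the analyzer is only a whitespace split plus lowercasing (HTML stripping omitted), the second document is hypothetical, and `match` here returns documents containing at least one analyzed query token, with no scoring.

```python
from collections import defaultdict


def analyze(text):
    """Whitespace tokenizer plus lowercase token filter."""
    return [token.lower() for token in text.split()]


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def match(index, query_text):
    """Analyze the query the same way, then look each token up in the index."""
    hits = set()
    for token in analyze(query_text):
        hits |= index.get(token, set())
    return sorted(hits)


docs = {
    1: "Dr. Emiliano, the Mitchell Sr.",
    2: "John Upton",  # hypothetical second document
}
index = build_inverted_index(docs)

print(match(index, "Mitchell"))  # [1]
print(match(index, "Joh"))       # []
print(match(index, "Emiliano"))  # []
```

"Mitchell" is found because the query is lowercased to the stored token "mitchell". "Joh" finds nothing because the index holds whole tokens such as "john", not prefixes, and "Emiliano" finds nothing because the stored token is "emiliano," with the comma.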

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

What types of queries does Elasticsearch have?

Slide 75

Slide 75 text

● match: { "match": { "name": "John" }}
● multi_match: { "multi_match": { "query": "John", "fields": ["name.raw", "name.autocomplete"] }}
● ...

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

What did we learn?

● Understanding the analysis process is a must to understand how to search.
● Analyzers are composed of char filters, tokenizers and token filters.
● You need to understand how users will search in order to write the right Elasticsearch query.

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

References

● https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
● https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
● https://www.youtube.com/playlist?list=PLZ4puV97Zwm2fEmTLrPsP7QgLsjnnQggX
● Official PHP Elasticsearch package: https://github.com/elastic/elasticsearch-php
● https://github.com/sleimanx2/plastic

Slide 80

Slide 80 text

Thanks!

@guilhermeguitte
http://www.guitte.org