Elasticsearch Workshop - Speaker Deck

Slide 1

Slide 1 text

ELASTICSEARCH WORKSHOP Felipe Dornelas

Slide 2

Slide 2 text

AGENDA ▫︎Part 1 ▫︎Introduction ▫︎Document Store ▫︎Search Examples ▫︎Data Resiliency ▫︎Comparison with Solr ▫︎Part 2 ▫︎Search ▫︎Analytics 2

Slide 3

Slide 3 text

AGENDA ▫︎Part 3 ▫︎Inverted Index ▫︎Analyzers ▫︎Mapping ▫︎Proximity Matching ▫︎Fuzzy Matching ▫︎Part 4 ▫︎Inside a Cluster ▫︎Data Modeling 3

Slide 4

Slide 4 text

→ github.com/felipead/elasticsearch-workshop

Slide 5

Slide 5 text

PRE-REQUISITES ▫︎Vagrant ▫︎VirtualBox ▫︎Git 5

Slide 6

Slide 6 text

ENVIRONMENT SETUP ▫︎git clone https://github.com/felipead/elasticsearch-workshop.git ▫︎vagrant up ▫︎vagrant ssh ▫︎cd /vagrant

Slide 7

Slide 7 text

VERIFY EVERYTHING IS WORKING ▫︎curl http://localhost:9200 7

Slide 8

Slide 8 text

PART 1 Core concepts 8

Slide 9

Slide 9 text

1-1 INTRODUCTION You know, for search 9

Slide 10

Slide 10 text

WHAT IS ELASTICSEARCH? A real-time distributed search and analytics engine 10

Slide 11

Slide 11 text

IT CAN BE USED FOR ▫︎Full-text search ▫︎Structured search ▫︎Real-time analytics ▫︎…or any combination of the above 11

Slide 12

Slide 12 text

FEATURES ▫︎Distributed document store: ▫︎RESTful API ▫︎Automatic scale ▫︎Plug & Play ™ 12

Slide 13

Slide 13 text

FEATURES ▫︎Handles the human language: ▫︎Score results by relevance ▫︎Synonyms ▫︎Typos and misspellings ▫︎Internationalization 13

Slide 14

Slide 14 text

FEATURES ▫︎Powerful analytics: ▫︎Comprehensive aggregations ▫︎Geolocations ▫︎Can be combined with search ▫︎Real-time (no batch-processing) 14

Slide 15

Slide 15 text

FEATURES ▫︎Free and open source ▫︎Community support ▫︎Backed by Elastic 15

Slide 16

Slide 16 text

MOTIVATION Most databases are inept at extracting knowledge from your data 16

Slide 17

Slide 17 text

SQL DATABASES SQL = Structured Query Language 17

Slide 18

Slide 18 text

SQL DATABASES ▫︎Can only filter by exact values ▫︎Unable to perform full-text search ▫︎Queries can be complex and inefficient ▫︎Often requires big-batch processing

Slide 19

Slide 19 text

APACHE LUCENE ▫︎Arguably, the best search engine ▫︎High performance ▫︎Near real-time indexing ▫︎Open source 19

Slide 20

Slide 20 text

APACHE LUCENE ▫︎But… ▫︎It’s just a Java Library ▫︎Hard to use 20

Slide 21

Slide 21 text

ELASTICSEARCH ▫︎Document Store ▫︎Distributed ▫︎Scalable ▫︎Real Time ▫︎Analytics ▫︎RESTful API ▫︎Easy to Use 21

Slide 22

Slide 22 text

DOCUMENT ORIENTED ▫︎Documents instead of rows / columns ▫︎Every field is indexed and searchable ▫︎Serialized to JSON ▫︎Schemaless

Slide 23

Slide 23 text

WHO USES ▫︎GitHub ▫︎Wikipedia ▫︎Stack Overflow ▫︎The Guardian

Slide 24

Slide 24 text

TALKING TO ELASTICSEARCH ▫︎Java API ▫︎Port 9300 ▫︎Native transport protocol ▫︎Node client (joins the cluster) ▫︎Transport client (doesn't join the cluster) 24

Slide 25

Slide 25 text

TALKING TO ELASTICSEARCH ▫︎RESTful API ▫︎Port 9200 ▫︎JSON over HTTP 25

Slide 26

Slide 26 text

TALKING TO ELASTICSEARCH We will only cover the RESTful API 26

Slide 27

Slide 27 text

USING CURL curl -X [VERB] -d {body} {url} or curl -X [VERB] -d @{file} {url}

Slide 28

Slide 28 text

THE EMPTY QUERY curl -X GET -d @part-1/empty-query.json localhost:9200/_count?pretty 28

Slide 29

Slide 29 text

REQUEST { "query": { "match_all": {} } } 29

Slide 30

Slide 30 text

RESPONSE { "count": 0, "_shards": { "total": 0, "successful": 0, "failed": 0 } } 30

Slide 31

Slide 31 text

1-2 DOCUMENT STORE 31

Slide 32

Slide 32 text

THE PROBLEM WITH RELATIONAL DATABASES ▫︎Stores data in columns and rows ▫︎Equivalent of using a spreadsheet ▫︎Inflexible storage medium ▫︎Not suitable for rich objects

Slide 33

Slide 33 text

DOCUMENTS { "name": "John Smith", "age": 42, "confirmed": true, "join_date": "2015-06-01", "home": {"lat": 51.5, "lon": 0.1}, "accounts": [ {"type": "facebook", "id": "johnsmith"}, {"type": "twitter", "id": "johnsmith"} ] } 33

Slide 34

Slide 34 text

DOCUMENT METADATA ▫︎Index - Where the document lives ▫︎Type - Class of object that the document represents ▫︎Id - Unique identifier for the document

Slide 35

Slide 35 text

DOCUMENT METADATA
Relational DB   Databases   Tables   Rows        Columns
Elasticsearch   Indices     Types    Documents   Fields

Slide 36

Slide 36 text

RESTFUL API [VERB] /{index}/{type}/{id}?pretty GET | POST | PUT | DELETE | HEAD 36

Slide 37

Slide 37 text

RESTFUL API ▫︎JSON-only ▫︎Adding pretty to the query-string parameters pretty-prints the response 37

Slide 38

Slide 38 text

INDEXING A DOCUMENT WITH YOUR OWN ID PUT /{index}/{type}/{id} 38

Slide 39

Slide 39 text

INDEXING A DOCUMENT WITH YOUR OWN ID curl -X PUT -d @part-1/first-blog-post.json localhost:9200/blog/post/123?pretty 39

Slide 40

Slide 40 text

REQUEST { "title": "My first blog post", "text": "Just trying this out...", "date": "2014-01-01" } 40

Slide 41

Slide 41 text

RESPONSE { "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 1, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : true } 41
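For anyone scripting this instead of typing curl, the PUT above can be prepared with Python's standard library. This is a minimal sketch: the helper name `build_index_request` is hypothetical, the Content-Type header is an addition the slides do not show, and the request is only built, not sent, so it runs without a cluster.

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document, host="localhost:9200"):
    """Build (but do not send) a PUT /{index}/{type}/{id} request."""
    url = f"http://{host}/{index}/{doc_type}/{doc_id}?pretty"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


post = {
    "title": "My first blog post",
    "text": "Just trying this out...",
    "date": "2014-01-01",
}
request = build_index_request("blog", "post", 123, post)
print(request.get_method(), request.full_url)

# To send it to a running node, uncomment:
# print(urllib.request.urlopen(request).read().decode("utf-8"))
```

Swapping `method="PUT"` for GET, HEAD or DELETE covers the other verbs used in this part.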

Slide 42

Slide 42 text

INDEXING A DOCUMENT WITH AUTOGENERATED ID POST /{index}/{type} * Autogenerated IDs are Base64-encoded UUIDs 42

Slide 43

Slide 43 text

INDEXING A DOCUMENT WITH AUTOGENERATED ID curl -X POST -d @part-1/second-blog-post.json localhost:9200/blog/post?pretty 43

Slide 44

Slide 44 text

REQUEST { "title": "Second blog post", "text": "Still trying this out...", "date": "2014-01-01" } 44

Slide 45

Slide 45 text

RESPONSE { "_index" : "blog", "_type" : "post", "_id" : "AVFWIbMf7YZ6Se7RwMws", "_version" : 1, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : true } 45

Slide 46

Slide 46 text

RETRIEVING A DOCUMENT WITH METADATA GET /{index}/{type}/{id} 46

Slide 47

Slide 47 text

RETRIEVING A DOCUMENT WITH METADATA curl -X GET localhost:9200/blog/post/123?pretty 47

Slide 48

Slide 48 text

RESPONSE { "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 1, "found" : true, "_source": { "title": "My first blog post", "text": "Just trying this out...", "date": "2014-01-01" } }

Slide 49

Slide 49 text

RETRIEVING A DOCUMENT WITHOUT METADATA GET /{index}/{type}/{id}/_source 49

Slide 50

Slide 50 text

RETRIEVING A DOCUMENT WITHOUT METADATA curl -X GET localhost:9200/blog/post/123/_source?pretty

Slide 51

Slide 51 text

RESPONSE { "title": "My first blog post", "text": "Just trying this out...", "date": "2014-01-01" }

Slide 52

Slide 52 text

RETRIEVING PART OF A DOCUMENT GET /{index}/{type}/{id}?_source={fields}

Slide 53

Slide 53 text

RETRIEVING PART OF A DOCUMENT curl -X GET 'localhost:9200/blog/post/123?_source=title,date&pretty'

Slide 54

Slide 54 text

RESPONSE { "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 1, "found" : true, "_source": { "title": "My first blog post", "date": "2014-01-01" } }

Slide 55

Slide 55 text

CHECKING WHETHER A DOCUMENT EXISTS HEAD /{index}/{type}/{id} 55

Slide 56

Slide 56 text

CHECKING WHETHER A DOCUMENT EXISTS curl -i -X HEAD localhost:9200/blog/post/123

Slide 57

Slide 57 text

RESPONSE HTTP/1.1 200 OK Content-Length: 0 57

Slide 58

Slide 58 text

CHECKING WHETHER A DOCUMENT EXISTS curl -i -X HEAD localhost:9200/blog/post/666

Slide 59

Slide 59 text

RESPONSE HTTP/1.1 404 Not Found Content-Length: 0 59

Slide 60

Slide 60 text

UPDATING A WHOLE DOCUMENT PUT /{index}/{type}/{id} 60

Slide 61

Slide 61 text

UPDATING A WHOLE DOCUMENT curl -X PUT -d @part-1/updated-blog-post.json localhost:9200/blog/post/123?pretty 61

Slide 62

Slide 62 text

REQUEST { "title": "My first blog post", "text": "I am starting to get the hang of this...", "date": "2014-01-02" } 62

Slide 63

Slide 63 text

RESPONSE { "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 2, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : false } 63

Slide 64

Slide 64 text

DELETING A DOCUMENT DELETE /{index}/{type}/{id} 64

Slide 65

Slide 65 text

DELETING A DOCUMENT curl -X DELETE localhost:9200/blog/post/123?pretty 65

Slide 66

Slide 66 text

RESPONSE { "found" : true, "_index" : "blog", "_type" : "post", "_id" : "123", "_version" : 3, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 } } 66

Slide 67

Slide 67 text

DEALING WITH CONFLICTS 67

Slide 68

Slide 68 text

PESSIMISTIC CONCURRENCY CONTROL ▫︎Used by relational databases ▫︎Assumes conflicts are likely to happen (pessimist) ▫︎Blocks access to resources

Slide 69

Slide 69 text

OPTIMISTIC CONCURRENCY CONTROL ▫︎Assumes conflicts are unlikely to happen (optimist) ▫︎Does not block operations ▫︎If a conflict happens, the update fails

Slide 70

Slide 70 text

HOW ELASTICSEARCH DEALS WITH CONFLICTS ▫︎Locking distributed resources would be very inefficient ▫︎Uses Optimistic Concurrency Control ▫︎Auto-increments _version number

Slide 71

Slide 71 text

HOW ELASTICSEARCH DEALS WITH CONFLICTS ▫︎PUT /blog/post/123?version=1 ▫︎If the version is outdated, returns 409 Conflict
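The version check can be illustrated without a cluster. The sketch below is a toy in-memory store, not Elasticsearch code: `VersionedStore` and `VersionConflict` are hypothetical names, and the exception stands in for the 409 Conflict response.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 Conflict response."""


class VersionedStore:
    """Toy document store with optimistic concurrency control."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def put(self, doc_id, source, version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # Reject the write if the caller's expected version is stale.
        if version is not None and version != current_version:
            raise VersionConflict(
                f"expected version {version}, current is {current_version}"
            )
        new_version = current_version + 1  # auto-increment on every write
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
v1 = store.put("123", {"title": "My first blog post"})              # creates version 1
v2 = store.put("123", {"title": "Updated by client A"}, version=1)  # accepted, now 2
try:
    store.put("123", {"title": "Updated by client B"}, version=1)   # stale version
    conflict = False
except VersionConflict:
    conflict = True
print(v1, v2, conflict)
```

Nothing is locked: client B simply learns its write lost and can re-read the document and retry.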

Slide 72

Slide 72 text

1-3 SEARCH EXAMPLES 72

Slide 73

Slide 73 text

EMPLOYEE DIRECTORY EXAMPLE ▫︎Index: megacorp ▫︎Type: employee ▫︎Ex: John Smith, Jane Smith, Douglas Fir 73

Slide 74

Slide 74 text

EMPLOYEE DIRECTORY EXAMPLE curl -X PUT -d @part-1/john-smith.json localhost:9200/megacorp/employee/1 74

Slide 75

Slide 75 text

REQUEST { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": ["sports", "music"] } 75

Slide 76

Slide 76 text

EMPLOYEE DIRECTORY EXAMPLE curl -X PUT -d @part-1/jane-smith.json localhost:9200/megacorp/employee/2 76

Slide 77

Slide 77 text

REQUEST { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": ["music"] } 77

Slide 78

Slide 78 text

EMPLOYEE DIRECTORY EXAMPLE curl -X PUT -d @part-1/douglas-fir.json localhost:9200/megacorp/employee/3 78

Slide 79

Slide 79 text

REQUEST { "first_name": "Douglas", "last_name": "Fir", "age": 35, "about": "I like to build cabinets", "interests": ["forestry"] } 79

Slide 80

Slide 80 text

SEARCHES ALL EMPLOYEES GET /megacorp/employee/_search 80

Slide 81

Slide 81 text

SEARCHES ALL EMPLOYEES curl -X GET localhost:9200/megacorp/employee/_search?pretty

Slide 82

Slide 82 text

SEARCH WITH QUERY-STRING GET /megacorp/employee/_search?q=last_name:Smith

Slide 83

Slide 83 text

SEARCH WITH QUERY-STRING curl -X GET 'localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty'

Slide 84

Slide 84 text

RESPONSE "hits" : { "total" : 2, "max_score" : 0.30685282, "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", … } }, { … "_score" : 0.30685282, "_source": { "first_name": "John", "last_name": "Smith", … } } ] } 84

Slide 85

Slide 85 text

SEARCH WITH QUERY DSL curl -X GET -d @part-1/last-name-query.json localhost:9200/megacorp/employee/_search?pretty

Slide 86

Slide 86 text

REQUEST { "query": { "match": { "last_name": "Smith" } } } 86

Slide 87

Slide 87 text

Slide 88

Slide 88 text

SEARCH WITH QUERY DSL AND FILTER curl -X GET -d @part-1/last-name-age-query.json localhost:9200/megacorp/employee/_search?pretty

Slide 89

Slide 89 text

REQUEST { "query": { "filtered": { "filter": { "range": { "age": { "gt": 30 } } }, "query": { "match": { "last_name": "Smith" } } } } }

Slide 90

Slide 90 text

RESPONSE "hits" : { "total" : 1, "max_score" : 0.30685282, "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, … } } ] }

Slide 91

Slide 91 text

FULL-TEXT SEARCH curl -X GET -d @part-1/full-text-search.json localhost:9200/megacorp/employee/_search?pretty

Slide 92

Slide 92 text

REQUEST { "query": { "match": { "about": "rock climbing" } } } 92

Slide 93

Slide 93 text

RESPONSE "hits" : [{ … "_score" : 0.16273327, "_source": { "first_name": "John", "last_name": "Smith", "about": "I love to go rock climbing", … } }, { … "_score" : 0.016878016, "_source": { "first_name": "Jane", "last_name": "Smith", "about": "I like to collect rock albums", … } }] 93

Slide 94

Slide 94 text

RELEVANCE SCORES ▫︎The _score field ranks search results ▫︎The higher the score, the better

Slide 95

Slide 95 text

PHRASE SEARCH curl -X GET -d @part-1/phrase-search.json localhost:9200/megacorp/employee/_search?pretty

Slide 96

Slide 96 text

REQUEST { "query": { "match_phrase": { "about": "rock climbing" } } } 96

Slide 97

Slide 97 text

RESPONSE "hits" : { "total" : 1, "max_score" : 0.23013961, "hits" : [ { … "_score" : 0.23013961, "_source": { "first_name": "John", "last_name": "Smith", "about": "I love to go rock climbing" … } } ] } 97

Slide 98

Slide 98 text

1-4 DATA RESILIENCY 98

Slide 99

Slide 99 text

CALL ME MAYBE ▫︎Jepsen Tests ▫︎Simulates network partition scenarios ▫︎Run several operations against a distributed system ▫︎Verify that the history of those operations makes sense 99

Slide 100

Slide 100 text

NETWORK PARTITION 100

Slide 101

Slide 101 text

ELASTICSEARCH STATUS ▫︎Risk of data loss on network partition and split-brain scenarios 101

Slide 102

Slide 102 text

IT IS NOT SO BAD… ▫︎Still much more resilient than MongoDB ▫︎Elastic is working hard to improve it ▫︎Two-phase commits are planned 102

Slide 103

Slide 103 text

IF YOU REALLY CARE ABOUT YOUR DATA ▫︎Use a more reliable primary data store: ▫︎Cassandra ▫︎Postgres ▫︎Synchronize it to Elasticsearch ▫︎…or set-up comprehensive back-up 103

Slide 104

Slide 104 text

There’s no such thing as a 100% reliable distributed system 104

Slide 105

Slide 105 text

1-5 SOLR COMPARISON 105

Slide 106

Slide 106 text

SOLR ▫︎SolrCloud ▫︎Both: ▫︎Are open-source and mature ▫︎Are based on Apache Lucene ▫︎Have more or less similar features 106

Slide 107

Slide 107 text

SOLR API ▫︎HTTP GET ▫︎Query parameters passed in as URL parameters ▫︎Is not RESTful ▫︎Multiple formats (JSON, XML…) 107

Slide 108

Slide 108 text

SOLR API ▫︎Version 4.4 added Schemaless API ▫︎Older versions require up-front Schema 108

Slide 109

Slide 109 text

ELASTICSEARCH API ▫︎RESTful ▫︎Schemaless ▫︎CRUD document operations ▫︎Manage indices, read metrics, etc… 109

Slide 110

Slide 110 text

ELASTICSEARCH API ▫︎Query DSL ▫︎Better readability ▫︎JSON-only 110

Slide 111

Slide 111 text

SEARCH ▫︎Both are very good with text search ▫︎Both based on Apache Lucene 111

Slide 112

Slide 112 text

EASE OF USE ▫︎Elasticsearch is simpler: ▫︎Just a single process ▫︎Easier API ▫︎SolrCloud requires Apache ZooKeeper

Slide 113

Slide 113 text

SOLRCLOUD DATA RESILIENCY ▫︎SolrCloud uses Apache ZooKeeper to discover nodes ▫︎Better at preventing split-brain conditions ▫︎Jepsen Tests pass 113

Slide 114

Slide 114 text

ANALYTICS ▫︎Elasticsearch is the choice for analytics: ▫︎Comprehensive aggregations ▫︎Thousands of metrics ▫︎SolrCloud is not even close 114

Slide 115

Slide 115 text

PART 2 Search and Analytics 115

Slide 116

Slide 116 text

2-1 SEARCH Finding the needle in the haystack 116

Slide 117

Slide 117 text

TWEETS EXAMPLE ▫︎/{index}/user ▫︎/{index}/tweet

Slide 118

Slide 118 text

TWEETS EXAMPLE /us/user/1 { "email": "[email protected]", "name": "John Smith", "username": "@john" } 118

Slide 119

Slide 119 text

TWEETS EXAMPLE /gb/user/2 { "email": "[email protected]", "name": "Mary Jones", "username": "@mary" } 119

Slide 120

Slide 120 text

TWEET EXAMPLE /gb/tweet/3 { "date": "2014-09-13", "name": "Mary Jones", "tweet": "Elasticsearch means full text search has never been so easy", "user_id": 2 } 120

Slide 121

Slide 121 text

TWEETS EXAMPLE ./part-2/load-tweet-data.sh 121

Slide 122

Slide 122 text

THE EMPTY SEARCH GET /_search ▫︎Returns all documents on all indices

Slide 123

Slide 123 text

THE EMPTY SEARCH curl -X GET localhost:9200/_search?pretty 123

Slide 124

Slide 124 text

THE EMPTY SEARCH "hits" : { "total" : 14, "hits" : [ { "_index": "us", "_type": "tweet", "_id": "7", "_score": 1, "_source": { "date": "2014-09-17", "name": "John Smith", "tweet": "The Query DSL is really powerful and flexible", "user_id": 2 } }, … 9 RESULTS REMOVED … ] } 124

Slide 125

Slide 125 text

MULTI-INDEX, MULTITYPE SEARCH ▫︎/_search ▫︎/gb/_search ▫︎/gb,us/_search ▫︎/gb/user/_search ▫︎/_all/user,tweet/_search 125

Slide 126

Slide 126 text

PAGINATION ▫︎Returns 10 results per request (default) ▫︎Control parameters: ▫︎size: number of results to return ▫︎from: number of results to skip 126

Slide 127

Slide 127 text

PAGINATION ▫︎GET /_search?size=5 ▫︎GET /_search?size=5&from=5 ▫︎GET /_search?size=5&from=10 127
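The three requests above follow a simple rule: `from` is the page index times the page size. A small sketch, where `page_params` is a hypothetical helper and pages are counted from 1:

```python
def page_params(page, size=10):
    """Return the `from` and `size` query parameters for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


for page in (1, 2, 3):
    params = page_params(page, size=5)
    print(f"GET /_search?size={params['size']}&from={params['from']}")
```

The first request prints `from=0` explicitly, which is equivalent to omitting the parameter as the slide does.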

Slide 128

Slide 128 text

TYPES OF SEARCH ▫︎Structured query on concrete fields (similar to SQL) ▫︎Full-text query (sorts results by relevance) ▫︎Combination of the two

Slide 129

Slide 129 text

SEARCH BY EXACT VALUES ▫︎Examples: ▫︎date ▫︎user ID ▫︎username ▫︎“Does this document match the query?” 129

Slide 130

Slide 130 text

SEARCH BY EXACT VALUES ▫︎SQL queries: SELECT * FROM user WHERE name = "John Smith" AND user_id = 2 AND date > "2014-09-15"

Slide 131

Slide 131 text

FULL-TEXT SEARCH ▫︎Examples: ▫︎the text of a tweet ▫︎body of an email ▫︎“How well does this document match the query?” 131

Slide 132

Slide 132 text

FULL-TEXT SEARCH ▫︎UK should also match United Kingdom ▫︎jump should also match jumped, jumps, jumping and leap 132

Slide 133

Slide 133 text

FULL-TEXT SEARCH ▫︎fox news hunting should return stories about hunting on Fox News ▫︎fox hunting news should return news stories about fox hunting 133

Slide 134

Slide 134 text

HOW ELASTICSEARCH PERFORMS TEXT SEARCH ▫︎Analyzes the text ▫︎Tokenizes into terms ▫︎Normalizes the terms ▫︎Builds an inverted index 134

Slide 135

Slide 135 text

LIST OF INDEXED DOCUMENTS
ID   Text
1    Baseball is played during summer months.
2    Summer is the time for picnics here.
3    Months later we found out why.
4    Why is summer so hot here.

Slide 136

Slide 136 text

INVERTED INDEX
Term       Frequency   Document IDs
baseball   1           1
during     1           1
found      1           3
here       2           2, 4
hot        1           4
is         3           1, 2, 4
months     2           1, 3
summer     3           1, 2, 4
the        1           2
why        2           3, 4
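An index like this one can be built in a few lines. The sketch below is an illustration only, using the four example documents and a deliberately crude analysis step (lowercase, split on non-word characters); `build_inverted_index` is a hypothetical name, not an Elasticsearch API.

```python
import re
from collections import defaultdict

documents = {
    1: "Baseball is played during summer months.",
    2: "Summer is the time for picnics here.",
    3: "Months later we found out why.",
    4: "Why is summer so hot here.",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(documents)
for term in ("summer", "here", "why"):
    # term, number of documents containing it, document IDs
    print(term, len(index[term]), index[term])
```

Looking up a term is then a single dictionary access, which is why full-text search over the index stays fast as the collection grows.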

Slide 137

Slide 137 text

QUERY DSL GET /_search { "query": YOUR_QUERY_HERE }

Slide 138

Slide 138 text

QUERY BY FIELD { "match": { "tweet": "elasticsearch" } }

Slide 139

Slide 139 text

QUERY BY FIELD curl -X GET -d @part-2/elasticsearch-tweets-query.json localhost:9200/_all/tweet/_search 139

Slide 140

Slide 140 text

QUERY WITH MULTIPLE CLAUSES { "bool": { "must": { "match": { "tweet": "elasticsearch"} }, "must_not": { "match": { "name": "mary" } }, "should": { "match": { "tweet": "full text" } } } }

Slide 141

Slide 141 text

QUERY WITH MULTIPLE CLAUSES curl -X GET -d @part-2/combining-tweet-queries.json localhost:9200/_all/tweet/_search 141

Slide 142

Slide 142 text

"_score": 0.07082729, "_source": { … "name": "John Smith", "tweet": "The Elasticsearch API is really easy to use" }, … "_score": 0.049890988, "_source": { … "name": "John Smith", "tweet": "Elasticsearch surely is one of the hottest new NoSQL products" }, … "_score": 0.03991279, "_source": { … "name": "John Smith", "tweet": "Elasticsearch and I have left the honeymoon stage, and I still love her." } QUERY WITH MULTIPLE CLAUSES 142

Slide 143

Slide 143 text

MOST IMPORTANT QUERIES ▫︎match ▫︎match_all ▫︎multi_match ▫︎bool 143
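Of these, bool is the one that combines other clauses. A toy sketch of its semantics over in-memory tweets: must clauses are required, must_not clauses exclude, and should clauses only improve the ranking. The word matching and the integer bonus are simplifications, not Elasticsearch's scoring, and the last tweet is a made-up document added for illustration.

```python
def matches(doc, field, text):
    """True if any query word appears in the field (case-insensitive)."""
    words = doc.get(field, "").lower().split()
    return any(word in words for word in text.lower().split())


def bool_query(docs, must=(), must_not=(), should=()):
    """Filter by must / must_not, then rank by the number of should clauses hit."""
    ranked = []
    for doc in docs:
        if not all(matches(doc, f, t) for f, t in must):
            continue
        if any(matches(doc, f, t) for f, t in must_not):
            continue
        bonus = sum(1 for f, t in should if matches(doc, f, t))
        ranked.append((bonus, doc))
    ranked.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for _, doc in ranked]


tweets = [
    {"name": "Mary Jones", "tweet": "Elasticsearch means full text search has never been so easy"},
    {"name": "John Smith", "tweet": "The Elasticsearch API is really easy to use"},
    {"name": "John Smith", "tweet": "The Query DSL is really powerful and flexible"},
    {"name": "John Smith", "tweet": "Elasticsearch makes full text search fun"},  # hypothetical
]

hits = bool_query(
    tweets,
    must=[("tweet", "elasticsearch")],
    must_not=[("name", "mary")],
    should=[("tweet", "full text")],
)
for hit in hits:
    print(hit["name"], "|", hit["tweet"])
```

Mary's tweet is dropped by must_not even though it matches everything else, and the tweet that also satisfies the should clause moves to the top.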

Slide 144

Slide 144 text

QUERIES VS. FILTERS ▫︎Queries: ▫︎full-text ▫︎“how well does the document match?” ▫︎Filters: ▫︎exact values ▫︎yes-no questions 144

Slide 145

Slide 145 text

QUERIES VS. FILTERS ▫︎The goal of filters is to reduce the number of documents that have to be examined by a query

Slide 146

Slide 146 text

PERFORMANCE COMPARISON ▫︎Filters are easy to cache and can be reused efficiently ▫︎Queries are heavier and non-cacheable

Slide 147

Slide 147 text

WHEN TO USE WHICH ▫︎Use queries only for full-text search ▫︎Use filters for anything else

Slide 148

Slide 148 text

"filtered": { "filter": { "term": { "user_id": 1 } } } FILTER BY EXACT FIELD VALUES 148

Slide 149

Slide 149 text

FILTER BY EXACT FIELD VALUES curl -X GET -d @part-2/user-id-filter.json localhost:9200/_search

Slide 150

Slide 150 text

"filtered": { "filter": { "range": { "date": { "gte": "2014-09-20" } } } } FILTER BY EXACT FIELD VALUES 150

Slide 151

Slide 151 text

FILTER BY EXACT FIELD VALUES curl -X GET -d @part-2/date-filter.json localhost:9200/_search

Slide 152

Slide 152 text

MOST IMPORTANT FILTERS ▫︎term ▫︎terms ▫︎range ▫︎exists and missing ▫︎bool 152

Slide 153

Slide 153 text

"filtered": { "query": { "match": { "tweet": "elasticsearch" } }, "filter": { "term": { "user_id": 1 } } } COMBINING QUERIES WITH FILTERS 153

Slide 154

Slide 154 text

COMBINING QUERIES WITH FILTERS curl -X GET -d @part-2/filtered-tweet-query.json localhost:9200/_search

Slide 155

Slide 155 text

SORTING ▫︎Relevance score ▫︎The higher the score, the better ▫︎By default, results are returned in descending order of relevance ▫︎You can sort by any field

Slide 156

Slide 156 text

RELEVANCE SCORE ▫︎Similarity algorithm ▫︎Term Frequency / Inverse Document Frequency (TF/IDF) 156

Slide 157

Slide 157 text

RELEVANCE SCORE ▫︎Term frequency ▫︎How often does the term appear in the field? ▫︎The more often, the more relevant

Slide 158

Slide 158 text

RELEVANCE SCORE ▫︎Inverse document frequency ▫︎How often does each term appear in the index? ▫︎The more often, the less relevant 158

Slide 159

Slide 159 text

RELEVANCE SCORE ▫︎Field-length norm ▫︎How long is the field? ▫︎The longer it is, the less likely it is that words in the field will be relevant
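The slides name the three factors but not their formulas. As a rough sketch, the functions below use the formulas of Lucene's classic TF/IDF similarity from that era; this is an assumption, since the deck does not say which similarity is configured, and the function names are illustrative.

```python
import math


def term_frequency(freq):
    """More occurrences in the field means more relevant, with diminishing returns."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """Terms that appear in many documents carry less weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """A match in a short field counts for more than one in a long field."""
    return 1.0 / math.sqrt(num_terms)


# One document on a shard, containing the term once in a one-term field.
idf = inverse_document_frequency(num_docs=1, doc_freq=1)
print(round(idf, 8))
print(term_frequency(1), field_length_norm(1))
```

With one document on the shard the IDF comes out at about 0.30685282, the same value as the `_score` in the last_name search responses in Part 1. Reading that as "each matching document sits alone on its shard" is an inference, not something the slides state.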

Slide 160

Slide 160 text

2-2 ANALYTICS How many needles are in the haystack? 160

Slide 161

Slide 161 text

SEARCH ▫︎Just looks for the needle in the haystack 161

Slide 162

Slide 162 text

BUSINESS QUESTIONS ▫︎How many needles are in the haystack? ▫︎What is the average length of the needles? ▫︎What is the median length of the needles, by manufacturer? ▫︎How many needles were added to the haystack each month?

Slide 163

Slide 163 text

BUSINESS QUESTIONS ▫︎What are your most popular needle manufacturers? ▫︎Are there any anomalous clumps of needles?

Slide 164

Slide 164 text

AGGREGATIONS ▫︎Answer Analytics questions ▫︎Can be combined with Search ▫︎Near real-time in Elasticsearch ▫︎SQL queries can take days 164

Slide 165

Slide 165 text

AGGREGATIONS Buckets + Metrics 165

Slide 166

Slide 166 text

BUCKETS ▫︎Collection of documents that meet certain criteria ▫︎Can be nested inside other buckets

Slide 167

Slide 167 text

BUCKETS ▫︎Employee ⇒ male or female bucket ▫︎San Francisco ⇒ California bucket ▫︎2014-10-28 ⇒ October bucket

Slide 168

Slide 168 text

METRICS ▫︎Calculations on top of buckets ▫︎Answer the questions ▫︎Ex: min, max, mean, sum… 168

Slide 169

Slide 169 text

EXAMPLE ▫︎Partition by country (bucket) ▫︎…then partition by gender (bucket) ▫︎…then partition by age ranges (bucket) ▫︎…calculate the average salary for each age range (metric) 169
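The bucket-then-metric idea can be sketched in plain Python. The employee records and the `nested_average` helper are hypothetical, and the age-range level is left out to keep the example short.

```python
from collections import defaultdict

employees = [  # hypothetical sample data
    {"country": "US", "gender": "female", "salary": 5000},
    {"country": "US", "gender": "female", "salary": 7000},
    {"country": "US", "gender": "male", "salary": 6000},
    {"country": "BR", "gender": "female", "salary": 4000},
    {"country": "BR", "gender": "male", "salary": 3000},
    {"country": "BR", "gender": "male", "salary": 5000},
]


def nested_average(docs, bucket_fields, metric_field):
    """Partition docs by each bucket field in turn, then average the metric per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        key = tuple(doc[field] for field in bucket_fields)  # one key per nested bucket
        buckets[key].append(doc[metric_field])
    return {key: sum(values) / len(values) for key, values in buckets.items()}


averages = nested_average(employees, ["country", "gender"], "salary")
for key, value in sorted(averages.items()):
    print(key, value)
```

Each tuple key plays the role of a bucket nested inside another bucket; the average is the metric computed on top of it.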

Slide 170

Slide 170 text

CAR TRANSACTIONS EXAMPLE ▫︎/cars/transactions 170

Slide 171

Slide 171 text

CAR TRANSACTIONS EXAMPLE /cars/transactions/ AVFr1xbVmdUYWpF46Ps4 { "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" } 171

Slide 172

Slide 172 text

CAR TRANSACTIONS EXAMPLE ./part-2/load-car-data.sh 172

Slide 173

Slide 173 text

BEST SELLING CAR COLOR { "aggs": { "colors": { "terms": { "field": "color" } } } }

Slide 174

Slide 174 text

BEST SELLING CAR COLOR curl -X GET -d @part-2/best-selling-car-color.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

Slide 175

Slide 175 text

"colors" : { "buckets" : [{ "key" : "red", "doc_count" : 16 }, { "key" : "blue", "doc_count" : 8 }, { "key" : "green", "doc_count" : 8 }] } BEST SELLING CAR COLOR 175

Slide 176

Slide 176 text

AVERAGE CAR COLOR PRICE { "aggs": { "colors": { "terms": { "field": "color" }, "aggs": { "avg_price": { "avg": { "field": "price" } } } } } }

Slide 177

Slide 177 text

AVERAGE CAR COLOR PRICE curl -X GET -d @part-2/average-car-color-price.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

Slide 178

Slide 178 text

"colors" : { "buckets": [{ "key": "red", "doc_count": 16, "avg_price": { "value": 32500.0 } }, { "key": "blue", "doc_count": 8, "avg_price": { "value": 20000.0 } }, { "key": "green", "doc_count": 8, "avg_price": { "value": 21000.0 } }] } AVERAGE CAR COLOR PRICE 178

Slide 179

Slide 179 text

BUILDING BAR CHARTS ▫︎Very easy to convert aggregations to charts and graphs ▫︎Ex: histograms and time-series 179

Slide 180

Slide 180 text

CAR SALES REVENUE HISTOGRAM { "aggs": { "price": { "histogram": { "field": "price", "interval": 20000 }, "aggs": { "revenue": {"sum": {"field" : "price"}} } } } }

Slide 181

Slide 181 text

CAR SALES REVENUE HISTOGRAM curl -X GET -d @part-2/car-revenue-histogram.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

Slide 182

Slide 182 text

"price" : { "buckets": [ { "key": 0, "doc_count": 12, "revenue": {"value": 148000.0} }, { "key": 20000, "doc_count": 16, "revenue": {"value": 380000.0} }, { "key": 40000, "doc_count": 0, "revenue": {"value": 0.0} }, { "key": 60000, "doc_count": 0, "revenue": {"value": 0.0} }, { "key": 80000, "doc_count": 4, "revenue": {"value" : 320000.0} } ]} CAR SALES REVENUE HISTOGRAM 182

Slide 183

Slide 183 text

CAR SALES REVENUE HISTOGRAM 183
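How a price lands in a histogram bucket can be sketched as follows: the bucket key is the price rounded down to a multiple of the interval. The prices are hypothetical sample values and `revenue_histogram` is an illustrative helper; empty buckets between the lowest and highest key are filled in, mirroring the zero-count buckets in the response above.

```python
def revenue_histogram(prices, interval):
    """Bucket prices by interval and sum the revenue in each bucket."""
    buckets = {}
    for price in prices:
        key = (price // interval) * interval  # round down to the bucket's lower bound
        count, revenue = buckets.get(key, (0, 0))
        buckets[key] = (count + 1, revenue + price)
    if not buckets:
        return []
    result = []
    # Walk every bucket between the smallest and largest key, including empty ones.
    for key in range(min(buckets), max(buckets) + interval, interval):
        count, revenue = buckets.get(key, (0, 0))
        result.append({"key": key, "doc_count": count, "revenue": revenue})
    return result


prices = [10000, 12000, 20000, 25000, 80000]  # hypothetical sample data
for bucket in revenue_histogram(prices, interval=20000):
    print(bucket)
```

Each returned dictionary maps directly onto one bar of a chart: `key` on the x-axis, `revenue` or `doc_count` on the y-axis.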

Slide 184

Slide 184 text

TIME-SERIES DATA ▫︎Data with a timestamp: ▫︎How many cars sold each month this year? ▫︎What was the price of this stock for the last 12 hours? ▫︎What was the average latency of our website every hour in the last week? 184

Slide 185

Slide 185 text

HOW MANY CARS SOLD PER MONTH? { "aggs": { "sales": { "date_histogram": { "field": "sold", "interval": "month", "format": "yyyy-MM-dd" } } } }

Slide 186

Slide 186 text

HOW MANY CARS SOLD PER MONTH? curl -X GET -d @part-2/car-sales-per-month.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

Slide 187

Slide 187 text

"sales" : { "buckets" : [ {"key_as_string": "2014-01-01", "doc_count": 4}, {"key_as_string": "2014-02-01", "doc_count": 4}, {"key_as_string": "2014-03-01", "doc_count": 0}, {"key_as_string": "2014-04-01", "doc_count": 0}, {"key_as_string": "2014-05-01", "doc_count": 4}, {"key_as_string": "2014-06-01", "doc_count": 0}, {"key_as_string": "2014-07-01", "doc_count": 4}, {"key_as_string": "2014-08-01", "doc_count": 4}, {"key_as_string": "2014-09-01", "doc_count": 0}, {"key_as_string": "2014-10-01", "doc_count": 4}, {"key_as_string": "2014-11-01", "doc_count": 8} ] } HOW MANY CARS SOLD PER MONTH? 187

Slide 188

Slide 188 text

HOW MANY CARS SOLD PER MONTH? 188

Slide 189

Slide 189 text

PART 3 Dealing with human language 189

Slide 190

Slide 190 text

3-1 INVERTED INDEX 190

Slide 191

Slide 191 text

INVERTED INDEX ▫︎Data structure ▫︎Efficient full-text search

Slide 192

Slide 192 text

EXAMPLE ▫︎Document 1: The quick brown fox jumped over the lazy dog ▫︎Document 2: Quick brown foxes leap over lazy dogs in summer

Slide 193

Slide 193 text

TOKENIZATION ▫︎Document 1: ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"] ▫︎Document 2: ["Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"]

Slide 194

Slide 194 text

Term     Document 1   Document 2
Quick                 X
The      X
brown    X            X
dog      X
dogs                  X
fox      X
foxes                 X
in                    X
jumped   X
lazy     X            X
leap                  X
over     X            X
quick    X
summer                X
the      X

Slide 195

Slide 195 text

EXAMPLE ▫︎Searching for “quick brown” ▫︎Naive similarity algorithm: ▫︎Document 1 is a better match
Term    Document 1   Document 2
brown   X            X
quick   X
Total   2            1

Slide 196

Slide 196 text

A FEW PROBLEMS ▫︎Quick and quick are the same word ▫︎fox and foxes are pretty similar ▫︎jumped and leap are synonyms 196

Slide 197

Slide 197 text

NORMALIZATION ▫︎Quick lowercased to quick ▫︎foxes stemmed to fox ▫︎jumped and leap replaced by jump 197

Slide 198

Slide 198 text

BETTER INVERTED INDEX
Term     Document 1   Document 2
brown    X            X
dog      X            X
fox      X            X
in                    X
jump     X            X
lazy     X            X
over     X            X
quick    X            X
summer                X
the      X

Slide 199

Slide 199 text

SEARCH INPUT ▫︎You can only find terms that exist in the inverted index ▫︎The query string is also normalized
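Both points can be seen in a small sketch. The same `analyze` function is applied to the documents at index time and to the query at search time. The `NORMALIZE` table is a hand-written stand-in for this example only; a real analyzer would use stemmers and synonym filters instead.

```python
import re

# Hand-written normalization for this example only (stemming plus one synonym).
NORMALIZE = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def analyze(text):
    """Lowercase, tokenize, then map each token to its normalized form."""
    tokens = re.findall(r"\w+", text.lower())
    return [NORMALIZE.get(token, token) for token in tokens]


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)


def search(query):
    """Naive similarity: count how many query terms each document contains."""
    scores = {}
    for term in analyze(query):  # the query goes through the same analysis
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores


print(search("Quick brown"))
print(search("foxes leap"))
```

After normalization both documents score 2 for either query, whereas the raw index favored Document 1 for "quick brown" and missed Document 1 entirely for "leap".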

Slide 200

Slide 200 text

3-2 ANALYZERS 200

Slide 201

Slide 201 text

ANALYSIS ▫︎Tokenizes a block of text into terms ▫︎Normalizes terms to standard form ▫︎Improves searchability 201

Slide 202

Slide 202 text

ANALYZERS ▫︎Pipeline: ▫︎Character filters ▫︎Tokenizer ▫︎Token filters
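The three stages compose like ordinary functions. The sketch below is a toy pipeline, not Elasticsearch's implementation: every function name and the tiny stopword list are assumptions made for illustration.

```python
import re


def strip_html(text):
    """Character filter: runs on the raw string before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: splits the string into individual tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: normalizes case."""
    return [token.lower() for token in tokens]


STOPWORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list


def remove_stopwords(tokens):
    """Token filter: drops very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The QUICK brown fox</p>",
    char_filters=[strip_html],
    tokenizer=tokenize,
    token_filters=[lowercase, remove_stopwords],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

Order matters: lowercasing has to run before stopword removal here, otherwise "The" would slip past the lowercase stopword list.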

Slide 203

Slide 203 text

BUILT-IN ANALYZERS ▫︎Standard analyzer ▫︎Language-specific analyzers ▫︎30+ languages supported

Slide 204

Slide 204 text

TESTING THE STANDARD ANALYZER GET /_analyze?analyzer=standard The quick brown fox jumped over the lazy dog.

Slide 205

Slide 205 text

TESTING THE STANDARD ANALYZER curl -X GET -d @part-3/quick-brown-fox.txt 'localhost:9200/_analyze?analyzer=standard&pretty'

Slide 206

Slide 206 text

"tokens" : [ {"token": "the", …}, {"token": "quick", …}, {"token": "brown", …}, {"token": "fox", …}, {"token": "jumps", …}, {"token": "over", …}, {"token": "the", …}, {"token": "lazy", …}, {"token": "dog", …} ] TESTING THE STANDARD ANALYZER 206

Slide 207

Slide 207 text

TESTING THE ENGLISH ANALYZER GET /_analyze?analyzer=english The quick brown fox jumped over the lazy dog.

Slide 208

Slide 208 text

TESTING THE ENGLISH ANALYZER curl -X GET -d @part-3/quick-brown-fox.txt 'localhost:9200/_analyze?analyzer=english&pretty'

Slide 209

Slide 209 text

"tokens" : [ {"token": "quick", …}, {"token": "brown", …}, {"token": "fox", …}, {"token": "jump", …}, {"token": "over", …}, {"token": "lazi", …}, {"token": "dog", …} ] TESTING THE ENGLISH ANALYZER 209

Slide 210

Slide 210 text

TESTING THE BRAZILIAN ANALYZER GET /_analyze?analyzer=brazilian A rápida raposa marrom pulou sobre o cachorro preguiçoso.

Slide 211

Slide 211 text

TESTING THE BRAZILIAN ANALYZER curl -X GET -d @part-3/raposa-rapida.txt 'localhost:9200/_analyze?analyzer=brazilian&pretty'

Slide 212

Slide 212 text

"tokens" : [ {"token": "rap", …}, {"token": "rapos", …}, {"token": "marrom", …}, {"token": "pul", …}, {"token": "cachorr", …}, {"token": "preguic", …} ] TESTING THE BRAZILIAN ANALYZER 212

Slide 213

Slide 213 text

STEMMERS ▫︎Algorithmic stemmers: ▫︎Faster ▫︎Less precise ▫︎Dictionary stemmers: ▫︎Slower ▫︎More precise 213

Slide 214

Slide 214 text

3-3 MAPPING 214

Slide 215

Slide 215 text

MAPPING ▫︎Every document has a type ▫︎Every type has its own mapping ▫︎A mapping defines: ▫︎The fields ▫︎The datatype for each field 215

Slide 216

Slide 216 text

MAPPING ▫︎Elasticsearch guesses the mapping when a new field is added ▫︎Should customize the mapping for improved search and performance ▫︎Must customize the mapping when type is created

Slide 217

Slide 217 text

MAPPING ▫︎A field's mapping cannot be changed ▫︎You can still add new fields ▫︎Only option is to reindex all documents ▫︎Reindexing with zero-downtime: ▫︎index aliases

Slide 218

Slide 218 text

CORE FIELD TYPES ▫︎String ▫︎Integer ▫︎Floating-point ▫︎Boolean ▫︎Date ▫︎Inner Objects 218

Slide 219

Slide 219 text

VIEWING THE MAPPING GET /{index}/_mapping/{type}

Slide 220

Slide 220 text

VIEWING THE MAPPING curl -X GET 'localhost:9200/gb/_mapping/tweet?pretty'

Slide 221

Slide 221 text

"date": { "type": "date", "format": "strict_date_optional_time…" }, "name": { "type": "string" }, "tweet": { "type": "string" }, "user_id": { "type": "long" } VIEWING THE MAPPING 221

Slide 222

Slide 222 text

CUSTOMIZING FIELD MAPPINGS ▫︎Distinguish between: ▫︎Full-text string fields ▫︎Exact value string fields ▫︎Use language-specific analyzers 222

Slide 223

Slide 223 text

STRING MAPPING ATTRIBUTES ▫︎index: ▫︎analyzed (full-text search, default) ▫︎not_analyzed (exact value) ▫︎analyzer: ▫︎standard (default) ▫︎english ▫︎… 223

Slide 224

Slide 224 text

ADDING NEW SEARCHABLE FIELD PUT /gb,us/_mapping/tweet { "properties": { "description": { "type": "string", "index": "analyzed", "analyzer": "english" } } }

Slide 225

Slide 225 text

ADDING NEW SEARCHABLE FIELD curl -X PUT -d @part-3/add-new-mapping.json 'localhost:9200/gb,us/_mapping/tweet?pretty' 225

Slide 226

Slide 226 text

ADDING NEW SEARCHABLE FIELD curl -X GET 'localhost:9200/us,gb/_mapping/tweet?pretty' 226

Slide 227

Slide 227 text

… "description": { "type": "string", "analyzer": "english" }… ADDING NEW SEARCHABLE FIELD 227
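A small sketch of how the `index` and `analyzer` attributes from the previous slides combine into a mapping body. The helper `string_field` and the `tag` field are hypothetical; only the `description` entry comes from the slides.

```python
import json

def string_field(full_text=True, analyzer=None):
    """Build a string field mapping: analyzed for full text, not_analyzed for exact values."""
    field = {"type": "string", "index": "analyzed" if full_text else "not_analyzed"}
    if full_text and analyzer:
        field["analyzer"] = analyzer  # analyzers only apply to analyzed fields
    return field

mapping = {
    "properties": {
        "description": string_field(analyzer="english"),  # as on the slide
        "tag": string_field(full_text=False),              # hypothetical exact-value field
    }
}
print(json.dumps(mapping, indent=2))
```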

Slide 228

Slide 228 text

3-4 PROXIMITY MATCHING 228

Slide 229

Slide 229 text

THE PROBLEM ▫︎Sue ate the alligator ▫︎The alligator ate Sue ▫︎Sue never goes anywhere without her alligator-skin purse 229

Slide 230

Slide 230 text

THE PROBLEM ▫︎Search for “sue alligator” would match all three ▫︎Sue and alligator may be separated by paragraphs of other text 230

Slide 231

Slide 231 text

HEURISTIC ▫︎Words that appear near each other are probably related ▫︎Give documents in which the words are close together a higher relevance score 231

Slide 232

Slide 232 text

GET /_analyze?analyzer=standard Quick brown fox. TERM POSITIONS 232

Slide 233

Slide 233 text

"tokens": [ { "token": "quick", … "position": 1 }, { "token": "brown", … "position": 2 }, { "token": "fox", … "position": 3 } ] TERM POSITIONS 233

Slide 234

Slide 234 text

GET /{index}/{type}/_search { "query": { "match_phrase": { "title": "quick brown fox" } } } EXACT PHRASE MATCHING 234

Slide 235

Slide 235 text

EXACT PHRASE MATCHING ▫︎quick, brown and fox must all appear ▫︎The position of brown must be 1 greater than the position of quick ▫︎The position of fox must be 2 greater than the position of quick 235 quick brown fox
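The three conditions above can be sketched with a tiny in-memory positional index. This is an illustration of the idea, not how Lucene implements it; the whitespace tokenizer and the helper names are assumptions, and positions start at 1 to mirror the TERM POSITIONS slide.

```python
def positions_index(text):
    """Map each lowercased token to the list of positions where it occurs (1-based)."""
    index = {}
    for position, token in enumerate(text.lower().split(), start=1):
        index.setdefault(token.strip(".,!?"), []).append(position)
    return index

def match_phrase(text, phrase):
    """True if every phrase term appears, at consecutive positions, in order."""
    index = positions_index(text)
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return False
    # Term number `offset` must sit exactly `offset` positions after the first term.
    return any(
        all(start + offset in index[term] for offset, term in enumerate(terms))
        for start in index[terms[0]]
    )

print(match_phrase("Quick brown fox.", "quick brown fox"))
print(match_phrase("Quick brown fox.", "quick fox"))
```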

Slide 236

Slide 236 text

FLEXIBLE PHRASE MATCHING ▫︎Exact phrase matching is too strict ▫︎“quick fox” should also match ▫︎Slop matching 236 quick brown fox

Slide 237

Slide 237 text

"query": { "match_phrase": { "title": { "query": "quick fox", "slop": 1 } } } FLEXIBLE PHRASE MATCHING 237

Slide 238

Slide 238 text

SLOP MATCHING ▫︎How many times are you allowed to move a term to make the query and the document match? ▫︎Slop(n) 238

Slide 239

Slide 239 text

SLOP MATCHING ▫︎Document: quick brown fox ▫︎Query: quick fox ▫︎Slop(1): move fox one position to the right 239

Slide 240

Slide 240 text

SLOP MATCHING ▫︎Document: quick brown fox ▫︎Query: fox quick ▫︎Slop(1): fox and quick share a position ▫︎Slop(2): quick fox ▫︎Slop(3): quick … fox, which matches the document 240
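A simplified model of the slop calculation, assuming whitespace tokens and a query with no repeated terms. Each query term's document position is shifted back by its offset in the query; the spread between the largest and smallest shifted position is the number of moves needed. The function name `required_slop` is hypothetical.

```python
from itertools import product

def required_slop(doc_tokens, query_tokens):
    """Smallest slop that lets the query match, or None if a term is missing."""
    candidates = []
    for term in query_tokens:
        found = [i for i, token in enumerate(doc_tokens) if token == term]
        if not found:
            return None
        candidates.append(found)
    best = None
    for combo in product(*candidates):
        # In an exact phrase match every shifted position is identical (spread 0).
        shifted = [position - offset for offset, position in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best

document = "quick brown fox".split()
print(required_slop(document, "quick fox".split()))
print(required_slop(document, "fox quick".split()))
```

Under this model a query matches when `required_slop(...)` is less than or equal to the `slop` value passed to `match_phrase`.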

Slide 241

Slide 241 text

3-5 FUZZY MATCHING 241

Slide 242

Slide 242 text

FUZZY MATCHING ▫︎quick brown fox → fast brown foxes ▫︎Johnny Walker → Johnnie Walker ▫︎Shcwarzenneger → Schwarzenegger 242

Slide 243

Slide 243 text

DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎One-character edits: ▫︎Substitution ▫︎Insertion ▫︎Deletion ▫︎Transposition of two adjacent characters 243

Slide 244

Slide 244 text

DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎One-character substitution: ▫︎ fox → box 244

Slide 245

Slide 245 text

DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Insertion of a new character: ▫︎sic → sick 245

Slide 246

Slide 246 text

DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Deletion of a character: ▫︎black → back 246

Slide 247

Slide 247 text

DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Transposition of two adjacent characters: ▫︎star → tsar 247

Slide 248

Slide 248 text

DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Converting bieber into beaver 1. Substitute: bieber → biever 2. Substitute: biever → baever 3. Transpose: baever → beaver ▫︎Edit distance of 3 248
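The four one-character edits can be sketched as a dynamic program. This is the restricted variant (often called optimal string alignment), shown here as an illustration and not as the code Elasticsearch or Lucene runs; the function name is an assumption.

```python
def edit_distance(a, b):
    """Edit distance with substitution, insertion, deletion and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

for source, target in [("fox", "box"), ("sic", "sick"), ("black", "back"),
                       ("star", "tsar"), ("bieber", "beaver")]:
    print(source, target, edit_distance(source, target))
```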

Slide 249

Slide 249 text

FUZZINESS ▫︎80% of human misspellings have an edit distance of 1 ▫︎Elasticsearch supports a maximum edit distance of 2 ▫︎fuzziness parameter 249

Slide 250

Slide 250 text

FUZZINESS EXAMPLE ./part-3/load-surprise-data.sh 250

Slide 251

Slide 251 text

GET /example/surprise/_search { "query": { "match": { "text": { "query": "surprize" } } } } QUERY WITHOUT FUZZINESS 251

Slide 252

Slide 252 text

QUERY WITHOUT FUZZINESS curl -X GET -d @part-3/surprize-query.json 'localhost:9200/example/surprise/_search?pretty' 252

Slide 253

Slide 253 text

"hits": { "total": 0, "max_score": null, "hits": [ ] } QUERY WITHOUT FUZZINESS 253

Slide 254

Slide 254 text

GET /example/surprise/_search { "query": { "match": { "text": { "query": "surprize", "fuzziness": "1" } } } } QUERY WITH FUZZINESS 254

Slide 255

Slide 255 text

QUERY WITH FUZZINESS curl -X GET -d @part-3/surprize-fuzzy-query.json 'localhost:9200/example/surprise/_search?pretty' 255

Slide 256

Slide 256 text

"hits": [ { "_index": "example", "_type": "surprise", "_id": "1", "_score": 0.19178301, "_source":{ "text": "Surprise me!"} }] QUERY WITH FUZZINESS 256

Slide 257

Slide 257 text

AUTO-FUZZINESS ▫︎0 for strings of one or two characters ▫︎1 for strings of three, four or five characters ▫︎2 for strings of more than five characters 257
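The thresholds on this slide, written out as a function. The name `auto_fuzziness` is hypothetical; it simply restates the three rules above.

```python
def auto_fuzziness(term):
    """Maximum edit distance allowed for a term under the rules on the slide."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: must match exactly
    if length <= 5:
        return 1  # three to five characters: one edit
    return 2      # more than five characters: two edits

for term in ("at", "fox", "quick", "surprize"):
    print(term, auto_fuzziness(term))
```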

Slide 258

Slide 258 text

PART 4 Data modeling 258

Slide 259

Slide 259 text

4-1 INSIDE A CLUSTER 259

Slide 260

Slide 260 text

NODES AND CLUSTERS ▫︎A node is a machine running Elasticsearch ▫︎A cluster is a set of nodes in the same network and with the same cluster name 260

Slide 261

Slide 261 text

SHARDS ▫︎A node stores data inside its shards ▫︎Shards are the smallest unit of scale and replication ▫︎Each shard is a completely independent Lucene index 261

Slide 262

Slide 262 text

AN EMPTY CLUSTER 262

Slide 263

Slide 263 text

GET /_cluster/health CLUSTER HEALTH 263

Slide 264

Slide 264 text

"cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 0, "active_shards": 0, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0 CLUSTER HEALTH 264

Slide 265

Slide 265 text

PUT /blogs { "settings": { "number_of_shards": 3, "number_of_replicas": 1 } } ADD AN INDEX 265

Slide 266

Slide 266 text

ADD AN INDEX 266

Slide 267

Slide 267 text

GET /_cluster/health CLUSTER HEALTH 267

Slide 268

Slide 268 text

"cluster_name": "elasticsearch", "status": "yellow", "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 3, "active_shards": 3, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 3 CLUSTER HEALTH 268

Slide 269

Slide 269 text

ADD A BACKUP NODE 269

Slide 270

Slide 270 text

GET /_cluster/health CLUSTER HEALTH 270

Slide 271

Slide 271 text

"cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 2, "number_of_data_nodes": 2, "active_primary_shards": 3, "active_shards": 6, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0 CLUSTER HEALTH 271

Slide 272

Slide 272 text

THREE NODES 272

Slide 273

Slide 273 text

PUT /blogs/_settings { "number_of_replicas": 2 } INCREASING THE NUMBER OF REPLICAS 273

Slide 274

Slide 274 text

INCREASING THE NUMBER OF REPLICAS 274
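A simplified model of the health numbers shown in this walkthrough. It assumes a single index, that copies of the same shard are never placed on the same node, and that all nodes are data nodes; the function `cluster_health` is a hypothetical stand-in, not the real `_cluster/health` API.

```python
def cluster_health(nodes, primaries, replicas):
    """Estimate shard counts and status for one index on a cluster of `nodes` nodes."""
    copies_wanted = 1 + replicas                # primary plus its replicas
    copies_placed = min(copies_wanted, nodes)   # at most one copy of a shard per node
    active = primaries * copies_placed
    unassigned = primaries * (copies_wanted - copies_placed)
    active_primaries = primaries if nodes > 0 else 0
    if active_primaries < primaries:
        status = "red"      # some primary shard is missing
    elif unassigned > 0:
        status = "yellow"   # all primaries active, some replicas unassigned
    else:
        status = "green"
    return {
        "status": status,
        "active_primary_shards": active_primaries,
        "active_shards": active,
        "unassigned_shards": unassigned,
    }

print(cluster_health(nodes=1, primaries=3, replicas=1))  # one node, replicas cannot be placed
print(cluster_health(nodes=2, primaries=3, replicas=1))  # second node added
print(cluster_health(nodes=3, primaries=3, replicas=2))  # three nodes, two replicas
```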

Slide 275

Slide 275 text

NODE 1 FAILS 275

Slide 276

Slide 276 text

CREATING, INDEXING AND DELETING A DOCUMENT 276

Slide 277

Slide 277 text

RETRIEVING A DOCUMENT 277

Slide 278

Slide 278 text

4-2 RELATIONSHIPS 278

Slide 279

Slide 279 text

RELATIONSHIPS MATTER ▫︎Blog Posts ↔ Comments ▫︎Bank Accounts ↔ Transactions ▫︎Orders ↔ Items ▫︎Directories ↔ Files ▫︎… 279

Slide 280

Slide 280 text

SQL DATABASES ▫︎Entities have a unique primary key ▫︎Normalization: ▫︎Entity data is stored only once ▫︎Entities are referenced by primary key ▫︎Updates happen in only one place 280

Slide 281

Slide 281 text

SQL DATABASES ▫︎Entities are joined at query time SELECT Customer.name, Order.status FROM Order, Customer WHERE Order.customer_id = Customer.id 281

Slide 282

Slide 282 text

SQL DATABASES ▫︎Changes are ACID ▫︎Atomicity ▫︎Consistency ▫︎Isolation ▫︎Durability 282

Slide 283

Slide 283 text

ATOMICITY ▫︎If one part of the transaction fails, the entire transaction fails ▫︎…even in the event of power failure, crashes or errors ▫︎“all or nothing” 283

Slide 284

Slide 284 text

CONSISTENCY ▫︎Any transaction will bring the database from one valid state to another ▫︎State must be valid according to all defined rules: ▫︎Constraints ▫︎Cascades ▫︎Triggers 284

Slide 285

Slide 285 text

ISOLATION ▫︎The concurrent execution of transactions results in the same state that would be obtained if transactions were executed serially ▫︎Concurrency Control 285

Slide 286

Slide 286 text

DURABILITY ▫︎A transaction will remain committed ▫︎…even in the event of power failure, crashes or errors ▫︎Non-volatile memory 286

Slide 287

Slide 287 text

SQL DATABASES ▫︎Joining entities at query time is expensive ▫︎Impractical with multiple nodes 287

Slide 288

Slide 288 text

ELASTICSEARCH ▫︎Treats the world as flat ▫︎An index is a flat collection of independent documents ▫︎A single document should contain all information to match a search request 288

Slide 289

Slide 289 text

ELASTICSEARCH ▫︎ACID support for changes on single documents ▫︎No ACID transactions on multiple documents 289

Slide 290

Slide 290 text

ELASTICSEARCH ▫︎Indexing and searching are fast and lock-free ▫︎Massive amounts of data can be spread across multiple nodes 290

Slide 291

Slide 291 text

ELASTICSEARCH ▫︎But we need relationships! 291

Slide 292

Slide 292 text

ELASTICSEARCH ▫︎Application-side joins ▫︎Data denormalization ▫︎Nested objects ▫︎Parent/child relationships 292

Slide 293

Slide 293 text

4-3 APPLICATION-SIDE JOINS 293

Slide 294

Slide 294 text

APPLICATION-SIDE JOINS ▫︎Emulates a relational database ▫︎Joins at application level ▫︎(index, type, id) = primary key 294

Slide 295

Slide 295 text

PUT /example/user/1 { "name": "John Smith", "email": "[email protected]", "born": "1970-10-24" } EXAMPLE 295

Slide 296

Slide 296 text

PUT /example/blogpost/2 { "title": "Relationships", "body": "It's complicated", "user": 1 } EXAMPLE 296

Slide 297

Slide 297 text

EXAMPLE ▫︎(example, user, 1) = primary key ▫︎Store only the id ▫︎Index and type are hard-coded into the application logic 297

Slide 298

Slide 298 text

GET /example/blogpost/_search { "query": { "filtered": { "filter": { "term": { "user": 1 } } } } } EXAMPLE 298

Slide 299

Slide 299 text

EXAMPLE ▫︎Blogposts written by “John”: ▫︎Find ids of users with name “John” ▫︎Find blogposts that match the user ids 299

Slide 300

Slide 300 text

GET /example/user/_search { "query": { "match": { "name": "John" } } } EXAMPLE 300

Slide 301

Slide 301 text

▫︎For each user id from the first query: GET /example/blogpost/_search { "query": { "filtered": { "filter": { "term": { "user": <user id> } } } } } EXAMPLE 301
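The two-step lookup can be sketched with plain dictionaries standing in for the `user` and `blogpost` types. No Elasticsearch call is made here; the second user and the helper names are hypothetical, and each helper marks the place where a real search request would go.

```python
# In-memory stand-ins for /example/user and /example/blogpost.
USERS = {
    1: {"name": "John Smith"},
    2: {"name": "Alice White"},  # hypothetical second user
}
BLOGPOSTS = {
    2: {"title": "Relationships", "body": "It's complicated", "user": 1},
}

def find_user_ids(name):
    """Step 1: stands in for the match query on the user's name."""
    needle = name.lower()
    return [uid for uid, user in USERS.items() if needle in user["name"].lower().split()]

def find_blogposts_by_users(user_ids):
    """Step 2: stands in for the term filter on the blogpost's user field."""
    wanted = set(user_ids)
    return [post for post in BLOGPOSTS.values() if post["user"] in wanted]

def blogposts_written_by(name):
    """The join itself happens here, in application code."""
    return find_blogposts_by_users(find_user_ids(name))

print([post["title"] for post in blogposts_written_by("John")])
```

The cost grows with the number of ids returned by step 1, which is why a name shared by many users makes this approach expensive.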

Slide 302

Slide 302 text

ADVANTAGES ▫︎Data is normalized ▫︎Change user data in just one place 302

Slide 303

Slide 303 text

DISADVANTAGES ▫︎Run extra queries to join documents ▫︎We could have millions of users named “John” ▫︎Less efficient than SQL joins: ▫︎Several API requests ▫︎Harder to optimize 303

Slide 304

Slide 304 text

WHEN TO USE ▫︎First entity has a small number of documents and they hardly change ▫︎First query results can be cached 304

Slide 305

Slide 305 text

4-4 DATA DENORMALIZATION 305

Slide 306

Slide 306 text

DATA DENORMALIZATION ▫︎No joins ▫︎Store redundant copies of the data you need to query 306

Slide 307

Slide 307 text

PUT /example/user/1 { "name": "John Smith", "email": "[email protected]", "born": "1970-10-24" } EXAMPLE 307

Slide 308

Slide 308 text

PUT /example/blogpost/2 { "title": "Relationships", "body": "It's complicated", "user": { "id": 1, "name": "John Smith" } } EXAMPLE 308

Slide 309

Slide 309 text

GET /example/blogpost/_search { "query": { "bool": { "must": [ { "match": { "title": "relationships" }}, { "match": { "user.name": "John" }} ]}}} EXAMPLE 309

Slide 310

Slide 310 text

ADVANTAGES ▫︎Speed ▫︎No need for expensive joins 310

Slide 311

Slide 311 text

DISADVANTAGES ▫︎Uses more disk space (cheap) ▫︎Update the same data in several places ▫︎scroll and bulk APIs can help ▫︎Concurrency issues ▫︎Locking can help 311
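A sketch of the "update the same data in several places" cost, using in-memory dictionaries as a stand-in for the index. The second blogpost, the new name and the helper `rename_user` are hypothetical; in a real cluster the loop would be a scan over matching documents followed by bulk updates.

```python
users = {1: {"name": "John Smith"}}
blogposts = {
    2: {"title": "Relationships", "user": {"id": 1, "name": "John Smith"}},
    3: {"title": "More thoughts", "user": {"id": 1, "name": "John Smith"}},  # hypothetical
}

def rename_user(user_id, new_name):
    """Change the user and every redundant copy of the name; return copies touched."""
    users[user_id]["name"] = new_name
    updated = 0
    for post in blogposts.values():
        if post["user"]["id"] == user_id:
            post["user"]["name"] = new_name  # one write per denormalized copy
            updated += 1
    return updated

updated = rename_user(1, "Jon Smith")
print(updated, [post["user"]["name"] for post in blogposts.values()])
```

Between the first and the last write, searches can see a mix of old and new names, which is the concurrency issue the slide refers to.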

Slide 312

Slide 312 text

WHEN TO USE ▫︎Need for fast search ▫︎Denormalized data does not change very often 312

Slide 313

Slide 313 text

4-5 NESTED OBJECTS 313

Slide 314

Slide 314 text

MOTIVATION ▫︎Elasticsearch supports ACID when updating single documents ▫︎Querying related data in the same document is faster (no joins) ▫︎We want to avoid denormalization 314

Slide 315

Slide 315 text

PUT /example/blogpost/1 { "title": "Nest eggs", "body": "Making money...", "tags": [ "cash", "shares" ], "comments": […] } THE PROBLEM WITH MULTILEVEL OBJECTS 315

Slide 316

Slide 316 text

[{ "name": "John Smith", "comment": "Great article", "age": 28, "stars": 4, "date": "2014-09-01" }, { "name": "Alice White", "comment": "More like this", "age": 31,"stars": 5, "date": "2014-10-22" }] THE PROBLEM WITH MULTILEVEL OBJECTS 316

Slide 317

Slide 317 text

GET /example/blogpost/_search { "query": { "bool": { "must": [ {"match": {"comments.name": "Alice"}}, {"match": {"comments.age": "28"}} ]}}} THE PROBLEM WITH MULTILEVEL OBJECTS 317

Slide 318

Slide 318 text

Slide 319

Slide 319 text

THE PROBLEM WITH MULTILEVEL OBJECTS ▫︎Alice is 31, not 28! ▫︎It matched the age of John ▫︎This is because indexed documents are stored as a flattened dictionary ▫︎The correlation between Alice and 31 is irretrievably lost 319

Slide 320

Slide 320 text

{"title": [eggs, nest], "body": [making, money], "tags": [cash, shares], "comments.name": [alice, john, smith, white], "comments.comment": [article, great, like, more, this], "comments.age": [28, 31], "comments.stars": [4, 5], "comments.date": [2014-09-01, 2014-10-22]} THE PROBLEM WITH MULTILEVEL OBJECTS 320
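A rough sketch of the flattening shown above, assuming a lowercase whitespace tokenizer in place of a real analyzer. The helper `flatten` is hypothetical; it reproduces the dictionary on this slide and shows why the query for Alice aged 28 wrongly matches.

```python
def flatten(doc, prefix=""):
    """Collapse inner objects into dotted field names, each holding a set of values."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, set()).update(sub_values)
            elif isinstance(item, str):
                words = (word.strip(".,!?") for word in item.lower().split())
                flat.setdefault(path, set()).update(words)
            else:
                flat.setdefault(path, set()).add(item)
    return flat

blogpost = {
    "title": "Nest eggs",
    "body": "Making money...",
    "tags": ["cash", "shares"],
    "comments": [
        {"name": "John Smith", "comment": "Great article", "age": 28, "stars": 4, "date": "2014-09-01"},
        {"name": "Alice White", "comment": "More like this", "age": 31, "stars": 5, "date": "2014-10-22"},
    ],
}

flat = flatten(blogpost)
# Both conditions hold, although no single comment is by Alice, aged 28.
false_match = "alice" in flat["comments.name"] and 28 in flat["comments.age"]
print(sorted(flat["comments.name"]), sorted(flat["comments.age"]), false_match)
```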

Slide 321

Slide 321 text

NESTED OBJECTS ▫︎Nested objects are indexed as hidden separate documents ▫︎Relationships are preserved ▫︎Joining nested documents is very fast 321

Slide 322

Slide 322 text

{"comments.name": [john, smith], "comments.comment": [article, great], "comments.age": [28], "comments.stars": [4], "comments.date": [2014-09-01]} {"comments.name": [alice, white], "comments.comment": [like, more, this], "comments.age": [31], "comments.stars": [5], "comments.date": [2014-10-22]} NESTED OBJECTS 322

Slide 323

Slide 323 text

{ "title": [eggs, nest], "body": [making, money], "tags": [cash, shares] } NESTED OBJECTS 323

Slide 324

Slide 324 text

NESTED OBJECTS ▫︎Need to be enabled by updating the mapping of the index 324

Slide 325

Slide 325 text

PUT /example { "mappings": { "blogpost": { "properties": { "comments": { "type": "nested", "properties": { "name": {"type": "string"}, "comment": {"type": "string"}, "age": {"type": "short"}, "stars": {"type": "short"}, "date": {"type": "date"} }}}}}} MAPPING A NESTED OBJECT 325

Slide 326

Slide 326 text

GET /example/blogpost/_search { "query": { "bool": { "must": [ {"match": {"title": "eggs"}}, {"nested": { … }} ] } } } QUERYING A NESTED OBJECT 326

Slide 327

Slide 327 text

"nested": { "path": "comments", "query": { "bool": { "must": [ {"match": {"comments.name": "john"}}, {"match": {"comments.age": 28}} ]}}} NESTED QUERY 327

Slide 328

Slide 328 text

THERE’S MORE ▫︎Nested filters ▫︎Nested aggregations ▫︎Sorting by nested fields 328
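A sketch of what the nested query changes: every comment is checked on its own, so the name and the age must come from the same comment. The list below stands in for the hidden nested documents, and `nested_match` is a hypothetical helper, not an Elasticsearch API.

```python
COMMENTS = [
    {"name": "John Smith", "comment": "Great article", "age": 28, "stars": 4, "date": "2014-09-01"},
    {"name": "Alice White", "comment": "More like this", "age": 31, "stars": 5, "date": "2014-10-22"},
]

def nested_match(comments, name, age):
    """True only if one single comment satisfies both conditions."""
    needle = name.lower()
    return any(
        needle in comment["name"].lower().split() and comment["age"] == age
        for comment in comments
    )

print(nested_match(COMMENTS, "John", 28))   # same comment: match
print(nested_match(COMMENTS, "Alice", 28))  # values from different comments: no match
```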

Slide 329

Slide 329 text

ADVANTAGES ▫︎Very fast query-time joins ▫︎ACID support (single documents) ▫︎Convenient search using nested queries 329

Slide 330

Slide 330 text

DISADVANTAGES ▫︎To add, change or delete a nested object, the whole document must be reindexed ▫︎Search requests return the whole document 330

Slide 331

Slide 331 text

WHEN TO USE ▫︎When there is one main entity with a limited number of closely related entities ▫︎Ex: blogposts and comments ▫︎Inefficient if there are too many nested objects 331

Slide 332

Slide 332 text

4-6 PARENT-CHILD RELATIONSHIP 332

Slide 333

Slide 333 text

PARENT-CHILD RELATIONSHIP ▫︎One-to-many relationship ▫︎Similar to the nested model ▫︎Nested objects live in the same document ▫︎Parent and children are completely separate documents 333

Slide 334

Slide 334 text

EXAMPLE ▫︎Company with branches and employees ▫︎Branch is the parent ▫︎Employees are the children 334

Slide 335

Slide 335 text

PUT /company { "mappings": { "branch": {}, "employee": { "_parent": { "type": "branch" } } } } EXAMPLE 335

Slide 336

Slide 336 text

PUT /company/branch/london { "name": "London Westminster", "city": "London", "country": "UK" } EXAMPLE 336

Slide 337

Slide 337 text

PUT /company/employee/1?parent=london { "name": "Alice Smith", "born": "1970-10-24", "hobby": "hiking" } EXAMPLE 337

Slide 338

Slide 338 text

GET /company/branch/_search { "query": { "has_child": { "type": "employee", "query": { "range": { "born": { "gte": "1980-01-01" } }}}}} FINDING PARENTS BY THEIR CHILDREN 338

Slide 339

Slide 339 text

GET /company/employee/_search { "query": { "has_parent": { "type": "branch", "query": { "match": { "country": "UK" } }}}} FINDING CHILDREN BY THEIR PARENTS 339
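An in-memory sketch of the two query directions. The dictionaries stand in for the `branch` and `employee` types; the Paris branch and the second employee are hypothetical additions, and the helper names only echo the query names.

```python
BRANCHES = {
    "london": {"name": "London Westminster", "city": "London", "country": "UK"},
    "paris": {"name": "Paris Centre", "city": "Paris", "country": "France"},  # hypothetical
}
EMPLOYEES = {
    1: {"name": "Alice Smith", "born": "1970-10-24", "hobby": "hiking", "parent": "london"},
    2: {"name": "Marc Dupont", "born": "1985-03-12", "hobby": "chess", "parent": "paris"},  # hypothetical
}

def has_child(born_gte):
    """Branches with at least one employee born on or after `born_gte` (ISO dates sort as text)."""
    return sorted({e["parent"] for e in EMPLOYEES.values() if e["born"] >= born_gte})

def has_parent(country):
    """Employees whose parent branch is in the given country."""
    return sorted(
        e["name"] for e in EMPLOYEES.values()
        if BRANCHES[e["parent"]]["country"] == country
    )

print(has_child("1980-01-01"))
print(has_parent("UK"))
```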

Slide 340

Slide 340 text

THERE’S MORE ▫︎min_children and max_children ▫︎Children aggregations ▫︎Grandparents and grandchildren 340

Slide 341

Slide 341 text

ADVANTAGES ▫︎Parent document can be updated without reindexing the children ▫︎Child documents can be updated without affecting the parent ▫︎Child documents can be returned in search results without the parent 341

Slide 342

Slide 342 text

ADVANTAGES ▫︎Parent and children live on the same shard ▫︎Faster than application-side joins 342

Slide 343

Slide 343 text

DISADVANTAGES ▫︎Parent document and all of its children must live on the same shard ▫︎5 to 10 times slower than nested queries 343
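A sketch of why the `parent` parameter keeps a family on one shard: the shard is chosen from a routing value, and a child uses its parent's id as that value. `zlib.crc32` is only a stand-in hash here (an assumption, not the hash Elasticsearch uses), and `shard_for` is a hypothetical helper.

```python
import zlib

def shard_for(routing, number_of_primary_shards):
    """Pick a shard as hash(routing) modulo the number of primary shards."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

PRIMARY_SHARDS = 3

parent_shard = shard_for("london", PRIMARY_SHARDS)  # the branch is routed by its own id
child_shard = shard_for("london", PRIMARY_SHARDS)   # the employee is routed by ?parent=london
print(parent_shard, child_shard)
```

Because the routing value is the parent id for every child, all children of a branch end up on the branch's shard regardless of their own ids.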

Slide 344

Slide 344 text

WHEN TO USE ▫︎One-to-many relationships ▫︎When index-time is more important than search-time performance ▫︎Otherwise, use nested objects 344

Slide 345

Slide 345 text

REFERENCES 345

Slide 346

Slide 346 text

MAIN REFERENCE ▫︎Elasticsearch: The Definitive Guide ▫︎Gormley & Tong ▫︎O'Reilly 346

Slide 347

Slide 347 text

OTHER REFERENCES ▫︎"Jepsen: simulating network partitions in DBs", http://github.com/aphyr/jepsen ▫︎"Call me maybe: Elasticsearch 1.5.0", http://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0 ▫︎"Call me maybe: MongoDB stale reads", http://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads 347

Slide 348

Slide 348 text

OTHER REFERENCES ▫︎"Elasticsearch Data Resiliency Status", http://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html ▫︎"Solr vs. Elasticsearch — How to Decide?", http://blog.sematext.com/2015/01/30/solr-elasticsearch-comparison/ 348

Slide 349

Slide 349 text

OTHER REFERENCES ▫︎"Changing Mapping with Zero Downtime", http://www.elastic.co/blog/changing-mapping-with-zero-downtime 349

Slide 350

Slide 350 text

Felipe Dornelas felipedornelas.com @felipead THANK YOU