
Elasticsearch Workshop

A broad, very hands-on Elasticsearch overview in about 4 hours. You will learn the core fundamentals of Elasticsearch and get a glimpse of important Information Retrieval and Distributed Systems concepts.

Part 1 - Core Concepts
Part 2 - Search & Analytics
Part 3 - Dealing with Human Language
Part 4 - Data Modeling

Please download the examples at http://github.com/felipead/elasticsearch-workshop

Felipe Dornelas

January 05, 2016

Transcript

  1. Workshop ELASTICSEARCH Felipe Dornelas
  2. AGENDA ▫︎Part 1 ▫︎Introduction ▫︎Document Store ▫︎Search Examples ▫︎Data Resiliency

    ▫︎Comparison with Solr ▫︎Part 2 ▫︎Search ▫︎Analytics 2
  3. AGENDA ▫︎Part 3 ▫︎Inverted Index ▫︎Analyzers ▫︎Mapping ▫︎Proximity Matching ▫︎Fuzzy

    Matching ▫︎Part 4 ▫︎Inside a Cluster ▫︎Data Modeling 3
  4. → github.com/felipead/elasticsearch-workshop

  5. PRE-REQUISITES ▫︎Vagrant ▫︎VirtualBox ▫︎Git 5

  6. ENVIRONMENT SETUP ▫︎git clone https://github.com/felipead/elasticsearch-workshop.git ▫︎vagrant up ▫︎vagrant ssh ▫︎cd /vagrant
  7. VERIFY EVERYTHING IS WORKING ▫︎curl http://localhost:9200 7

  8. PART 1 Core concepts 8

  9. 1-1 INTRODUCTION You know, for search 9

  10. WHAT IS ELASTICSEARCH? A real-time distributed search and analytics engine

    10
  11. IT CAN BE USED FOR ▫︎Full-text search ▫︎Structured search ▫︎Real-time

    analytics ▫︎…or any combination of the above 11
  12. FEATURES ▫︎Distributed document store: ▫︎RESTful API ▫︎Automatic scale ▫︎Plug &

    Play ™ 12
  13. FEATURES ▫︎Handles the human language: ▫︎Score results by relevance ▫︎Synonyms

    ▫︎Typos and misspellings ▫︎Internationalization 13
  14. FEATURES ▫︎Powerful analytics: ▫︎Comprehensive aggregations ▫︎Geolocations ▫︎Can be combined with

    search ▫︎Real-time (no batch-processing) 14
  15. FEATURES ▫︎Free and open source ▫︎Community support ▫︎Backed by Elastic

    15
  16. MOTIVATION Most databases are inept at extracting knowledge from your

    data 16
  17. SQL DATABASES SQL = Structured Query Language 17

  18. SQL DATABASES ▫︎Can only filter by exact values ▫︎Unable to

    perform full-text search ▫︎Queries can be complex and inefficient ▫︎Often requires big-batch processing 18
  19. APACHE LUCENE ▫︎Arguably, the best search engine ▫︎High performance ▫︎Near

    real-time indexing ▫︎Open source 19
  20. APACHE LUCENE ▫︎But… ▫︎It’s just a Java Library ▫︎Hard to

    use 20
  21. ELASTICSEARCH ▫︎Document Store ▫︎Distributed ▫︎Scalable ▫︎Real Time ▫︎Analytics ▫︎RESTful API

    ▫︎Easy to Use 21
  22. DOCUMENT ORIENTED ▫︎Documents instead of rows / columns ▫︎Every field

    is indexed and searchable ▫︎Serialized to JSON ▫︎Schemaless 22
  23. WHO USES ▫︎GitHub ▫︎Wikipedia ▫︎Stack Overflow ▫︎The Guardian 23

  24. TALKING TO ELASTICSEARCH ▫︎Java API ▫︎Port 9300 ▫︎Native transport protocol

    ▫︎Node client (joins the cluster) ▫︎Transport client (doesn't join the cluster) 24
  25. TALKING TO ELASTICSEARCH ▫︎RESTful API ▫︎Port 9200 ▫︎JSON over HTTP

    25
  26. TALKING TO ELASTICSEARCH We will only cover the RESTful API

    26
  27. USING CURL curl -X <VERB> <URL> -d <BODY> or curl

    -X <VERB> <URL> -d @<FILE> 27
  28. THE EMPTY QUERY curl -X GET -d @part-1/empty-query.json localhost:9200/_count?pretty 28

  29. REQUEST { "query": { "match_all": {} } } 29

  30. RESPONSE { "count": 0, "_shards": { "total": 0, "successful": 0,

    "failed": 0 } } 30
  31. 1-2 DOCUMENT STORE 31

  32. THE PROBLEM WITH RELATIONAL DATABASES ▫︎Stores data in columns and

    rows ▫︎Equivalent of using a spreadsheet ▫︎Inflexible storage medium ▫︎Not suitable for rich objects 32
  33. DOCUMENTS { "name": "John Smith", "age": 42, "confirmed": true, "join_date":

    "2015-06-01", "home": {"lat": 51.5, "lon": 0.1}, "accounts": [ {"type": "facebook", "id": "johnsmith"}, {"type": "twitter", "id": "johnsmith"} ] } 33
  34. DOCUMENT METADATA ▫︎Index - Where the document lives ▫︎Type -

    Class of object that the document represents ▫︎Id - Unique identifier for the document 34
  35. DOCUMENT METADATA ▫︎Relational DB: Databases → Tables → Rows → Columns ▫︎Elasticsearch: Indices → Types → Documents → Fields
  36. RESTFUL API [VERB] /{index}/{type}/{id}?pretty GET | POST | PUT |

    DELETE | HEAD 36
  37. RESTFUL API ▫︎JSON-only ▫︎Adding pretty to the query-string parameters pretty-prints

    the response 37
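
The URL pattern above can also be driven from a script. The following is a minimal Python sketch, not part of the workshop material: the helper name `build_request`, the `BASE_URL` constant and the `Content-Type` header are assumptions made for illustration. It only builds the request, so it runs without a node.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: the local workshop node


def build_request(verb, index, doc_type, doc_id=None, body=None, pretty=True):
    """Build (but do not send) a request for [VERB] /{index}/{type}/{id}?pretty."""
    # Skip the id when it is omitted, as in POST /{index}/{type}.
    path = "/".join(part for part in (index, doc_type, doc_id) if part)
    url = f"{BASE_URL}/{path}" + ("?pretty" if pretty else "")
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(url, data=data, headers=headers, method=verb)


request = build_request("PUT", "blog", "post", "123", {"title": "My first blog post"})
print(request.get_method(), request.full_url)

# Sending it requires a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Leaving out `doc_id` gives the POST /{index}/{type} form used for autogenerated IDs.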
  38. INDEXING A DOCUMENT WITH YOUR OWN ID PUT /{index}/{type}/{id} 38

  39. INDEXING A DOCUMENT WITH YOUR OWN ID curl -X PUT

    -d @part-1/first-blog-post.json localhost:9200/blog/post/123?pretty 39
  40. REQUEST { "title": "My first blog post", "text": "Just trying

    this out...", "date": "2014-01-01" } 40
  41. RESPONSE { "_index" : "blog", "_type" : "post", "_id" :

    "123", "_version" : 1, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : true } 41
  42. INDEXING A DOCUMENT WITH AUTOGENERATED ID POST /{index}/{type} * Autogenerated

    IDs are Base64-encoded UUIDs 42
  43. INDEXING A DOCUMENT WITH AUTOGENERATED ID curl -X POST -d

    @part-1/second-blog-post.json localhost:9200/blog/post?pretty 43
  44. REQUEST { "title": "Second blog post", "text": "Still trying this

    out...", "date": "2014-01-01" } 44
  45. RESPONSE { "_index" : "blog", "_type" : "post", "_id" :

    "AVFWIbMf7YZ6Se7RwMws", "_version" : 1, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : true } 45
  46. RETRIEVING A DOCUMENT WITH METADATA GET /{index}/{type}/{id} 46

  47. RETRIEVING A DOCUMENT WITH METADATA curl -X GET localhost:9200/blog/post/123?pretty 47

  48. RESPONSE { "_index" : "blog", "_type" : "post", "_id" :

    "123", "_version" : 1, "found" : true, "_source": { "title": "My first blog entry", "text": "Just trying this out...", "date": "2014-01-01" } } 48
  49. RETRIEVING A DOCUMENT WITHOUT METADATA GET /{index}/{type}/{id}/_source 49

  50. RETRIEVING A DOCUMENT WITHOUT METADATA curl -X GET localhost:9200/blog/post/123/_source?pretty
  51. RESPONSE { "title": "My first blog entry", "text": "Just trying

    this out...", "date": "2014-01-01" } 51
  52. RETRIEVING PART OF A DOCUMENT GET /{index}/{type}/{id}?_source={fields}

  53. RETRIEVING PART OF A DOCUMENT curl -X GET 'localhost:9200/blog/post/123?_source=title,date&pretty'
  54. RESPONSE { "_index" : "blog", "_type" : "post", "_id" :

    "123", "_version" : 1, "found" : true, "_source": { "title": "My first blog entry", "date": "2014-01-01" } } 54
  55. CHECKING WHETHER A DOCUMENT EXISTS HEAD /{index}/{type}/{id} 55

  56. CHECKING WHETHER A DOCUMENT EXISTS curl -i -X HEAD localhost:9200/blog/post/123
  57. RESPONSE HTTP/1.1 200 OK Content-Length: 0 57

  58. CHECKING WHETHER A DOCUMENT EXISTS curl -i -X HEAD localhost:9200/blog/post/666
  59. RESPONSE HTTP/1.1 404 Not Found Content-Length: 0 59

  60. UPDATING A WHOLE DOCUMENT PUT /{index}/{type}/{id} 60

  61. UPDATING A WHOLE DOCUMENT curl -X PUT -d @part-1/updated-blog-post.json localhost:9200/blog/post/123?pretty

    61
  62. REQUEST { "title": "My first blog post", "text": "I am

    starting to get the hang of this...", "date": "2014-01-02" } 62
  63. RESPONSE { "_index" : "blog", "_type" : "post", "_id" :

    "123", "_version" : 2, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 }, "created" : false } 63
  64. DELETING A DOCUMENT DELETE /{index}/{type}/{id} 64

  65. DELETING A DOCUMENT curl -X DELETE localhost:9200/blog/post/123?pretty 65

  66. RESPONSE { "found" : true, "_index" : "blog", "_type" :

    "post", "_id" : "123", "_version" : 3, "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 } } 66
  67. DEALING WITH CONFLICTS 67

  68. PESSIMISTIC CONCURRENCY CONTROL ▫︎Used by relational databases ▫︎Assumes conflicts are

    likely to happen (pessimist) ▫︎Blocks access to resources 68
  69. OPTIMISTIC CONCURRENCY CONTROL ▫︎Assumes conflicts are unlikely to happen (optimist)

    ▫︎Does not block operations ▫︎If conflict happens, update fails 69
  70. HOW ELASTICSEARCH DEALS WITH CONFLICTS ▫︎Locking distributed resources would be

    very inefficient ▫︎Uses Optimistic Concurrency Control ▫︎Auto-increments _version number 70
  71. HOW ELASTICSEARCH DEALS WITH CONFLICTS ▫︎PUT /blog/post/123?version=1 ▫︎If version is

    outdated returns 409 Conflict 71
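
The version check can be pictured with a toy in-memory store. This is only a sketch of the optimistic concurrency idea, not how Elasticsearch is implemented; `VersionedStore` and `VersionConflict` are hypothetical names, with the exception standing in for the 409 Conflict response.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 Conflict returned on a stale version."""


class VersionedStore:
    """Toy document store that auto-increments a version on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def put(self, doc_id, document, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Nothing is locked: the write simply fails if the caller's view is stale.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)
        return new_version


store = VersionedStore()
store.put("123", {"title": "My first blog post"})           # becomes version 1
store.put("123", {"title": "Updated"}, expected_version=1)  # becomes version 2
try:
    store.put("123", {"title": "Stale write"}, expected_version=1)
except VersionConflict as error:
    print("conflict:", error)
```

The caller that loses the race is expected to re-read the document and retry.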
  72. 1-3 SEARCH EXAMPLES 72

  73. EMPLOYEE DIRECTORY EXAMPLE ▫︎Index: megacorp ▫︎Type: employee ▫︎Ex: John Smith,

    Jane Smith, Douglas Fir 73
  74. EMPLOYEE DIRECTORY EXAMPLE curl -X PUT -d @part-1/john-smith.json localhost:9200/megacorp/employee/1 74

  75. REQUEST { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I

    love to go rock climbing", "interests": ["sports", "music"] } 75
  76. EMPLOYEE DIRECTORY EXAMPLE curl -X PUT -d @part-1/jane-smith.json localhost:9200/megacorp/employee/2 76

  77. REQUEST { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I

    like to collect rock albums", "interests": ["music"] } 77
  78. EMPLOYEE DIRECTORY EXAMPLE curl -X PUT -d @part-1/douglas-fir.json localhost:9200/megacorp/employee/3 78

  79. REQUEST { "first_name": "Douglas", "last_name": "Fir", "age": 35, "about": "I

    like to build cabinets", "interests": ["forestry"] } 79
  80. SEARCHES ALL EMPLOYEES GET /megacorp/employee/_search 80

  81. SEARCHES ALL EMPLOYEES curl -X GET localhost:9200/megacorp/employee/_search?pretty

  82. SEARCH WITH QUERY-STRING GET /megacorp/employee/_search?q=last_name:Smith

  83. SEARCH WITH QUERY-STRING curl -X GET 'localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty'

  84. RESPONSE "hits" : { "total" : 2, "max_score" : 0.30685282,

    "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", … } }, { … "_score" : 0.30685282, "_source": { "first_name": "John", "last_name": "Smith", … } } ] } 84
  85. SEARCH WITH QUERY DSL curl -X GET -d @part-1/last-name-query.json localhost:9200/megacorp/employee/_search?pretty
  86. REQUEST { "query": { "match": { "last_name": "Smith" } }

    } 86
  87. RESPONSE "hits" : { "total" : 2, "max_score" : 0.30685282,

    "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", … } }, { … "_score" : 0.30685282, "_source": { "first_name": "John", "last_name": "Smith", … } } ] } 87
  88. SEARCH WITH QUERY DSL AND FILTER curl -X GET -d @part-1/last-name-age-query.json localhost:9200/megacorp/employee/_search?pretty

  89. REQUEST { "query": { "filtered": { "filter": { "range": { "age": { "gt": 30 } } }, "query": { "match": { "last_name": "Smith" } } } } }

  90. RESPONSE "hits" : { "total" : 1, "max_score" : 0.30685282, "hits" : [ { … "_score" : 0.30685282, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, … } } ] }
  91. FULL-TEXT SEARCH curl -X GET -d @part-1/full-text-search.json localhost:9200/megacorp/employee/_search?pretty

  92. REQUEST { "query": { "match": { "about": "rock climbing" }

    } } 92
  93. RESPONSE "hits" : [{ … "_score" : 0.16273327, "_source": {

    "first_name": "John", "last_name": "Smith", "about": "I love to go rock climbing", … } }, { … "_score" : 0.016878016, "_source": { "first_name": "Jane", "last_name": "Smith", "about": "I like to collect rock albums", … } }] 93
  94. RELEVANCE SCORES ▫︎The _score field ranks search results ▫︎The higher the score, the better

  95. PHRASE SEARCH curl -X GET -d @part-1/phrase-search.json localhost:9200/megacorp/employee/_search?pretty

  96. REQUEST { "query": { "match_phrase": { "about": "rock climbing" }

    } } 96
  97. RESPONSE "hits" : { "total" : 1, "max_score" : 0.23013961,

    "hits" : [ { … "_score" : 0.23013961, "_source": { "first_name": "John", "last_name": "Smith", "about": "I love to go rock climbing" … } } ] } 97
  98. 1-4 DATA RESILIENCY 98

  99. CALL ME MAYBE ▫︎Jepsen Tests ▫︎Simulates network partition scenarios ▫︎Run

    several operations against a distributed system ▫︎Verify that the history of those operations makes sense 99
  100. NETWORK PARTITION 100

  101. ELASTICSEARCH STATUS ▫︎Risk of data loss on network partition and

    split-brain scenarios 101
  102. IT IS NOT SO BAD… ▫︎Still much more resilient than

    MongoDB ▫︎Elastic is working hard to improve it ▫︎Two-phase commits are planned 102
  103. IF YOU REALLY CARE ABOUT YOUR DATA ▫︎Use a more

    reliable primary data store: ▫︎Cassandra ▫︎Postgres ▫︎Synchronize it to Elasticsearch ▫︎…or set-up comprehensive back-up 103
  104. There’s no such thing as a 100% reliable distributed system

    104
  105. 1-5 SOLR COMPARISON 105

  106. SOLR ▫︎SolrCloud ▫︎Both: ▫︎Are open-source and mature ▫︎Are based on

    Apache Lucene ▫︎Have more or less similar features 106
  107. SOLR API ▫︎HTTP GET ▫︎Query parameters passed in as URL

    parameters ▫︎Is not RESTful ▫︎Multiple formats (JSON, XML…) 107
  108. SOLR API ▫︎Version 4.4 added Schemaless API ▫︎Older versions require

    up-front Schema 108
  109. ELASTICSEARCH API ▫︎RESTful ▫︎Schemaless ▫︎CRUD document operations ▫︎Manage indices, read

    metrics, etc… 109
  110. ELASTICSEARCH API ▫︎Query DSL ▫︎Better readability ▫︎JSON-only 110

  111. SEARCH ▫︎Both are very good with text search ▫︎Both based

    on Apache Lucene 111
  112. EASE OF USE ▫︎Elasticsearch is simpler: ▫︎Just a single process ▫︎Easier API ▫︎SolrCloud requires Apache ZooKeeper
  113. SOLRCLOUD DATA RESILIENCY ▫︎SolrCloud uses Apache ZooKeeper to discover nodes

    ▫︎Better at preventing split-brain conditions ▫︎Jepsen Tests pass 113
  114. ANALYTICS ▫︎Elasticsearch is the choice for analytics: ▫︎Comprehensive aggregations ▫︎Thousands

    of metrics ▫︎SolrCloud is not even close 114
  115. PART 2 Search and Analytics 115

  116. 2-1 SEARCH Finding the needle in the haystack 116

  117. TWEETS EXAMPLE ▫︎/<country_code>/user ▫︎/<country_code>/tweet 117

  118. TWEETS EXAMPLE /us/user/1 { "email": "[email protected]", "name": "John Smith", "username":

    "@john" } 118
  119. TWEETS EXAMPLE /gb/user/2 { "email": "[email protected]", "name": "Mary Jones", "username":

    "@mary" } 119
  120. TWEET EXAMPLE /gb/tweet/3 { "date": "2014-09-13", "name": "Mary Jones", "tweet":

    "Elasticsearch means full text search has never been so easy", "user_id": 2 } 120
  121. TWEETS EXAMPLE ./part-2/load-tweet-data.sh 121

  122. THE EMPTY SEARCH GET /_search ▫︎Returns all documents in all indices
  123. THE EMPTY SEARCH curl -X GET localhost:9200/_search?pretty 123

  124. THE EMPTY SEARCH "hits" : { "total" : 14, "hits"

    : [ { "_index": "us", "_type": "tweet", "_id": "7", "_score": 1, "_source": { "date": "2014-09-17", "name": "John Smith", "tweet": "The Query DSL is really powerful and flexible", "user_id": 2 } }, … 9 RESULTS REMOVED … ] } 124
  125. MULTI-INDEX, MULTITYPE SEARCH ▫︎/_search ▫︎/gb/_search ▫︎/gb,us/_search ▫︎/gb/user/_search ▫︎/_all/user,tweet/_search 125

  126. PAGINATION ▫︎Returns 10 results per request (default) ▫︎Control parameters: ▫︎size:

    number of results to return ▫︎from: number of results to skip 126
  127. PAGINATION ▫︎GET /_search?size=5 ▫︎GET /_search?size=5&from=5 ▫︎GET /_search?size=5&from=10 127
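
The three requests above are pages 1, 2 and 3 of five results each. A small sketch of that arithmetic follows; `page_params` is a hypothetical helper, not an Elasticsearch API.

```python
def page_params(page, size=10):
    """Translate a 1-based page number into the size/from search parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    # `from` is the number of results to skip before the requested page.
    return {"size": size, "from": (page - 1) * size}


for page in (1, 2, 3):
    params = page_params(page, size=5)
    print(f"GET /_search?size={params['size']}&from={params['from']}")
```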

  128. TYPES OF SEARCH ▫︎Structured query on concrete fields (similar to

    SQL) ▫︎Full-text query (sorts results by relevance) ▫︎Combination of the two 128
  129. SEARCH BY EXACT VALUES ▫︎Examples: ▫︎date ▫︎user ID ▫︎username ▫︎“Does

    this document match the query?” 129
  130. SEARCH BY EXACT VALUES ▫︎SQL queries: SELECT * FROM user WHERE name = "John Smith" AND user_id = 2 AND date > "2014-09-15"
  131. FULL-TEXT SEARCH ▫︎Examples: ▫︎the text of a tweet ▫︎body of

    an email ▫︎“How well does this document match the query?” 131
  132. FULL-TEXT SEARCH ▫︎UK should also match United Kingdom ▫︎jump should

    also match jumped, jumps, jumping and leap 132
  133. FULL-TEXT SEARCH ▫︎fox news hunting should return stories about hunting

    on Fox News ▫︎fox hunting news should return news stories about fox hunting 133
  134. HOW ELASTICSEARCH PERFORMS TEXT SEARCH ▫︎Analyzes the text ▫︎Tokenizes into

    terms ▫︎Normalizes the terms ▫︎Builds an inverted index 134
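  135. LIST OF INDEXED DOCUMENTS ▫︎ID 1: Baseball is played during summer months. ▫︎ID 2: Summer is the time for picnics here. ▫︎ID 3: Months later we found out why. ▫︎ID 4: Why is summer so hot here.

  136. INVERTED INDEX ▫︎baseball: frequency 1 (document 1) ▫︎during: frequency 1 (document 1) ▫︎found: frequency 1 (document 3) ▫︎here: frequency 2 (documents 2, 4) ▫︎hot: frequency 1 (document 4) ▫︎is: frequency 3 (documents 1, 2, 4) ▫︎months: frequency 2 (documents 1, 3) ▫︎summer: frequency 3 (documents 1, 2, 4) ▫︎the: frequency 1 (document 2) ▫︎why: frequency 2 (documents 3, 4)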
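
A rough Python sketch of how such an index could be built from the four documents above. The tokenizer here is a naive assumption (lowercase, split on non-alphanumerics), far simpler than a real Elasticsearch analyzer, and it indexes every term rather than the subset shown on the slide.

```python
import re
from collections import defaultdict

DOCUMENTS = {
    1: "Baseball is played during summer months.",
    2: "Summer is the time for picnics here.",
    3: "Months later we found out why.",
    4: "Why is summer so hot here.",
}


def tokenize(text):
    """Naive analysis: lowercase the text and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


index = build_inverted_index(DOCUMENTS)
for term in ("summer", "here", "why"):
    print(term, len(index[term]), index[term])
```

Looking up a term is then a single dictionary access, which is what makes full-text search over many documents cheap.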
  137. QUERY DSL GET /_search { "query": YOUR_QUERY_HERE }

  138. QUERY BY FIELD { "match": { "tweet": "elasticsearch" } }

  139. QUERY BY FIELD curl -X GET -d @part-2/elasticsearch-tweets-query.json localhost:9200/_all/tweet/_search

  140. QUERY WITH MULTIPLE CLAUSES { "bool": { "must": { "match": { "tweet": "elasticsearch" } }, "must_not": { "match": { "name": "mary" } }, "should": { "match": { "tweet": "full text" } } } }
  141. QUERY WITH MULTIPLE CLAUSES curl -X GET -d @part-2/combining-tweet-queries.json localhost:9200/_all/tweet/_search

    141
  142. QUERY WITH MULTIPLE CLAUSES "_score": 0.07082729, "_source": { … "name": "John Smith", "tweet": "The Elasticsearch API is really easy to use" }, … "_score": 0.049890988, "_source": { … "name": "John Smith", "tweet": "Elasticsearch surely is one of the hottest new NoSQL products" }, … "_score": 0.03991279, "_source": { … "name": "John Smith", "tweet": "Elasticsearch and I have left the honeymoon stage, and I still love her." }
  143. MOST IMPORTANT QUERIES ▫︎match ▫︎match_all ▫︎multi_match ▫︎bool 143

  144. QUERIES VS. FILTERS ▫︎Queries: ▫︎full-text ▫︎“how well does the document

    match?” ▫︎Filters: ▫︎exact values ▫︎yes-no questions 144
  145. QUERIES VS. FILTERS ▫︎The goal of filters is to reduce

    the number of documents that have to be examined by a query 145
  146. PERFORMANCE COMPARISON ▫︎Filters are easy to cache and can be

    reused efficiently ▫︎Queries are heavier and non-cacheable 146
  147. WHEN TO USE WHICH ▫︎Use queries only for full-text search

    ▫︎Use filters for anything else 147
  148. FILTER BY EXACT FIELD VALUES "filtered": { "filter": { "term": { "user_id": 1 } } }

  149. FILTER BY EXACT FIELD VALUES curl -X GET -d @part-2/user-id-filter.json localhost:9200/_search

  150. FILTER BY EXACT FIELD VALUES "filtered": { "filter": { "range": { "date": { "gte": "2014-09-20" } } } }

  151. FILTER BY EXACT FIELD VALUES curl -X GET -d @part-2/date-filter.json localhost:9200/_search
  152. MOST IMPORTANT FILTERS ▫︎term ▫︎terms ▫︎range ▫︎exists and missing ▫︎bool

    152
  153. COMBINING QUERIES WITH FILTERS "filtered": { "query": { "match": { "tweet": "elasticsearch" } }, "filter": { "term": { "user_id": 1 } } }

  154. COMBINING QUERIES WITH FILTERS curl -X GET -d @part-2/filtered-tweet-query.json localhost:9200/_search
  155. SORTING ▫︎Relevance score ▫︎The higher the score, the better ▫︎By

    default, results are returned in descending order of relevance ▫︎You can sort by any field 155
  156. RELEVANCE SCORE ▫︎Similarity algorithm ▫︎Term Frequency / Inverse Document Frequency

    (TF/IDF) 156
  157. RELEVANCE SCORE ▫︎Term frequency ▫︎How often does the term appear

    in the field? ▫︎The more often, the more relevant 157
  158. RELEVANCE SCORE ▫︎Inverse document frequency ▫︎How often does each term

    appear in the index? ▫︎The more often, the less relevant 158
  159. RELEVANCE SCORE ▫︎Field-length norm ▫︎How long is the field? ▫︎The

    longer it is, the less likely it is that words in the field will be relevant 159
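
The three factors can be combined in a toy scorer. The sketch below uses the square-root term frequency, the logarithmic inverse document frequency and the inverse-square-root length norm familiar from Lucene's classic similarity. It is deliberately simplified (no query normalization, coordination factor or boosts), so its numbers will not match the _score values shown earlier. The sample texts are the three employee "about" fields from Part 1.

```python
import math
import re

ABOUT_FIELDS = [
    "I love to go rock climbing",     # John Smith
    "I like to collect rock albums",  # Jane Smith
    "I like to build cabinets",       # Douglas Fir
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, field_text, all_field_texts):
    """Simplified TF/IDF score of one field value against a query."""
    terms = tokenize(field_text)
    if not terms:
        return 0.0
    # Field-length norm: the longer the field, the less each match counts.
    field_length_norm = 1.0 / math.sqrt(len(terms))
    total_docs = len(all_field_texts)
    total = 0.0
    for term in tokenize(query):
        # Term frequency: more occurrences in this field, more relevant.
        term_frequency = math.sqrt(terms.count(term))
        # Inverse document frequency: common terms across the index count less.
        doc_frequency = sum(1 for text in all_field_texts if term in tokenize(text))
        inverse_doc_frequency = 1.0 + math.log(total_docs / (doc_frequency + 1))
        total += term_frequency * inverse_doc_frequency * field_length_norm
    return total


for text in ABOUT_FIELDS:
    print(round(score("rock climbing", text, ABOUT_FIELDS), 3), text)
```

The ranking matches the full-text search example: the field containing both "rock" and "climbing" comes first, the one with only "rock" second.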
  160. 2-2 ANALYTICS How many needles are in the haystack? 160

  161. SEARCH ▫︎Just looks for the needle in the haystack 161

  162. BUSINESS QUESTIONS ▫︎How many needles are in the haystack? ▫︎What is the average length of the needles? ▫︎What is the median length of the needles, by manufacturer? ▫︎How many needles were added to the haystack each month?

  163. BUSINESS QUESTIONS ▫︎What are your most popular needle manufacturers? ▫︎Are there any anomalous clumps of needles?
  164. AGGREGATIONS ▫︎Answer Analytics questions ▫︎Can be combined with Search ▫︎Near

    real-time in Elasticsearch ▫︎SQL queries can take days 164
  165. AGGREGATIONS Buckets + Metrics 165

  166. BUCKETS ▫︎Collection of documents that meet certain criteria ▫︎Can be nested inside other buckets

  167. BUCKETS ▫︎Employee ⇒ male or female bucket ▫︎San Francisco ⇒ California bucket ▫︎2014-10-28 ⇒ October bucket
  168. METRICS ▫︎Calculations on top of buckets ▫︎Answer the questions ▫︎Ex:

    min, max, mean, sum… 168
  169. EXAMPLE ▫︎Partition by country (bucket) ▫︎…then partition by gender (bucket)

    ▫︎…then partition by age ranges (bucket) ▫︎…calculate the average salary for each age range (metric) 169
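
Buckets plus metrics boil down to "group, then calculate". Here is a minimal in-memory sketch of one bucket level with one metric. The `SAMPLE` records are made up for illustration (they are not the workshop dataset), and `terms_with_avg` is a hypothetical helper rather than an Elasticsearch call.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records, for illustration only.
SAMPLE = [
    {"color": "red", "price": 10000},
    {"color": "red", "price": 20000},
    {"color": "blue", "price": 15000},
]


def terms_with_avg(docs, bucket_field, metric_field):
    """Bucket documents by one field, then average another field per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(values), "avg": mean(values)}
        for key, values in buckets.items()
    }


print(terms_with_avg(SAMPLE, "color", "price"))
```

Nesting works the same way: each bucket's documents can be bucketed again before the metric is computed.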
  170. CAR TRANSACTIONS EXAMPLE ▫︎/cars/transactions 170

  171. CAR TRANSACTIONS EXAMPLE /cars/transactions/ AVFr1xbVmdUYWpF46Ps4 { "price" : 10000, "color"

    : "red", "make" : "honda", "sold" : "2014-10-28" } 171
  172. CAR TRANSACTIONS EXAMPLE ./part-2/load-car-data.sh 172

  173. BEST SELLING CAR COLOR { "aggs": { "colors": { "terms": { "field": "color" } } } }

  174. BEST SELLING CAR COLOR curl -X GET -d @part-2/best-selling-car-color.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  175. BEST SELLING CAR COLOR "colors" : { "buckets" : [{ "key" : "red", "doc_count" : 16 }, { "key" : "blue", "doc_count" : 8 }, { "key" : "green", "doc_count" : 8 }] }
  176. AVERAGE CAR COLOR PRICE { "aggs": { "colors": { "terms": { "field": "color" }, "aggs": { "avg_price": { "avg": { "field": "price" } } } } } }

  177. AVERAGE CAR COLOR PRICE curl -X GET -d @part-2/average-car-color-price.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  178. AVERAGE CAR COLOR PRICE "colors" : { "buckets": [{ "key": "red", "doc_count": 16, "avg_price": { "value": 32500.0 } }, { "key": "blue", "doc_count": 8, "avg_price": { "value": 20000.0 } }, { "key": "green", "doc_count": 8, "avg_price": { "value": 21000.0 } }] }
  179. BUILDING BAR CHARTS ▫︎Very easy to convert aggregations to charts

    and graphs ▫︎Ex: histograms and time-series 179
  180. CAR SALES REVENUE HISTOGRAM { "aggs": { "price": { "histogram": { "field": "price", "interval": 20000 }, "aggs": { "revenue": {"sum": {"field" : "price"}} } } } }

  181. CAR SALES REVENUE HISTOGRAM curl -X GET -d @part-2/car-revenue-histogram.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  182. CAR SALES REVENUE HISTOGRAM "price" : { "buckets": [ { "key": 0, "doc_count": 12, "revenue": {"value": 148000.0} }, { "key": 20000, "doc_count": 16, "revenue": {"value": 380000.0} }, { "key": 40000, "doc_count": 0, "revenue": {"value": 0.0} }, { "key": 60000, "doc_count": 0, "revenue": {"value": 0.0} }, { "key": 80000, "doc_count": 4, "revenue": {"value" : 320000.0} } ]}
  183. CAR SALES REVENUE HISTOGRAM 183
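
The histogram buckets can be reproduced by hand: each price is rounded down to a multiple of the interval, and empty intervals in between are still reported, as in the response above. A sketch under assumptions: integer prices, a hypothetical three-sale sample, and a made-up helper name `revenue_histogram`.

```python
def revenue_histogram(docs, field, interval):
    """Bucket documents by fixed-width ranges of `field` and sum it per bucket."""
    counts, sums = {}, {}
    for doc in docs:
        # The bucket key is the value rounded down to a multiple of the interval.
        key = (doc[field] // interval) * interval
        counts[key] = counts.get(key, 0) + 1
        sums[key] = sums.get(key, 0) + doc[field]
    if not counts:
        return []
    # Emit every interval between the lowest and highest key, including empty ones.
    return [
        {"key": key, "doc_count": counts.get(key, 0), "revenue": sums.get(key, 0)}
        for key in range(min(counts), max(counts) + interval, interval)
    ]


# Hypothetical sales, for illustration only.
SALES = [{"price": 10000}, {"price": 25000}, {"price": 80000}]
for bucket in revenue_histogram(SALES, "price", 20000):
    print(bucket)
```

Each resulting bucket maps directly onto one bar of a chart.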

  184. TIME-SERIES DATA ▫︎Data with a timestamp: ▫︎How many cars sold

    each month this year? ▫︎What was the price of this stock for the last 12 hours? ▫︎What was the average latency of our website every hour in the last week? 184
  185. HOW MANY CARS SOLD PER MONTH? { "aggs": { "sales": { "date_histogram": { "field": "sold", "interval": "month", "format": "yyyy-MM-dd" } } } }

  186. HOW MANY CARS SOLD PER MONTH? curl -X GET -d @part-2/car-sales-per-month.json 'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  187. HOW MANY CARS SOLD PER MONTH? "sales" : { "buckets" : [ {"key_as_string": "2014-01-01", "doc_count": 4}, {"key_as_string": "2014-02-01", "doc_count": 4}, {"key_as_string": "2014-03-01", "doc_count": 0}, {"key_as_string": "2014-04-01", "doc_count": 0}, {"key_as_string": "2014-05-01", "doc_count": 4}, {"key_as_string": "2014-06-01", "doc_count": 0}, {"key_as_string": "2014-07-01", "doc_count": 4}, {"key_as_string": "2014-08-01", "doc_count": 4}, {"key_as_string": "2014-09-01", "doc_count": 0}, {"key_as_string": "2014-10-01", "doc_count": 4}, {"key_as_string": "2014-11-01", "doc_count": 8} ] }
  188. HOW MANY CARS SOLD PER MONTH? 188

  189. PART 3 Dealing with human language 189

  190. 3-1 INVERTED INDEX 190

  191. INVERTED INDEX ▫︎Data structure ▫︎Efficient full-text search 191

  192. EXAMPLE ▫︎Document 1: The quick brown fox jumped over the lazy dog ▫︎Document 2: Quick brown foxes leap over lazy dogs in summer

  193. TOKENIZATION ▫︎Document 1: ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"] ▫︎Document 2: ["Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"]
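  194. INVERTED INDEX (term: documents containing it) ▫︎Quick: 2 ▫︎The: 1 ▫︎brown: 1, 2 ▫︎dog: 1 ▫︎dogs: 2 ▫︎fox: 1 ▫︎foxes: 2 ▫︎in: 2 ▫︎jumped: 1 ▫︎lazy: 1, 2 ▫︎leap: 2 ▫︎over: 1, 2 ▫︎quick: 1 ▫︎summer: 2 ▫︎the: 1

  195. EXAMPLE ▫︎Searching for “quick brown” ▫︎Naive similarity algorithm: count the matching terms ▫︎brown: Document 1, Document 2 ▫︎quick: Document 1 ▫︎Total: 2 for Document 1, 1 for Document 2 ▫︎Document 1 is a better match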
  196. A FEW PROBLEMS ▫︎Quick and quick are the same word

    ▫︎fox and foxes are pretty similar ▫︎jumped and leap are synonyms 196
  197. NORMALIZATION ▫︎Quick lowercased to quick ▫︎foxes stemmed to fox ▫︎jumped

    and leap replaced by jump 197
  198. BETTER INVERTED INDEX (term: documents containing it) ▫︎brown: 1, 2 ▫︎dog: 1, 2 ▫︎fox: 1, 2 ▫︎in: 2 ▫︎jump: 1, 2 ▫︎lazy: 1, 2 ▫︎over: 1, 2 ▫︎quick: 1, 2 ▫︎summer: 2 ▫︎the: 1
  199. SEARCH INPUT ▫︎You can only find terms that exist in

    the inverted index ▫︎The query string is also normalized 199
  200. 3-2 ANALYZERS 200

  201. ANALYSIS ▫︎Tokenizes a block of text into terms ▫︎Normalizes terms

    to standard form ▫︎Improves searchability 201
  202. ANALYZERS ▫︎Pipeline: ▫︎Character filters ▫︎Tokenizer ▫︎Token filters 202
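
The three pipeline stages can be mimicked in a few lines of Python. This is a toy sketch, not an Elasticsearch analyzer: the stop list and the `STEMS` lookup table are assumptions chosen to reproduce the normalization example from section 3-1, where a real analyzer would use a stemming algorithm and a synonym filter.

```python
import re

STOPWORDS = {"a", "an", "the"}  # toy stop list (assumption)
# Toy lookup standing in for stemming and synonym handling (assumption).
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def char_filter(text):
    """Stage 1: tidy up the raw string before tokenization."""
    return text.replace("&", " and ")


def tokenizer(text):
    """Stage 2: split the string into individual tokens."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Stage 3: lowercase, drop stopwords, then normalize each token."""
    lowered = (token.lower() for token in tokens)
    kept = (token for token in lowered if token not in STOPWORDS)
    return [STEMS.get(token, token) for token in kept]


def analyze(text):
    return token_filters(tokenizer(char_filter(text)))


print(analyze("The quick brown fox jumped over the lazy dog."))
print(analyze("Quick brown foxes leap over lazy dogs in summer"))
```

The same `analyze` function would be applied to the query string, so that search terms and indexed terms end up in the same normalized form.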

  203. BUILT-IN ANALYZERS ▫︎Standard analyzer ▫︎Language-specific analyzers ▫︎30+ languages supported 203

  204. TESTING THE STANDARD ANALYZER GET /_analyze?analyzer=standard The quick brown fox jumped over the lazy dog.

  205. TESTING THE STANDARD ANALYZER curl -X GET -d @part-3/quick-brown-fox.txt 'localhost:9200/_analyze?analyzer=standard&pretty'

  206. TESTING THE STANDARD ANALYZER "tokens" : [ {"token": "the", …}, {"token": "quick", …}, {"token": "brown", …}, {"token": "fox", …}, {"token": "jumped", …}, {"token": "over", …}, {"token": "the", …}, {"token": "lazy", …}, {"token": "dog", …} ]
  207. TESTING THE ENGLISH ANALYZER GET /_analyze?analyzer=english The quick brown fox jumped over the lazy dog.

  208. TESTING THE ENGLISH ANALYZER curl -X GET -d @part-3/quick-brown-fox.txt 'localhost:9200/_analyze?analyzer=english&pretty'

  209. TESTING THE ENGLISH ANALYZER "tokens" : [ {"token": "quick", …}, {"token": "brown", …}, {"token": "fox", …}, {"token": "jump", …}, {"token": "over", …}, {"token": "lazi", …}, {"token": "dog", …} ]
  210. TESTING THE BRAZILIAN ANALYZER GET /_analyze?analyzer=brazilian A rápida raposa marrom pulou sobre o cachorro preguiçoso.

  211. TESTING THE BRAZILIAN ANALYZER curl -X GET -d @part-3/raposa-rapida.txt 'localhost:9200/_analyze?analyzer=brazilian&pretty'

  212. TESTING THE BRAZILIAN ANALYZER "tokens" : [ {"token": "rap", …}, {"token": "rapos", …}, {"token": "marrom", …}, {"token": "pul", …}, {"token": "cachorr", …}, {"token": "preguic", …} ]
  213. STEMMERS ▫︎Algorithmic stemmers: ▫︎Faster ▫︎Less precise ▫︎Dictionary stemmers: ▫︎Slower ▫︎More

    precise 213
  214. 3-3 MAPPING 214

  215. MAPPING ▫︎Every document has a type ▫︎Every type has its

    own mapping ▫︎A mapping defines: ▫︎The fields ▫︎The datatype for each field 215
  216. MAPPING ▫︎Elasticsearch guesses the mapping when a new field is

    added ▫︎Should customize the mapping for improved search and performance ▫︎Must customize the mapping when type is created 216
  217. MAPPING ▫︎A field's mapping cannot be changed ▫︎You can still

    add new fields ▫︎Only option is to reindex all documents ▫︎Reindexing with zero-downtime: ▫︎index aliases 217
  218. CORE FIELD TYPES ▫︎String ▫︎Integer ▫︎Floating-point ▫︎Boolean ▫︎Date ▫︎Inner Objects

    218
  219. VIEWING THE MAPPING GET /{index}/_mapping/{type}

  220. VIEWING THE MAPPING curl -X GET 'localhost:9200/gb/_mapping/tweet?pretty'

  221. VIEWING THE MAPPING "date": { "type": "date", "format": "strict_date_optional_time…" }, "name": { "type": "string" }, "tweet": { "type": "string" }, "user_id": { "type": "long" }
  222. CUSTOMIZING FIELD MAPPINGS ▫︎Distinguish between: ▫︎Full-text string fields ▫︎Exact value

    string fields ▫︎Use language-specific analyzers 222
  223. STRING MAPPING ATTRIBUTES ▫︎index: ▫︎analyzed (full-text search, default) ▫︎not_analyzed (exact

    value) ▫︎analyzer: ▫︎standard (default) ▫︎english ▫︎… 223
  224. ADDING NEW SEARCHABLE FIELD PUT /gb,us/_mapping/tweet { "properties": { "description": { "type": "string", "index": "analyzed", "analyzer": "english" } } }

  225. ADDING NEW SEARCHABLE FIELD curl -X PUT -d @part-3/add-new-mapping.json 'localhost:9200/gb,us/_mapping/tweet?pretty'

  226. ADDING NEW SEARCHABLE FIELD curl -X GET 'localhost:9200/us,gb/_mapping/tweet?pretty'

  227. ADDING NEW SEARCHABLE FIELD … "description": { "type": "string", "analyzer": "english" } …
  228. 3-4 PROXIMITY MATCHING 228

  229. THE PROBLEM ▫︎Sue ate the alligator ▫︎The alligator ate Sue

    ▫︎Sue never goes anywhere without her alligator-skin purse 229
  230. THE PROBLEM ▫︎Search for “sue alligator” would match all three

    ▫︎Sue and alligator may be separated by paragraphs of other text 230
  231. HEURISTIC ▫︎Words that appear near each other are probably related

    ▫︎Give documents in which the words are close together a higher relevance score 231
  232. TERM POSITIONS GET /_analyze?analyzer=standard Quick brown fox.

  233. TERM POSITIONS "tokens": [ { "token": "quick", … "position": 1 }, { "token": "brown", … "position": 2 }, { "token": "fox", … "position": 3 } ]

  234. EXACT PHRASE MATCHING GET /{index}/{type}/_search { "query": { "match_phrase": { "title": "quick brown fox" } } }
  235. EXACT PHRASE MATCHING ▫︎quick, brown and fox must all appear ▫︎The position of brown must be 1 greater than the position of quick ▫︎The position of fox must be 2 greater than the position of quick

  236. FLEXIBLE PHRASE MATCHING ▫︎Exact phrase matching is too strict ▫︎“quick fox” should also match a document containing “quick brown fox” ▫︎Slop matching
  237. "query": { "match_phrase": { "title": { "query": "quick fox", "slop":

    1 } } } FLEXIBLE PHRASE MATCHING 237
  238. SLOP MATCHING ▫︎How many times you are allowed to move

    a term in order to make the query and document match? ▫︎Slop(n) 238
  239. SLOP MATCHING ▫︎Document: quick brown fox ▫︎Query: quick fox ▫︎Slop(1): quick ↳ fox

  240. SLOP MATCHING ▫︎Document: quick brown fox ▫︎Query: fox quick ▫︎Slop(1): fox|quick ↵ ▫︎Slop(2): quick ↳ fox ▫︎Slop(3): quick ↳ fox (one position further right)
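
One way to reason about the slop a query needs is to compare how far each term sits from its expected position. The sketch below is a simplified model, assuming every query term occurs exactly once in the document; it is not Lucene's actual scorer, and `required_slop` is a hypothetical helper. It reproduces the two cases above: 1 for "quick fox" and 3 for "fox quick".

```python
def required_slop(document_terms, query_terms):
    """Smallest slop at which the query phrase matches, or None if a term is missing.

    Simplification: each query term is assumed to occur once in the document.
    """
    positions = {term: index for index, term in enumerate(document_terms)}
    if any(term not in positions for term in query_terms):
        return None
    # How far each term sits from where the query expects it to be.
    offsets = [positions[term] - q for q, term in enumerate(query_terms)]
    # The spread of those offsets is the number of moves needed to line them up.
    return max(offsets) - min(offsets)


document = ["quick", "brown", "fox"]
for query in (["quick", "brown", "fox"], ["quick", "fox"], ["fox", "quick"]):
    print(" ".join(query), "->", required_slop(document, query))
```

A match_phrase query with "slop": n would then accept any document whose required slop is at most n.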
  241. 3-5 FUZZY MATCHING 241

  242. FUZZY MATCHING ▫︎quick brown fox → fast brown foxes ▫︎Johnny

    Walker → Johnnie Walker ▫︎Shcwarzenneger → Schwarzenegger 242
  243. DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎One-character edits: ▫︎Substitution ▫︎Insertion ▫︎Deletion ▫︎Transposition of

    two adjacent characters 243
  244. DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎One-character substitution: ▫︎ fox → box 244

  245. DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Insertion of a new character: ▫︎sic →

    sick 245
  246. DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Deletion of a character: ▫︎black → back

    246
  247. DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Transposition of two adjacent characters: ▫︎star →

    tsar 247
  248. DAMERAU-LEVENSHTEIN EDIT DISTANCE ▫︎Converting bieber into beaver 1. Substitute: bieber

    → biever 2. Substitute: biever → baever 3. Transpose: baever → beaver ▫︎Edit distance of 3 248
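The four one-character edits can be turned into a distance function with dynamic programming. The sketch below implements the restricted variant usually called optimal string alignment, which is an assumption about which flavour of Damerau-Levenshtein is meant; `edit_distance` is a hypothetical name and this is not the automaton-based implementation a search engine would use.

```python
def edit_distance(source, target):
    """Edit distance counting substitution, insertion, deletion and
    transposition of two adjacent characters as one edit each
    (optimal string alignment variant)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source
    for j in range(cols):
        dist[0][j] = j  # insert every character of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (
                i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

Each single-edit example from the previous slides (fox/box, sic/sick, black/back, star/tsar) yields 1, and bieber/beaver yields 3.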
  249. FUZZINESS ▫︎80% of human misspellings have an Edit Distance of

    1 ▫︎Elasticsearch supports a maximum Edit Distance of 2 ▫︎fuzziness parameter 249
  250. FUZZINESS EXAMPLE ./part-3/load-surprise-data.sh 250

  251. GET /example/surprise/_search { "query": { "match": { "text": { "query":

    "surprize" } } } } QUERY WITHOUT FUZZINESS 251
  252. QUERY WITHOUT FUZZINESS curl -X GET -d @part-3/surprize-query.json 'localhost:9200/example/surprise/_search?pretty'

    252
  253. "hits": { "total": 0, "max_score": null, "hits": [ ] }

    QUERY WITHOUT FUZZINESS 253
  254. GET /example/surprise/_search { "query": { "match": { "text": { "query":

    "surprize", "fuzziness": "1" } } } } QUERY WITH FUZZINESS 254
  255. QUERY WITH FUZZINESS curl -X GET -d @part-3/surprize-fuzzy-query.json

    'localhost:9200/example/surprise/_search?pretty' 255
  256. "hits": [ { "_index": "example", "_type": "surprise", "_id": "1", "_score":

    0.19178301, "_source":{ "text": "Surprise me!"} }] QUERY WITH FUZZINESS 256
  257. AUTO-FUZZINESS ▫︎0 for strings of one or two characters ▫︎1

    for strings of three, four or five characters ▫︎2 for strings of more than five characters 257
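The length rule above is easy to express as a function. This is a minimal Python sketch with a hypothetical name, `auto_fuzziness`; the thresholds are the ones listed on the slide and may differ in other Elasticsearch versions.

```python
def auto_fuzziness(term):
    """Maximum edit distance allowed for a term under the rule above:
    0 for 1-2 characters, 1 for 3-5 characters, 2 for longer terms."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2
```

Under this rule a short term such as "fox" tolerates a single edit, while "surprize" (eight characters) tolerates two.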
  258. PART 4 Data modeling 258

  259. 4-1 INSIDE A CLUSTER 259

  260. NODES AND CLUSTERS ▫︎A node is a machine running Elasticsearch

    ▫︎A cluster is a set of nodes in the same network and with the same cluster name 260
  261. SHARDS ▫︎A node stores data inside its shards ▫︎Shards are

    the smallest unit of scale and replication ▫︎Each shard is a completely independent Lucene index 261
  262. AN EMPTY CLUSTER 262

  263. GET /_cluster/health CLUSTER HEALTH 263

  264. "cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 0,

    "active_shards": 0, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0 CLUSTER HEALTH 264
  265. PUT /blogs "settings": { "number_of_shards": 3, "number_of_replicas": 1 } ADD

    AN INDEX 265
  266. ADD AN INDEX 266

  267. GET /_cluster/health CLUSTER HEALTH 267

  268. "cluster_name": "elasticsearch", "status": "yellow", "number_of_nodes": 1, "number_of_data_nodes": 1, "active_primary_shards": 3,

    "active_shards": 3, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 3 CLUSTER HEALTH 268
  269. ADD A BACKUP NODE 269
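A second node matters because a replica is never allocated on the node that already holds another copy of the same shard, so a single node leaves every replica unassigned. The following Python sketch models that rule for one index; `cluster_health` is a hypothetical helper that only imitates the counters of the health response and ignores relocation and initialization.

```python
def cluster_health(nodes, primaries, replicas_per_primary):
    """Rough model of shard allocation for a single index."""
    copies_wanted = 1 + replicas_per_primary
    # Each node can hold at most one copy of a given shard.
    copies_placed = min(copies_wanted, nodes)
    active_primary = primaries if nodes > 0 else 0
    active = primaries * copies_placed
    unassigned = primaries * (copies_wanted - copies_placed)
    if active_primary < primaries:
        status = "red"      # at least one primary shard is missing
    elif unassigned > 0:
        status = "yellow"   # all primaries active, some replicas are not
    else:
        status = "green"    # every primary and replica is active
    return {
        "status": status,
        "number_of_nodes": nodes,
        "active_primary_shards": active_primary,
        "active_shards": active,
        "unassigned_shards": unassigned,
    }
```

Calling `cluster_health(1, 3, 1)` gives a yellow status with 3 active and 3 unassigned shards, and `cluster_health(2, 3, 1)` gives green with 6 active shards.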

  270. GET /_cluster/health CLUSTER HEALTH 270

  271. "cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 2, "number_of_data_nodes": 2, "active_primary_shards": 3,

    "active_shards": 6, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0 CLUSTER HEALTH 271
  272. THREE NODES 272

  273. PUT /blogs/_settings { "number_of_replicas": 2 } INCREASING

    THE NUMBER OF REPLICAS 273
  274. INCREASING THE NUMBER OF REPLICAS 274

  275. NODE 1 FAILS 275

  276. CREATING, INDEXING AND DELETING A DOCUMENT 276

  277. RETRIEVING A DOCUMENT 277

  278. 4-2 RELATIONSHIPS 278

  279. RELATIONSHIPS MATTER ▫︎Blog Posts ↔ Comments ▫︎Bank Accounts ↔ Transactions

    ▫︎Orders ↔ Items ▫︎Directories ↔ Files ▫︎… 279
  280. SQL DATABASES ▫︎Entities have a unique primary key ▫︎Normalization: ▫︎Entity

    data is stored only once ▫︎Entities are referenced by primary key ▫︎Updates happen in only one place 280
  281. ▫︎Entities are joined at query time SQL DATABASES SELECT Customer.name,

    Order.status FROM Order, Customer WHERE Order.customer_id = Customer.id 281
  282. SQL DATABASES ▫︎Changes are ACID ▫︎Atomicity ▫︎Consistency ▫︎Isolation ▫︎Durability 282

  283. ATOMICITY ▫︎If one part of the transaction fails, the entire

    transaction fails ▫︎…even in the event of power failure, crashes or errors ▫︎“all or nothing” 283
  284. CONSISTENCY ▫︎Any transaction will bring the database from one valid

    state to another ▫︎State must be valid according to all defined rules: ▫︎Constraints ▫︎Cascades ▫︎Triggers 284
  285. ISOLATION ▫︎The concurrent execution of transactions results in the same

    state that would be obtained if transactions were executed serially ▫︎Concurrency Control 285
  286. DURABILITY ▫︎A transaction will remain committed ▫︎…even in the event

    of power failure, crashes or errors ▫︎Non-volatile memory 286
  287. SQL DATABASES ▫︎Joining entities at query time is expensive ▫︎Impractical

    with multiple nodes 287
  288. ELASTICSEARCH ▫︎Treats the world as flat ▫︎An index is a

    flat collection of independent documents ▫︎A single document should contain all information to match a search request 288
  289. ELASTICSEARCH ▫︎ACID support for changes on single documents ▫︎No ACID

    transactions on multiple documents 289
  290. ELASTICSEARCH ▫︎Indexing and searching are fast and lock-free ▫︎Massive amounts

    of data can be spread across multiple nodes 290
  291. ELASTICSEARCH ▫︎But we need relationships! 291

  292. ELASTICSEARCH ▫︎Application-side joins ▫︎Data denormalization ▫︎Nested objects ▫︎Parent/child relationships 292

  293. 4-3 APPLICATION-SIDE JOINS 293

  294. APPLICATION-SIDE JOINS ▫︎Emulates a relational database ▫︎Joins at application level

    ▫︎(index, type, id) = primary key 294
  295. PUT /example/user/1 { "name": "John Smith", "email": "[email protected]", "born": "1970-10-24"

    } EXAMPLE 295
  296. PUT /example/blogpost/2 { "title": "Relationships", "body": "It's complicated", "user": 1

    } EXAMPLE 296
  297. EXAMPLE ▫︎(example, user, 1) = primary key ▫︎Store only the

    id ▫︎Index and type are hard-coded into the application logic 297
  298. GET /example/blogpost/_search "query": { "filtered": { "filter": { "term": {

    "user": 1 } } } } EXAMPLE 298
  299. EXAMPLE ▫︎Blogposts written by “John”: ▫︎Find ids of users with

    name “John” ▫︎Find blogposts that match the user ids 299
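The two steps can be sketched in plain Python with in-memory lists standing in for the `user` and `blogpost` types. Everything here is illustrative: the helper names are hypothetical, and all sample documents other than John Smith and the "Relationships" post are made up for the example.

```python
# In-memory stand-ins for /example/user and /example/blogpost.
USERS = [
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "Alice White"},   # hypothetical sample data
    {"id": 3, "name": "John Doe"},      # hypothetical sample data
]
BLOGPOSTS = [
    {"id": 2, "title": "Relationships", "user": 1},
    {"id": 3, "title": "Nest eggs", "user": 2},      # hypothetical
    {"id": 4, "title": "Flat worlds", "user": 3},    # hypothetical
]


def find_user_ids(name):
    """Step 1: stands in for the match query on the user name."""
    wanted = name.lower()
    return [u["id"] for u in USERS if wanted in u["name"].lower().split()]


def find_blogposts(user_ids):
    """Step 2: stands in for filtering blogposts by the user field."""
    ids = set(user_ids)
    return [post for post in BLOGPOSTS if post["user"] in ids]


def blogposts_written_by(name):
    """Application-side join: the result of step 1 feeds step 2."""
    return find_blogposts(find_user_ids(name))
```

The join lives entirely in `blogposts_written_by`: the application, not the search engine, carries the ids from the first result set into the second request.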
  300. GET /example/user/_search "query": { "match": { "name": "John" } }

    EXAMPLE 300
  301. ▫︎For each user id from the first query: GET /example/blogpost/_search

    "query": { "filtered": { "filter": { "term": { "user": <ID> } } } } EXAMPLE 301
  302. ADVANTAGES ▫︎Data is normalized ▫︎Change user data in just one

    place 302
  303. DISADVANTAGES ▫︎Run extra queries to join documents ▫︎We could have

    millions of users named “John” ▫︎Less efficient than SQL joins: ▫︎Several API requests ▫︎Harder to optimize 303
  304. WHEN TO USE ▫︎First entity has a small number of

    documents and they hardly change ▫︎First query results can be cached 304
  305. 4-4 DATA DENORMALIZATION 305

  306. DATA DENORMALIZATION ▫︎No joins ▫︎Store redundant copies of the data

    you need to query 306
  307. PUT /example/user/1 { "name": "John Smith", "email": "[email protected]", "born": "1970-10-24"

    } EXAMPLE 307
  308. PUT /example/blogpost/2 { "title": "Relationships", "body": "It's complicated", "user": {

    "id": 1, "name": "John Smith" } } EXAMPLE 308
  309. GET /example/blogpost/_search "query": { "bool": { "must": [ { "match":

    { "title": "relationships" }}, { "match": { "user.name": "John" }} ]}} EXAMPLE 309
  310. ADVANTAGES ▫︎Speed ▫︎No need for expensive joins 310

  311. DISADVANTAGES ▫︎Uses more disk space (cheap) ▫︎Update the same data

    in several places ▫︎scroll and bulk APIs can help ▫︎Concurrency issues ▫︎Locking can help 311
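The cost of redundant copies shows up when the source data changes. The Python sketch below uses in-memory lists and a hypothetical helper, `rename_user`, to show the fan-out; against a real cluster the loop over blogposts would typically be a scroll over matching documents followed by bulk updates.

```python
def rename_user(users, blogposts, user_id, new_name):
    """Update the user document and every blogpost that embeds a copy
    of the user's name. Returns the number of documents touched."""
    touched = 0
    for user in users:
        if user["id"] == user_id:
            user["name"] = new_name
            touched += 1
    for post in blogposts:
        # Each blogpost carries its own redundant copy of the name.
        if post["user"]["id"] == user_id:
            post["user"]["name"] = new_name
            touched += 1
    return touched
```

One logical change touches one user document plus one document per blogpost by that user, and a search running in the middle of the loop can see a mix of old and new names.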
  312. WHEN TO USE ▫︎Need for fast search ▫︎Denormalized data does

    not change very often 312
  313. 4-5 NESTED OBJECTS 313

  314. MOTIVATION ▫︎Elasticsearch supports ACID when updating single documents ▫︎Querying related

    data in the same document is faster (no joins) ▫︎We want to avoid denormalization 314
  315. PUT /example/blogpost/1 { "title": "Nest eggs", "body": "Making money...", "tags":

    [ "cash", "shares" ], "comments": […] } THE PROBLEM WITH MULTILEVEL OBJECTS 315
  316. [{ "name": "John Smith", "comment": "Great article", "age": 28, "stars":

    4, "date": "2014-09-01" }, { "name": "Alice White", "comment": "More like this", "age": 31,"stars": 5, "date": "2014-10-22" }] THE PROBLEM WITH MULTILEVEL OBJECTS 316
  317. GET /example/blogpost/_search "query": { "bool": { "must": [ {"match": {"name":

    "Alice"}}, {"match": {"age": "28"}} ]}} THE PROBLEM WITH MULTILEVEL OBJECTS 317
  318. [{ "name": "John Smith", "comment": "Great article", "age": 28, "stars":

    4, "date": "2014-09-01" }, { "name": "Alice White", "comment": "More like this", "age": 31,"stars": 5, "date": "2014-10-22" }] THE PROBLEM WITH MULTILEVEL OBJECTS 318
  319. THE PROBLEM WITH MULTILEVEL OBJECTS ▫︎Alice is 31, not 28!

    ▫︎It matched the age of John ▫︎This is because indexed documents are stored as a flattened dictionary ▫︎The correlation between Alice and 31 is irretrievably lost 319
  320. {"title": [eggs, nest], "body": [making, money], "tags": [cash, shares], "comments.name":

    [alice, john, smith, white], "comments.comment": [article, great, like, more, this], "comments.age": [28, 31], "comments.stars": [4, 5], "comments.date": [2014-09-01, 2014-10-22]} THE PROBLEM WITH MULTILEVEL OBJECTS 320
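The flattening can be imitated in a few lines of Python to show where the correlation is lost. This is an illustrative sketch with hypothetical helpers `flatten` and `matches_all`; it skips text analysis, so names are kept as whole strings rather than split into terms, and only two fields of each comment are used.

```python
def flatten(document, prefix=""):
    """Flatten inner objects and arrays of objects into a mapping of
    field path -> set of values."""
    flat = {}
    for key, value in document.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                nested = flatten(item, path + ".")
                for sub_path, sub_values in nested.items():
                    flat.setdefault(sub_path, set()).update(sub_values)
            else:
                flat.setdefault(path, set()).add(item)
    return flat


def matches_all(flat, conditions):
    """True if every field contains the requested value somewhere."""
    return all(
        value in flat.get(field, set()) for field, value in conditions.items()
    )


BLOGPOST = {
    "title": "Nest eggs",
    "comments": [
        {"name": "John Smith", "age": 28},
        {"name": "Alice White", "age": 31},
    ],
}
```

After flattening, `comments.age` is just the set {28, 31}, so a search for Alice White with age 28 matches even though no single comment has both values.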
  321. NESTED OBJECTS ▫︎Nested objects are indexed as hidden separate documents

    ▫︎Relationships are preserved ▫︎Joining nested documents is very fast 321
  322. {"comments.name": [john, smith], "comments.comment": [article, great], "comments.age": [28], "comments.stars": [4],

    "comments.date": [2014-09-01]} {"comments.name": [alice, white], "comments.comment": [like, more, this], "comments.age": [31], "comments.stars": [5], "comments.date": [2014-10-22]} NESTED OBJECTS 322
  323. { "title": [eggs, nest], "body": [making, money], "tags": [cash, shares]

    } NESTED OBJECTS 323
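Keeping each comment as its own hidden document means all conditions must hold within one comment. The Python sketch below models that behaviour in memory; `matches_nested` is a hypothetical helper, field names are given relative to the nested object, and text analysis is skipped.

```python
BLOGPOST = {
    "title": "Nest eggs",
    "comments": [
        {"name": "John Smith", "age": 28},
        {"name": "Alice White", "age": 31},
    ],
}


def matches_nested(document, path, conditions):
    """True only if a single object under `path` satisfies every
    condition, which mimics evaluating the query per nested document."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in document.get(path, [])
    )
```

With this model, Alice White with age 28 no longer matches, while John Smith with age 28 still does.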
  324. NESTED OBJECTS ▫︎Need to be enabled by updating the mapping

    of the index 324
  325. PUT /example "mappings": { "blogpost": { "properties": { "comments": {

    "type": "nested", "properties": { "name": {"type": "string"}, "comment": {"type": "string"}, "age": {"type": "short"}, "stars": {"type":"short"}, "date": {"type": "date"} }}}}} MAPPING A NESTED OBJECT 325
  326. GET /example/blogpost/_search "query": { "bool": { "must": [ {"match": {"title":

    "eggs"}}, {"nested": <NESTED QUERY>} ] } } QUERYING A NESTED OBJECT 326
  327. "nested": { "path": "comments", "query": { "bool": { "must": [

    {"match": {"comments.name": "john"}}, {"match": {"comments.age": 28}} ]}}} NESTED QUERY 327
  328. THERE’S MORE ▫︎Nested filters ▫︎Nested aggregations ▫︎Sorting by nested fields

    328
  329. ADVANTAGES ▫︎Very fast query-time joins ▫︎ACID support (single documents) ▫︎Convenient

    search using nested queries 329
  330. DISADVANTAGES ▫︎To add, change or delete a nested object, the

    whole document must be reindexed ▫︎Search requests return the whole document 330
  331. WHEN TO USE ▫︎When there is one main entity with

    a limited number of closely related entities ▫︎Ex: blogposts and comments ▫︎Inefficient if there are too many nested objects 331
  332. 4-6 PARENT-CHILD RELATIONSHIP 332

  333. PARENT-CHILD RELATIONSHIP ▫︎One-to-many relationship ▫︎Similar to the nested model ▫︎Nested

    objects live in the same document ▫︎Parent and children are completely separate documents 333
  334. EXAMPLE ▫︎Company with branches and employees ▫︎Branch is the parent

    ▫︎Employees are children 334
  335. PUT /company "mappings": { "branch": {}, "employee": { "_parent": {

    "type": "branch" } } } EXAMPLE 335
  336. PUT /company/branch/london { "name": "London Westminster", "city": "London", "country": "UK"

    } EXAMPLE 336
  337. PUT /company/employee/1?parent=london { "name": "Alice Smith", "born": "1970-10-24", "hobby":

    "hiking" } EXAMPLE 337
  338. GET /company/branch/_search "query": { "has_child": { "type": "employee", "query": {

    "range": { "born": { "gte": "1980-01-01" } }}}} FINDING PARENTS BY THEIR CHILDREN 338
  339. GET /company/employee/_search "query": { "has_parent": { "type": "branch", "query": {

    "match": { "country": "UK" } }}} FINDING CHILDREN BY THEIR PARENTS 339
  340. THERE’S MORE ▫︎min_children and max_children ▫︎Children aggregations ▫︎Grandparents and grandchildren

    340
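The semantics of `has_child`, `has_parent` and the `min_children` / `max_children` limits can be modelled with plain Python collections. This is an in-memory sketch only: the function names mirror the query names for readability but are hypothetical helpers, and all sample documents other than the London branch and Alice Smith are invented for the example.

```python
BRANCHES = {
    "london": {"city": "London", "country": "UK"},
    "liverpool": {"city": "Liverpool", "country": "UK"},  # hypothetical
    "paris": {"city": "Paris", "country": "France"},      # hypothetical
}
EMPLOYEES = [
    {"id": 1, "parent": "london", "name": "Alice Smith", "born": "1970-10-24"},
    {"id": 2, "parent": "london", "name": "Mark Thomas", "born": "1982-05-16"},
    {"id": 3, "parent": "liverpool", "name": "Barry Smith", "born": "1979-04-01"},
    {"id": 4, "parent": "paris", "name": "Claire Martin", "born": "1987-05-11"},
]


def has_child(predicate, min_children=1, max_children=None):
    """Ids of branches whose number of matching employees lies within
    [min_children, max_children]."""
    counts = {}
    for employee in EMPLOYEES:
        if predicate(employee):
            parent = employee["parent"]
            counts[parent] = counts.get(parent, 0) + 1
    return sorted(
        branch
        for branch, count in counts.items()
        if count >= min_children
        and (max_children is None or count <= max_children)
    )


def has_parent(predicate):
    """Ids of employees whose branch satisfies the predicate."""
    return [e["id"] for e in EMPLOYEES if predicate(BRANCHES[e["parent"]])]


def born_since_1980(employee):
    # ISO dates compare correctly as strings.
    return employee["born"] >= "1980-01-01"
```

In this sample, `has_child(born_since_1980)` returns the London and Paris branches, and raising `min_children` to 2 returns nothing because no branch has two such employees.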
  341. ADVANTAGES ▫︎Parent document can be updated without reindexing the children

    ▫︎Child documents can be updated without affecting the parent ▫︎Child documents can be returned in search results without the parent 341
  342. ADVANTAGES ▫︎Parent and children live on the same shard ▫︎Faster

    than application-side joins 342
  343. DISADVANTAGES ▫︎Parent document and all of its children must live

    on the same shard ▫︎5 to 10 times slower than nested queries 343
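The same-shard requirement follows from how documents are routed: the shard is chosen as a hash of the routing value modulo the number of primary shards, and a child uses its parent's id as the routing value instead of its own id. The Python sketch below illustrates this; `shard_for` and `shard_for_document` are hypothetical helpers, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards.
    crc32 is a stand-in hash (assumption), chosen because it is
    deterministic across runs."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def shard_for_document(doc_id, number_of_primary_shards, parent=None):
    """Children are routed by their parent's id, so they land on the
    parent's shard; other documents are routed by their own id."""
    routing = parent if parent is not None else doc_id
    return shard_for(routing, number_of_primary_shards)
```

Because the parent id decides the shard, every employee indexed with `parent=london` ends up next to the `london` branch, which is what makes the query-time join possible without crossing shards.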
  344. WHEN TO USE ▫︎One-to-many relationships ▫︎When index-time performance is more important

    than search-time performance ▫︎Otherwise, use nested objects 344
  345. REFERENCES 345

  346. MAIN REFERENCE ▫︎Elasticsearch: The Definitive Guide ▫︎Gormley & Tong ▫︎O'Reilly

    346
  347. OTHER REFERENCES ▫︎"Jepsen: simulating network partitions in DBs", http://github.com/aphyr/jepsen ▫︎"Call

    me maybe: Elasticsearch 1.5.0", http://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0 ▫︎"Call me maybe: MongoDB stale reads", http://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads 347
  348. OTHER REFERENCES ▫︎"Elasticsearch Data Resiliency Status", http://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html ▫︎"Solr

    vs. Elasticsearch — How to Decide?", http://blog.sematext.com/2015/01/30/solr-elasticsearch-comparison/ 348
  349. OTHER REFERENCES ▫︎"Changing Mapping with Zero Downtime", http://www.elastic.co/blog/changing-mapping-with-zero-downtime 349

  350. Felipe Dornelas felipedornelas.com @felipead THANK YOU