Elasticsearch Workshop

A broad, very hands-on Elasticsearch overview in ~4 hours. You will learn the fundamentals of Elasticsearch and get a glimpse of important Information Retrieval and Distributed Systems concepts.

Part 1 - Core Concepts
Part 2 - Search & Analytics
Part 3 - Dealing with Human Language
Part 4 - Data Modeling

Please download the examples at http://github.com/felipead/elasticsearch-workshop

Felipe Dornelas

January 05, 2016

Transcript

  1. Workshop
    ELASTICSEARCH
    Felipe Dornelas

  2. AGENDA
    ▫︎Part 1
      ▫︎Introduction
      ▫︎Document Store
      ▫︎Search Examples
      ▫︎Data Resiliency
      ▫︎Comparison with Solr
    ▫︎Part 2
      ▫︎Search
      ▫︎Analytics

  3. AGENDA
    ▫︎Part 3
      ▫︎Inverted Index
      ▫︎Analyzers
      ▫︎Mapping
      ▫︎Proximity Matching
      ▫︎Fuzzy Matching
    ▫︎Part 4
      ▫︎Inside a Cluster
      ▫︎Data Modeling

  4. → github.com/felipead/elasticsearch-workshop

  5. PRE-REQUISITES
    ▫︎Vagrant
    ▫︎VirtualBox
    ▫︎Git

  6. ENVIRONMENT SETUP
    ▫︎git clone https://github.com/felipead/elasticsearch-workshop.git
    ▫︎vagrant up
    ▫︎vagrant ssh
    ▫︎cd /vagrant

  7. VERIFY EVERYTHING IS WORKING
    ▫︎curl http://localhost:9200

  8. PART 1
    Core concepts

  9. 1-1 INTRODUCTION
    You know, for search

  10. WHAT IS ELASTICSEARCH?
    A real-time distributed search and
    analytics engine

  11. IT CAN BE USED FOR
    ▫︎Full-text search
    ▫︎Structured search
    ▫︎Real-time analytics
    ▫︎…or any combination of the above

  12. FEATURES
    ▫︎Distributed document store:
      ▫︎RESTful API
      ▫︎Automatic scale
      ▫︎Plug & Play ™

  13. FEATURES
    ▫︎Handles the human language:
      ▫︎Score results by relevance
      ▫︎Synonyms
      ▫︎Typos and misspellings
      ▫︎Internationalization

  14. FEATURES
    ▫︎Powerful analytics:
      ▫︎Comprehensive aggregations
      ▫︎Geolocations
      ▫︎Can be combined with search
      ▫︎Real-time (no batch-processing)

  15. FEATURES
    ▫︎Free and open source
    ▫︎Community support
    ▫︎Backed by Elastic

  16. MOTIVATION
    Most databases are inept at
    extracting knowledge from your data

  17. SQL DATABASES
    SQL = Structured Query Language

  18. SQL DATABASES
    ▫︎Can only filter by exact values
    ▫︎Unable to perform full-text search
    ▫︎Queries can be complex and inefficient
    ▫︎Often requires big-batch processing

  19. APACHE LUCENE
    ▫︎Arguably, the best search engine
    ▫︎High performance
    ▫︎Near real-time indexing
    ▫︎Open source

  20. APACHE LUCENE
    ▫︎But…
    ▫︎It’s just a Java Library
    ▫︎Hard to use

  21. ELASTICSEARCH
    ▫︎Document Store
    ▫︎Distributed
    ▫︎Scalable
    ▫︎Real Time
    ▫︎Analytics
    ▫︎RESTful API
    ▫︎Easy to Use

  22. DOCUMENT ORIENTED
    ▫︎Documents instead of rows / columns
    ▫︎Every field is indexed and searchable
    ▫︎Serialized to JSON
    ▫︎Schemaless

  23. WHO USES
    ▫︎GitHub
    ▫︎Wikipedia
    ▫︎Stack Overflow
    ▫︎The Guardian

  24. TALKING TO ELASTICSEARCH
    ▫︎Java API
      ▫︎Port 9300
      ▫︎Native transport protocol
      ▫︎Node client (joins the cluster)
      ▫︎Transport client (doesn't join the cluster)

  25. TALKING TO ELASTICSEARCH
    ▫︎RESTful API
      ▫︎Port 9200
      ▫︎JSON over HTTP

  26. TALKING TO ELASTICSEARCH
    We will only cover the RESTful API

  27. USING CURL
    curl -X {verb} -d {body} {url}
    or
    curl -X {verb} -d @{file} {url}

  28. THE EMPTY QUERY
    curl -X GET \
      -d @part-1/empty-query.json \
      'localhost:9200/_count?pretty'

  29. REQUEST
    {
    "query": {
    "match_all": {}
    }
    }

  30. RESPONSE
    {
    "count": 0,
    "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
    }
    }

  31. 1-2 DOCUMENT STORE

  32. THE PROBLEM WITH RELATIONAL DATABASES
    ▫︎Stores data in columns and rows
    ▫︎Equivalent of using a spreadsheet
    ▫︎Inflexible storage medium
    ▫︎Not suitable for rich objects

  33. DOCUMENTS
    {
    "name": "John Smith",
    "age": 42,
    "confirmed": true,
    "join_date": "2015-06-01",
    "home": {"lat": 51.5, "lon": 0.1},
    "accounts": [
    {"type": "facebook", "id": "johnsmith"},
    {"type": "twitter", "id": "johnsmith"}
    ]
    }

  34. DOCUMENT METADATA
    ▫︎Index - Where the document lives
    ▫︎Type - Class of object that the document
    represents
    ▫︎Id - Unique identifier for the document

  35. DOCUMENT METADATA
    Relational DB    Databases   Tables   Rows        Columns
    Elasticsearch    Indices     Types    Documents   Fields

  36. RESTFUL API
    [VERB] /{index}/{type}/{id}?pretty
    GET | POST | PUT | DELETE | HEAD

  37. RESTFUL API
    ▫︎JSON-only
    ▫︎Adding pretty to the query-string
    parameters pretty-prints the response
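
The URL pattern on the two slides above can be sketched as a small helper. This is only an illustration: `document_path` is a hypothetical name, not part of any Elasticsearch client, and it writes the flag as `pretty=true` where the slides use the bare `?pretty`.

```python
from urllib.parse import urlencode


def document_path(index, doc_type, doc_id=None, **params):
    """Build /{index}/{type}/{id} plus an optional query string."""
    parts = [index, doc_type]
    if doc_id is not None:
        parts.append(str(doc_id))
    path = "/" + "/".join(parts)
    query = urlencode(params)
    return f"{path}?{query}" if query else path


# Path of a single document, with pretty-printing requested.
print(document_path("blog", "post", 123, pretty="true"))
# Path of a whole type, as used when POSTing a new document.
print(document_path("blog", "post"))
```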

  38. INDEXING A DOCUMENT WITH YOUR OWN ID
    PUT /{index}/{type}/{id}

  39. INDEXING A DOCUMENT WITH YOUR OWN ID
    curl -X PUT \
      -d @part-1/first-blog-post.json \
      'localhost:9200/blog/post/123?pretty'

  40. REQUEST
    {
    "title": "My first blog post",
    "text": "Just trying this out...",
    "date": "2014-01-01"
    }

  41. RESPONSE
    {
    "_index" : "blog",
    "_type" : "post",
    "_id" : "123",
    "_version" : 1,
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : true
    }

  42. INDEXING A DOCUMENT WITH AUTOGENERATED ID
    POST /{index}/{type}
    * Autogenerated IDs are Base64-encoded UUIDs

  43. INDEXING A DOCUMENT WITH AUTOGENERATED ID
    curl -X POST \
      -d @part-1/second-blog-post.json \
      'localhost:9200/blog/post?pretty'

  44. REQUEST
    {
    "title": "Second blog post",
    "text": "Still trying this out...",
    "date": "2014-01-01"
    }

  45. RESPONSE
    {
    "_index" : "blog",
    "_type" : "post",
    "_id" : "AVFWIbMf7YZ6Se7RwMws",
    "_version" : 1,
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : true
    }

  46. RETRIEVING A DOCUMENT WITH METADATA
    GET /{index}/{type}/{id}

  47. RETRIEVING A DOCUMENT WITH METADATA
    curl -X GET \
      'localhost:9200/blog/post/123?pretty'

  48. RESPONSE
    {
    "_index" : "blog",
    "_type" : "post",
    "_id" : "123",
    "_version" : 1,
    "found" : true,
    "_source": {
    "title": "My first blog post",
    "text": "Just trying this out...",
    "date": "2014-01-01"
    }
    }

  49. RETRIEVING A DOCUMENT WITHOUT METADATA
    GET /{index}/{type}/{id}/_source

  50. RETRIEVING A DOCUMENT WITHOUT METADATA
    curl -X GET \
      'localhost:9200/blog/post/123/_source?pretty'

  51. RESPONSE
    {
    "title": "My first blog post",
    "text": "Just trying this out...",
    "date": "2014-01-01"
    }

  52. RETRIEVING PART OF A DOCUMENT
    GET /{index}/{type}/{id}?_source={fields}

  53. RETRIEVING PART OF A DOCUMENT
    curl -X GET \
      'localhost:9200/blog/post/123?_source=title,date&pretty'

  54. RESPONSE
    {
    "_index" : "blog",
    "_type" : "post",
    "_id" : "123",
    "_version" : 1,
    "found" : true,
    "_source": {
    "title": "My first blog post",
    "date": "2014-01-01"
    }
    }

  55. CHECKING WHETHER A DOCUMENT EXISTS
    HEAD /{index}/{type}/{id}

  56. CHECKING WHETHER A DOCUMENT EXISTS
    curl -i -X HEAD \
      localhost:9200/blog/post/123

  57. RESPONSE
    HTTP/1.1 200 OK
    Content-Length: 0

  58. CHECKING WHETHER A DOCUMENT EXISTS
    curl -i -X HEAD \
      localhost:9200/blog/post/666

  59. RESPONSE
    HTTP/1.1 404 Not Found
    Content-Length: 0

  60. UPDATING A WHOLE DOCUMENT
    PUT /{index}/{type}/{id}

  61. UPDATING A WHOLE DOCUMENT
    curl -X PUT \
      -d @part-1/updated-blog-post.json \
      'localhost:9200/blog/post/123?pretty'

  62. REQUEST
    {
    "title": "My first blog post",
    "text": "I am starting to get the hang of this...",
    "date": "2014-01-02"
    }

  63. RESPONSE
    {
    "_index" : "blog",
    "_type" : "post",
    "_id" : "123",
    "_version" : 2,
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    },
    "created" : false
    }

  64. DELETING A DOCUMENT
    DELETE /{index}/{type}/{id}

  65. DELETING A DOCUMENT
    curl -X DELETE \
      'localhost:9200/blog/post/123?pretty'

  66. RESPONSE
    {
    "found" : true,
    "_index" : "blog",
    "_type" : "post",
    "_id" : "123",
    "_version" : 3,
    "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
    }
    }

  67. DEALING WITH CONFLICTS

  68. PESSIMISTIC CONCURRENCY CONTROL
    ▫︎Used by relational databases
    ▫︎Assumes conflicts are likely to happen
    (pessimist)
    ▫︎Blocks access to resources

  69. OPTIMISTIC CONCURRENCY CONTROL
    ▫︎Assumes conflicts are unlikely to
    happen (optimist)
    ▫︎Does not block operations
    ▫︎If conflict happens, update fails

  70. HOW ELASTICSEARCH DEALS WITH CONFLICTS
    ▫︎Locking distributed resources would be
    very inefficient
    ▫︎Uses Optimistic Concurrency Control
    ▫︎Auto-increments _version number

  71. HOW ELASTICSEARCH DEALS WITH CONFLICTS
    ▫︎PUT /blog/post/123?version=1
    ▫︎If the version is outdated, returns 409 Conflict
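
The version check described above can be illustrated with a minimal in-memory sketch. It is not Elasticsearch's implementation: `VersionedStore` and `VersionConflict` are hypothetical names, and the exception merely stands in for the 409 Conflict response.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 Conflict response."""


class VersionedStore:
    """Toy document store that bumps a version number on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def put(self, doc_id, source, version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # Optimistic check: nothing is locked, the write simply fails
        # when the caller's expected version is no longer current.
        if version is not None and version != current_version:
            raise VersionConflict(
                f"expected version {version}, current is {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
v1 = store.put("123", {"title": "My first blog post"})
v2 = store.put("123", {"title": "Updated title"}, version=1)

try:
    # A second writer still believes the document is at version 1.
    store.put("123", {"title": "Stale write"}, version=1)
    conflict = False
except VersionConflict:
    conflict = True

print(v1, v2, conflict)
```

The second writer has to re-read the document and retry; no reader or writer was ever blocked.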

  72. 1-3 SEARCH EXAMPLES

  73. EMPLOYEE DIRECTORY EXAMPLE
    ▫︎Index: megacorp
    ▫︎Type: employee
    ▫︎Ex: John Smith, Jane Smith, Douglas Fir

  74. EMPLOYEE DIRECTORY EXAMPLE
    curl -X PUT \
      -d @part-1/john-smith.json \
      localhost:9200/megacorp/employee/1

  75. REQUEST
    {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"]
    }

  76. EMPLOYEE DIRECTORY EXAMPLE
    curl -X PUT \
      -d @part-1/jane-smith.json \
      localhost:9200/megacorp/employee/2

  77. REQUEST
    {
    "first_name": "Jane",
    "last_name": "Smith",
    "age": 32,
    "about": "I like to collect rock albums",
    "interests": ["music"]
    }

  78. EMPLOYEE DIRECTORY EXAMPLE
    curl -X PUT \
      -d @part-1/douglas-fir.json \
      localhost:9200/megacorp/employee/3

  79. REQUEST
    {
    "first_name": "Douglas",
    "last_name": "Fir",
    "age": 35,
    "about": "I like to build cabinets",
    "interests": ["forestry"]
    }

  80. SEARCHING ALL EMPLOYEES
    GET /megacorp/employee/_search

  81. SEARCHING ALL EMPLOYEES
    curl -X GET \
      'localhost:9200/megacorp/employee/_search?pretty'

  82. SEARCH WITH QUERY-STRING
    GET /megacorp/employee/_search?q=last_name:Smith

  83. SEARCH WITH QUERY-STRING
    curl -X GET \
      'localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty'

  84. RESPONSE
    "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {

    "_score" : 0.30685282,
    "_source": {
    "first_name": "Jane",
    "last_name": "Smith", … }
    }, {

    "_score" : 0.30685282,
    "_source": {
    "first_name": "John",
    "last_name": "Smith", … }
    } ]
    }

  85. SEARCH WITH QUERY DSL
    curl -X GET \
      -d @part-1/last-name-query.json \
      'localhost:9200/megacorp/employee/_search?pretty'

  86. REQUEST
    {
    "query": {
    "match": {
    "last_name": "Smith"
    }
    }
    }

  87. RESPONSE
    "hits" : {
    "total" : 2,
    "max_score" : 0.30685282,
    "hits" : [ {

    "_score" : 0.30685282,
    "_source": {
    "first_name": "Jane",
    "last_name": "Smith", … }
    }, {

    "_score" : 0.30685282,
    "_source": {
    "first_name": "John",
    "last_name": "Smith", … }
    } ]
    }

  88. SEARCH WITH QUERY DSL AND FILTER
    curl -X GET \
      -d @part-1/last-name-age-query.json \
      'localhost:9200/megacorp/employee/_search?pretty'

  89. REQUEST
    {
    "query": {
    "filtered": {
    "filter": {
    "range": {
    "age": { "gt": 30 }
    }
    },
    "query": {
    "match": { "last_name": "Smith" }
    }
    }
    }
    }

  90. RESPONSE
    "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {

    "_score" : 0.30685282,
    "_source": {
    "first_name": "Jane",
    "last_name": "Smith",
    "age": 32, … }
    } ]
    }

  91. FULL-TEXT SEARCH
    curl -X GET \
      -d @part-1/full-text-search.json \
      'localhost:9200/megacorp/employee/_search?pretty'

  92. REQUEST
    {
    "query": {
    "match": {
    "about": "rock climbing"
    }
    }
    }

  93. RESPONSE
    "hits" : [{ …
    "_score" : 0.16273327,
    "_source": {
    "first_name": "John", "last_name": "Smith",
    "about": "I love to go rock climbing", … }
    }, { …
    "_score" : 0.016878016,
    "_source": {
    "first_name": "Jane", "last_name": "Smith",
    "about": "I like to collect rock albums", … }
    }]

  94. RELEVANCE SCORES
    ▫︎The _score field ranks search results
    ▫︎The higher the score, the better

  95. PHRASE SEARCH
    curl -X GET \
      -d @part-1/phrase-search.json \
      'localhost:9200/megacorp/employee/_search?pretty'

  96. REQUEST
    {
    "query": {
    "match_phrase": {
    "about": "rock climbing"
    }
    }
    }

  97. RESPONSE
    "hits" : {
    "total" : 1,
    "max_score" : 0.23013961,
    "hits" : [ {

    "_score" : 0.23013961,
    "_source": {
    "first_name": "John",
    "last_name": "Smith",
    "about": "I love to go rock climbing"
    … }
    } ]
    }

  98. 1-4 DATA RESILIENCY

  99. CALL ME MAYBE
    ▫︎Jepsen Tests
      ▫︎Simulate network partition scenarios
      ▫︎Run several operations against a distributed system
      ▫︎Verify that the history of those operations makes sense

  100. NETWORK PARTITION

  101. ELASTICSEARCH STATUS
    ▫︎Risk of data loss on network partition
    and split-brain scenarios

  102. IT IS NOT SO BAD…
    ▫︎Still much more resilient than MongoDB
    ▫︎Elastic is working hard to improve it
    ▫︎Two-phase commits are planned

  103. IF YOU REALLY CARE ABOUT YOUR DATA
    ▫︎Use a more reliable primary data store:
      ▫︎Cassandra
      ▫︎Postgres
    ▫︎Synchronize it to Elasticsearch
    ▫︎…or set up comprehensive backups

  104. There’s no such thing as a 100%
    reliable distributed system

  105. 1-5 SOLR COMPARISON

  106. SOLR
    ▫︎SolrCloud
    ▫︎Both:
      ▫︎Are open-source and mature
      ▫︎Are based on Apache Lucene
      ▫︎Have more or less similar features

  107. SOLR API
    ▫︎HTTP GET
    ▫︎Query parameters passed in as URL
    parameters
    ▫︎Is not RESTful
    ▫︎Multiple formats (JSON, XML…)

  108. SOLR API
    ▫︎Version 4.4 added Schemaless API
    ▫︎Older versions require up-front Schema

  109. ELASTICSEARCH API
    ▫︎RESTful
    ▫︎Schemaless
    ▫︎CRUD document operations
    ▫︎Manage indices, read metrics, etc…

  110. ELASTICSEARCH API
    ▫︎Query DSL
    ▫︎Better readability
    ▫︎JSON-only

  111. SEARCH
    ▫︎Both are very good with text search
    ▫︎Both based on Apache Lucene

  112. EASE OF USE
    ▫︎Elasticsearch is simpler:
      ▫︎Just a single process
      ▫︎Easier API
    ▫︎SolrCloud requires Apache ZooKeeper

  113. SOLRCLOUD DATA RESILIENCY
    ▫︎SolrCloud uses Apache ZooKeeper to
    discover nodes
    ▫︎Better at preventing split-brain
    conditions
    ▫︎Jepsen Tests pass

  114. ANALYTICS
    ▫︎Elasticsearch is the choice for analytics:
      ▫︎Comprehensive aggregations
      ▫︎Thousands of metrics
    ▫︎SolrCloud is not even close

  115. PART 2
    Search and Analytics

  116. 2-1 SEARCH
    Finding the needle in the haystack

  117. TWEETS EXAMPLE
    ▫︎/{index}/user
    ▫︎/{index}/tweet

  118. TWEETS EXAMPLE
    /us/user/1
    {
    "email": "[email protected]",
    "name": "John Smith",
    "username": "@john"
    }

  119. TWEETS EXAMPLE
    /gb/user/2
    {
    "email": "[email protected]",
    "name": "Mary Jones",
    "username": "@mary"
    }

  120. TWEETS EXAMPLE
    /gb/tweet/3
    {
    "date": "2014-09-13",
    "name": "Mary Jones",
    "tweet": "Elasticsearch means full text search has never been so easy",
    "user_id": 2
    }

  121. TWEETS EXAMPLE
    ./part-2/load-tweet-data.sh

  122. THE EMPTY SEARCH
    GET /_search
    ▫︎Returns all documents on all indices

  123. THE EMPTY SEARCH
    curl -X GET \
      'localhost:9200/_search?pretty'

  124. THE EMPTY SEARCH
    "hits" : {
    "total" : 14,
    "hits" : [
    {
    "_index": "us",
    "_type": "tweet",
    "_id": "7",
    "_score": 1,
    "_source": {
    "date": "2014-09-17",
    "name": "John Smith",
    "tweet": "The Query DSL is really powerful and flexible",
    "user_id": 2
    }
    },
    … 9 RESULTS REMOVED …
    ]
    }

  125. MULTI-INDEX, MULTITYPE SEARCH
    ▫︎/_search
    ▫︎/gb/_search
    ▫︎/gb,us/_search
    ▫︎/gb/user/_search
    ▫︎/_all/user,tweet/_search

  126. PAGINATION
    ▫︎Returns 10 results per request (default)
    ▫︎Control parameters:
      ▫︎size: number of results to return
      ▫︎from: number of results to skip

  127. PAGINATION
    ▫︎GET /_search?size=5
    ▫︎GET /_search?size=5&from=5
    ▫︎GET /_search?size=5&from=10
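
The relationship between a page number and the `size` / `from` parameters above can be sketched as follows. The helper names `page_params` and `paginate` are hypothetical, and `paginate` only slices a local list to mimic what the two parameters select.

```python
def page_params(page, size=10):
    """Translate a 1-based page number into size/from parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"size": size, "from": (page - 1) * size}


def paginate(hits, size=10, from_=0):
    """Skip `from_` hits, then return at most `size` of them."""
    return hits[from_:from_ + size]


all_hits = list(range(14))  # stand-in for 14 matching documents

for page in (1, 2, 3):
    params = page_params(page, size=5)
    print(params, paginate(all_hits, params["size"], params["from"]))
```

With 14 hits and a size of 5, the third page holds only the last four results.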

  128. TYPES OF SEARCH
    ▫︎Structured query on concrete fields
    (similar to SQL)
    ▫︎Full-text query (sorts results by
    relevance)
    ▫︎Combination of the two

  129. SEARCH BY EXACT VALUES
    ▫︎Examples:
      ▫︎date
      ▫︎user ID
      ▫︎username
    ▫︎“Does this document match the query?”

  130. SEARCH BY EXACT VALUES
    ▫︎SQL queries:
    SELECT * FROM user
    WHERE name = 'John Smith'
    AND user_id = 2
    AND date > '2014-09-15'

  131. FULL-TEXT SEARCH
    ▫︎Examples:
      ▫︎the text of a tweet
      ▫︎body of an email
    ▫︎“How well does this document match the query?”

  132. FULL-TEXT SEARCH
    ▫︎UK should also match United Kingdom
    ▫︎jump should also match jumped, jumps,
    jumping and leap

  133. FULL-TEXT SEARCH
    ▫︎fox news hunting should return stories
    about hunting on Fox News
    ▫︎fox hunting news should return news
    stories about fox hunting

  134. HOW ELASTICSEARCH PERFORMS TEXT SEARCH
    ▫︎Analyzes the text
    ▫︎Tokenizes into terms
    ▫︎Normalizes the terms
    ▫︎Builds an inverted index

  135. LIST OF INDEXED DOCUMENTS
    ID  Text
    1   Baseball is played during summer months.
    2   Summer is the time for picnics here.
    3   Months later we found out why.
    4   Why is summer so hot here.

  136. INVERTED INDEX
    Term      Frequency  Document IDs
    baseball  1          1
    during    1          1
    found     1          3
    here      2          2, 4
    hot       1          4
    is        3          1, 2, 4
    months    2          1, 3
    summer    3          1, 2, 4
    the       1          2
    why       2          3, 4
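
A minimal sketch of how such an inverted index could be built from the four documents listed earlier. The tokenizer here is an assumption (lowercase, split on non-alphanumeric characters) and is far simpler than a real analyzer; "frequency" is taken to mean the number of documents containing the term, as in the table.

```python
import re
from collections import defaultdict

documents = {
    1: "Baseball is played during summer months.",
    2: "Summer is the time for picnics here.",
    3: "Months later we found out why.",
    4: "Why is summer so hot here.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(documents)

# Print term, document frequency and document IDs for a few terms.
for term in ("here", "is", "months", "summer", "why"):
    print(term, len(index[term]), index[term])
```

Looking up a term is then a single dictionary access, which is what makes full-text search over many documents cheap.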

  137. QUERY DSL
    GET /_search
    {
    "query": YOUR_QUERY_HERE
    }

  138. QUERY BY FIELD
    {
    "match": {
    "tweet": "elasticsearch"
    }
    }

  139. QUERY BY FIELD
    curl -X GET \
      -d @part-2/elasticsearch-tweets-query.json \
      localhost:9200/_all/tweet/_search

  140. QUERY WITH MULTIPLE CLAUSES
    {
    "bool": {
    "must": {
    "match": { "tweet": "elasticsearch"}
    },
    "must_not": {
    "match": { "name": "mary" }
    },
    "should": {
    "match": { "tweet": "full text" }
    }
    }
    }

  141. QUERY WITH MULTIPLE CLAUSES
    curl -X GET \
      -d @part-2/combining-tweet-queries.json \
      localhost:9200/_all/tweet/_search

  142. QUERY WITH MULTIPLE CLAUSES
    "_score": 0.07082729, "_source": { …
    "name": "John Smith",
    "tweet": "The Elasticsearch API is really easy to use"
    }, …
    "_score": 0.049890988, "_source": { …
    "name": "John Smith",
    "tweet": "Elasticsearch surely is one of the hottest new NoSQL products"
    }, …
    "_score": 0.03991279, "_source": { …
    "name": "John Smith",
    "tweet": "Elasticsearch and I have left the honeymoon stage, and I still love her." }

  143. MOST IMPORTANT QUERIES
    ▫︎match
    ▫︎match_all
    ▫︎multi_match
    ▫︎bool

  144. QUERIES VS. FILTERS
    ▫︎Queries:
      ▫︎full-text
      ▫︎“how well does the document match?”
    ▫︎Filters:
      ▫︎exact values
      ▫︎yes-no questions

  145. QUERIES VS. FILTERS
    ▫︎The goal of filters is to reduce the
    number of documents that have to be
    examined by a query

  146. PERFORMANCE COMPARISON
    ▫︎Filters are easy to cache and can be
    reused efficiently
    ▫︎Queries are heavier and non-cacheable

  147. WHEN TO USE WHICH
    ▫︎Use queries only for full-text search
    ▫︎Use filters for anything else
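
The division of labour between filters and queries can be sketched in a few lines. Everything here is a simplified assumption: the sample tweets are made up for illustration, the "score" is a plain word-overlap count rather than a real relevance score, and the helper names are hypothetical.

```python
tweets = [  # hypothetical sample documents
    {"user_id": 1, "tweet": "The Elasticsearch API is really easy to use"},
    {"user_id": 2, "tweet": "Elasticsearch means full text search has never been so easy"},
    {"user_id": 1, "tweet": "The Query DSL is really powerful and flexible"},
]


def term_filter(field, value):
    """Filter: a yes/no question about an exact value."""
    return lambda doc: doc.get(field) == value


def match_query(field, text):
    """Query: returns a number saying how well the document matches."""
    wanted = set(text.lower().split())

    def score(doc):
        words = doc.get(field, "").lower().split()
        return sum(1 for word in words if word in wanted)

    return score


def filtered_search(docs, filter_fn, score_fn):
    # The cheap filter runs first, so the scoring function only
    # has to examine the documents that survived it.
    candidates = [doc for doc in docs if filter_fn(doc)]
    scored = [(score_fn(doc), doc) for doc in candidates]
    hits = [(score, doc) for score, doc in scored if score > 0]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


hits = filtered_search(
    tweets,
    term_filter("user_id", 1),
    match_query("tweet", "elasticsearch"),
)
for score, doc in hits:
    print(score, doc["tweet"])
```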

  148. FILTER BY EXACT FIELD VALUES
    "filtered": {
    "filter": {
    "term": {
    "user_id": 1
    }
    }
    }

  149. FILTER BY EXACT FIELD VALUES
    curl -X GET \
      -d @part-2/user-id-filter.json \
      localhost:9200/_search

  150. FILTER BY EXACT FIELD VALUES
    "filtered": {
    "filter": {
    "range": {
    "date": {
    "gte": "2014-09-20"
    }
    }
    }
    }

  151. FILTER BY EXACT FIELD VALUES
    curl -X GET \
      -d @part-2/date-filter.json \
      localhost:9200/_search

  152. MOST IMPORTANT FILTERS
    ▫︎term
    ▫︎terms
    ▫︎range
    ▫︎exists and missing
    ▫︎bool

  153. COMBINING QUERIES WITH FILTERS
    "filtered": {
    "query": {
    "match": {
    "tweet": "elasticsearch"
    }
    },
    "filter": {
    "term": { "user_id": 1 }
    }
    }

  154. COMBINING QUERIES WITH FILTERS
    curl -X GET \
      -d @part-2/filtered-tweet-query.json \
      localhost:9200/_search

  155. SORTING
    ▫︎Relevance score
    ▫︎The higher the score, the better
    ▫︎By default, results are returned in
    descending order of relevance
    ▫︎You can sort by any field

  156. RELEVANCE SCORE
    ▫︎Similarity algorithm
    ▫︎Term Frequency / Inverse Document
    Frequency (TF/IDF)

  157. RELEVANCE SCORE
    ▫︎Term frequency
      ▫︎How often does the term appear in the field?
      ▫︎The more often, the more relevant

  158. RELEVANCE SCORE
    ▫︎Inverse document frequency
      ▫︎How often does each term appear in the index?
      ▫︎The more often, the less relevant

  159. RELEVANCE SCORE
    ▫︎Field-length norm
      ▫︎How long is the field?
      ▫︎The longer it is, the less likely it is that words in the field will be relevant
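
The three factors above can be sketched numerically. The formulas are an assumption taken from Lucene's classic TF/IDF similarity (square root of the term count, `1 + ln(numDocs / (docFreq + 1))`, and `1 / sqrt(number of terms)`); the product below is a simplified per-term weight, not the full scoring formula, which adds further factors such as query normalization and boosts.

```python
import math


def term_frequency(term, tokens):
    """The more often the term appears in the field, the higher."""
    return math.sqrt(tokens.count(term))


def inverse_document_frequency(term, corpus):
    """The more documents contain the term, the lower."""
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))


def field_length_norm(tokens):
    """The longer the field, the lower."""
    return 1.0 / math.sqrt(len(tokens)) if tokens else 0.0


def weight(term, tokens, corpus):
    """Simplified weight of one term in one field (assumption, see lead-in)."""
    return (
        term_frequency(term, tokens)
        * inverse_document_frequency(term, corpus)
        * field_length_norm(tokens)
    )


# The "about" fields of the three employees, already tokenized.
corpus = [
    "i love to go rock climbing".split(),
    "i like to collect rock albums".split(),
    "i like to build cabinets".split(),
]

for term in ("rock", "climbing"):
    print(term, [round(weight(term, doc, corpus), 3) for doc in corpus])
```

In this toy corpus "climbing" appears in one document and "rock" in two, so "climbing" carries the larger inverse document frequency.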

  160. 2-2 ANALYTICS
    How many needles are in the haystack?

  161. SEARCH
    ▫︎Just looks for the needle in the haystack

  162. BUSINESS QUESTIONS
    ▫︎How many needles are in the haystack?
    ▫︎What is the average length of the needles?
    ▫︎What is the median length of the
    needles, by manufacturer?
    ▫︎How many needles were added to the
    haystack each month?

  163. BUSINESS QUESTIONS
    ▫︎What are your most popular needle
    manufacturers?
    ▫︎Are there any anomalous clumps of
    needles?

  164. AGGREGATIONS
    ▫︎Answer Analytics questions
    ▫︎Can be combined with Search
    ▫︎Near real-time in Elasticsearch
    ▫︎SQL queries can take days

  165. AGGREGATIONS
    Buckets + Metrics

  166. BUCKETS
    ▫︎Collection of documents that meet
    certain criteria
    ▫︎Can be nested inside other buckets

  167. BUCKETS
    ▫︎Employee ⇒ male or female bucket
    ▫︎San Francisco ⇒ California bucket
    ▫︎2014-10-28 ⇒ October bucket

  168. METRICS
    ▫︎Calculations on top of buckets
    ▫︎Answer the questions
    ▫︎Ex: min, max, mean, sum…

  169. EXAMPLE
    ▫︎Partition by country (bucket)
    ▫︎…then partition by gender (bucket)
    ▫︎…then partition by age ranges (bucket)
    ▫︎…calculate the average salary for each
    age range (metric)
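
The country / gender / age-range example can be sketched in plain Python to show how buckets and metrics fit together. The employee records and the ten-year age ranges are hypothetical, and nested buckets are flattened into tuple keys for brevity.

```python
from collections import defaultdict
from statistics import mean

employees = [  # hypothetical records
    {"country": "BR", "gender": "female", "age": 28, "salary": 5000},
    {"country": "BR", "gender": "female", "age": 25, "salary": 7000},
    {"country": "BR", "gender": "male", "age": 34, "salary": 6000},
    {"country": "US", "gender": "female", "age": 41, "salary": 9000},
]


def age_range(age, width=10):
    """Label the age bucket a person falls into, e.g. 28 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def average_salary_by_bucket(records):
    # Buckets: group documents by country, then gender, then age range.
    buckets = defaultdict(list)
    for record in records:
        key = (record["country"], record["gender"], age_range(record["age"]))
        buckets[key].append(record["salary"])
    # Metric: one calculation per bucket.
    return {key: mean(salaries) for key, salaries in buckets.items()}


for key, avg in sorted(average_salary_by_bucket(employees).items()):
    print(key, avg)
```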

  170. CAR TRANSACTIONS EXAMPLE
    ▫︎/cars/transactions

  171. CAR TRANSACTIONS EXAMPLE
    /cars/transactions/AVFr1xbVmdUYWpF46Ps4
    {
    "price" : 10000,
    "color" : "red",
    "make" : "honda",
    "sold" : "2014-10-28"
    }

  172. CAR TRANSACTIONS EXAMPLE
    ./part-2/load-car-data.sh

  173. BEST SELLING CAR COLOR
    {
    "aggs": {
    "colors": {
    "terms": {
    "field": "color"
    }
    }
    }
    }

  174. BEST SELLING CAR COLOR
    curl -X GET \
      -d @part-2/best-selling-car-color.json \
      'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  175. BEST SELLING CAR COLOR
    "colors" : {
    "buckets" : [{
    "key" : "red",
    "doc_count" : 16
    }, {
    "key" : "blue",
    "doc_count" : 8
    }, {
    "key" : "green",
    "doc_count" : 8
    }]
    }

  176. AVERAGE CAR COLOR PRICE
    {
    "aggs": {
    "colors": {
    "terms": { "field": "color" },
    "aggs": {
    "avg_price": {
    "avg": { "field": "price" }
    }
    }
    }
    }
    }

  177. AVERAGE CAR COLOR PRICE
    curl -X GET \
      -d @part-2/average-car-color-price.json \
      'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  178. AVERAGE CAR COLOR PRICE
    "colors" : {
    "buckets": [{
    "key": "red", "doc_count": 16,
    "avg_price": { "value": 32500.0 }
    }, {
    "key": "blue", "doc_count": 8,
    "avg_price": { "value": 20000.0 }
    }, {
    "key": "green", "doc_count": 8,
    "avg_price": { "value": 21000.0 }
    }]
    }

  179. BUILDING BAR CHARTS
    ▫︎Very easy to convert aggregations to
    charts and graphs
    ▫︎Ex: histograms and time-series

  180. CAR SALES REVENUE HISTOGRAM
    {
    "aggs": {
    "price": {
    "histogram": {
    "field": "price",
    "interval": 20000
    },
    "aggs": {
    "revenue": {"sum": {"field" : "price"}}
    }
    }
    }
    }

  181. CAR SALES REVENUE HISTOGRAM
    curl -X GET \
      -d @part-2/car-revenue-histogram.json \
      'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  182. CAR SALES REVENUE HISTOGRAM
    "price" : {
    "buckets": [
    { "key": 0, "doc_count": 12,
    "revenue": {"value": 148000.0} },
    { "key": 20000, "doc_count": 16,
    "revenue": {"value": 380000.0} },
    { "key": 40000, "doc_count": 0,
    "revenue": {"value": 0.0} },
    { "key": 60000, "doc_count": 0,
    "revenue": {"value": 0.0} },
    { "key": 80000, "doc_count": 4,
    "revenue": {"value" : 320000.0} }
    ]}
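
How a histogram bucket key is derived from a value can be sketched as follows. This assumes integer prices and that each key is the value rounded down to a multiple of the interval; the price list is a hypothetical sample, not the workshop data set, and `price_histogram` is a made-up helper name.

```python
def price_histogram(prices, interval):
    """Bucket integer prices by interval, summing revenue per bucket."""
    buckets = {}
    for price in prices:
        # Round down to the nearest multiple of the interval.
        key = (price // interval) * interval
        bucket = buckets.setdefault(key, {"doc_count": 0, "revenue": 0})
        bucket["doc_count"] += 1
        bucket["revenue"] += price
    if buckets:
        # Also emit the empty buckets between the lowest and highest key.
        for key in range(min(buckets), max(buckets) + interval, interval):
            buckets.setdefault(key, {"doc_count": 0, "revenue": 0})
    return dict(sorted(buckets.items()))


sample_prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

for key, bucket in price_histogram(sample_prices, 20000).items():
    print(key, bucket)
```

The empty 40000 and 60000 buckets are kept so that a bar chart drawn from the result has no gaps on its x-axis.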

  183. CAR SALES REVENUE HISTOGRAM

  184. TIME-SERIES DATA
    ▫︎Data with a timestamp:
      ▫︎How many cars sold each month this year?
      ▫︎What was the price of this stock for the last 12 hours?
      ▫︎What was the average latency of our website every hour in the last week?

  185. HOW MANY CARS SOLD PER MONTH?
    {
    "aggs": {
    "sales": {
    "date_histogram": {
    "field": "sold",
    "interval": "month",
    "format": "yyyy-MM-dd"
    }
    }
    }
    }

  186. HOW MANY CARS SOLD PER MONTH?
    curl -X GET \
      -d @part-2/car-sales-per-month.json \
      'localhost:9200/cars/transactions/_search?search_type=count&pretty'

  187. HOW MANY CARS SOLD PER MONTH?
    "sales" : {
    "buckets" : [
    {"key_as_string": "2014-01-01", "doc_count": 4},
    {"key_as_string": "2014-02-01", "doc_count": 4},
    {"key_as_string": "2014-03-01", "doc_count": 0},
    {"key_as_string": "2014-04-01", "doc_count": 0},
    {"key_as_string": "2014-05-01", "doc_count": 4},
    {"key_as_string": "2014-06-01", "doc_count": 0},
    {"key_as_string": "2014-07-01", "doc_count": 4},
    {"key_as_string": "2014-08-01", "doc_count": 4},
    {"key_as_string": "2014-09-01", "doc_count": 0},
    {"key_as_string": "2014-10-01", "doc_count": 4},
    {"key_as_string": "2014-11-01", "doc_count": 8}
    ]
    }
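
A monthly bucketing like the one in this response can be approximated with the standard library. This is a sketch under stated assumptions: the sale dates are hypothetical, `sales_per_month` is a made-up helper, and months without sales are filled in with a zero count to mirror the empty buckets shown above.

```python
from collections import Counter
from datetime import date


def month_start(day):
    """Bucket key: the first day of the month the date falls in."""
    return day.replace(day=1)


def next_month(first_day):
    """First day of the following month, rolling over in December."""
    return date(first_day.year + first_day.month // 12, first_day.month % 12 + 1, 1)


def sales_per_month(sold_dates):
    counts = Counter(month_start(day) for day in sold_dates)
    if not counts:
        return []
    buckets = []
    current, last = min(counts), max(counts)
    while current <= last:
        buckets.append(
            {"key_as_string": current.isoformat(), "doc_count": counts.get(current, 0)}
        )
        current = next_month(current)
    return buckets


sold = [date(2014, 10, 28), date(2014, 11, 5), date(2014, 11, 20), date(2015, 1, 1)]

for bucket in sales_per_month(sold):
    print(bucket)
```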

  188. HOW MANY CARS SOLD PER MONTH?

  189. PART 3
    Dealing with human language

  190. 3-1 INVERTED INDEX

  191. INVERTED INDEX
    ▫︎Data structure
    ▫︎Efficient full-text search

  192. EXAMPLE
    Document 1: The quick brown fox jumped over the lazy dog
    Document 2: Quick brown foxes leap over lazy dogs in summer

  193. TOKENIZATION
    Document 1: ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
    Document 2: ["Quick", "brown", "foxes", "leap", "over", "lazy", "dogs", "in", "summer"]

  194. 194
    Term     Document 1   Document 2
    Quick                 X
    The      X
    brown    X            X
    dog      X
    dogs                  X
    fox      X
    foxes                 X
    in                    X
    jumped   X
    lazy     X            X
    leap                  X
    over     X            X
    quick    X
    summer                X
    the      X


  195. EXAMPLE
    ▫︎Searching for “quick brown”
    ▫︎Naive similarity algorithm:
    ▫︎Document 1 is a better match
    Term     Document 1   Document 2
    brown    X            X
    quick    X
    Total    2            1
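
The index and the naive scoring above can be sketched in a few lines of Python. This is a toy model (whitespace tokenization, no normalization); the function names are made up for illustration.

```python
from collections import defaultdict

def tokenize(text):
    """Split on whitespace only; no normalization yet."""
    return text.split()

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def naive_score(index, query):
    """Score each document by how many query terms it contains."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return dict(scores)

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
scores = naive_score(index, "quick brown")
print(scores)
```

Document 1 contains both query terms and document 2 only one, because `Quick` and `quick` are still different terms at this point.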


  196. A FEW PROBLEMS
    ▫︎Quick and quick are the same word
    ▫︎fox and foxes are pretty similar
    ▫︎jumped and leap are synonyms

  197. NORMALIZATION
    ▫︎Quick lowercased to quick
    ▫︎foxes stemmed to fox
    ▫︎jumped and leap replaced by jump

  198. BETTER INVERTED INDEX
    Term     Document 1   Document 2
    brown    X            X
    dog      X            X
    fox      X            X
    in                    X
    jump     X            X
    lazy     X            X
    over     X            X
    quick    X            X
    summer                X
    the      X


  199. SEARCH INPUT
    ▫︎You can only find terms that exist in the
    inverted index
    ▫︎The query string is also normalized
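
A minimal Python sketch of why the query must go through the same normalization as the documents. The `STEMS` and `SYNONYMS` tables are toy stand-ins for a real stemmer and synonym filter, and all names here are assumptions for illustration.

```python
# Toy lookup tables standing in for a real stemmer and synonym filter (assumptions).
STEMS = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}

def normalize(text):
    """Lowercase, stem and map synonyms for every whitespace-separated token."""
    terms = []
    for token in text.lower().split():
        token = STEMS.get(token, token)
        terms.append(SYNONYMS.get(token, token))
    return terms

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Build the inverted index from normalized terms.
index = {}
for doc_id, text in docs.items():
    for term in normalize(text):
        index.setdefault(term, set()).add(doc_id)

def search(query):
    """Normalize the query the same way, then count matching terms per document."""
    scores = {}
    for term in normalize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores

result = search("Quick FOXES")
print(sorted(index))
print(result)
```

Without normalizing the query, `Quick FOXES` would find nothing, since neither term exists in the normalized index.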

  200. 3-2 ANALYZERS

  201. ANALYSIS
    ▫︎Tokenizes a block of text into terms
    ▫︎Normalizes terms to standard form
    ▫︎Improves searchability

  202. ANALYZERS
    ▫︎Pipeline:
    ▫︎Character filters
    ▫︎Tokenizer
    ▫︎Token filters
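
The three pipeline stages can be sketched as plain Python functions. This only illustrates the shape of the pipeline; the filters below are simplified stand-ins chosen for the example, not the ones Elasticsearch ships.

```python
import re

def strip_html(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def word_tokenizer(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "an"})):
    """Token filter: drop a tiny, hypothetical stopword list."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<p>The quick brown fox jumped over the lazy dog.</p>",
    [strip_html], word_tokenizer, [lowercase, remove_stopwords],
)
print(tokens)
```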

  203. BUILT-IN ANALYZERS
    ▫︎Standard analyzer
    ▫︎Language-specific analyzers
    ▫︎30+ languages supported

  204. GET /_analyze?analyzer=standard
    The quick brown fox jumped
    over the lazy dog.
    TESTING THE STANDARD ANALYZER

  205. TESTING THE STANDARD ANALYZER
    curl -X GET -d @part-3/quick-brown-fox.txt \
    'localhost:9200/_analyze?analyzer=standard&pretty'

  206. "tokens" : [
    {"token": "the", …},
    {"token": "quick", …},
    {"token": "brown", …},
    {"token": "fox", …},
    {"token": "jumped", …},
    {"token": "over", …},
    {"token": "the", …},
    {"token": "lazy", …},
    {"token": "dog", …}
    ]
    TESTING THE STANDARD ANALYZER

  207. GET /_analyze?analyzer=english
    The quick brown fox jumped
    over the lazy dog.
    TESTING THE ENGLISH ANALYZER

  208. TESTING THE ENGLISH ANALYZER
    curl -X GET -d @part-3/quick-brown-fox.txt \
    'localhost:9200/_analyze?analyzer=english&pretty'

  209. "tokens" : [
    {"token": "quick", …},
    {"token": "brown", …},
    {"token": "fox", …},
    {"token": "jump", …},
    {"token": "over", …},
    {"token": "lazi", …},
    {"token": "dog", …}
    ]
    TESTING THE ENGLISH ANALYZER

  210. GET /_analyze?analyzer=brazilian
    A rápida raposa marrom pulou
    sobre o cachorro preguiçoso.
    TESTING THE BRAZILIAN ANALYZER

  211. TESTING THE BRAZILIAN ANALYZER
    curl -X GET -d @part-3/raposa-rapida.txt \
    'localhost:9200/_analyze?analyzer=brazilian&pretty'

  212. "tokens" : [
    {"token": "rap", …},
    {"token": "rapos", …},
    {"token": "marrom", …},
    {"token": "pul", …},
    {"token": "cachorr", …},
    {"token": "preguic", …}
    ]
    TESTING THE BRAZILIAN ANALYZER

  213. STEMMERS
    ▫︎Algorithmic stemmers:
    ▫︎Faster
    ▫︎Less precise
    ▫︎Dictionary stemmers:
    ▫︎Slower
    ▫︎More precise

  214. 3-3 MAPPING

  215. MAPPING
    ▫︎Every document has a type
    ▫︎Every type has its own mapping
    ▫︎A mapping defines:
    ▫︎The fields
    ▫︎The datatype for each field

  216. MAPPING
    ▫︎Elasticsearch guesses the mapping when
    a new field is added
    ▫︎Should customize the mapping for
    improved search and performance
    ▫︎The mapping must be customized when
    the type is created

  217. MAPPING
    ▫︎An existing field's mapping cannot be changed
    ▫︎You can still add new fields
    ▫︎To change an existing field, the only option is to reindex all documents
    ▫︎Reindexing with zero-downtime:
    ▫︎index aliases
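
A sketch of the alias swap that makes zero-downtime reindexing possible: the application always talks to an alias, and once the new index is filled, the alias is moved in one `POST /_aliases` request. The helper name and the index and alias names below are hypothetical.

```python
import json

def alias_swap_actions(alias, old_index, new_index):
    """Build a POST /_aliases body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Hypothetical names: the application only ever queries the alias "tweets".
body = alias_swap_actions("tweets", "tweets_v1", "tweets_v2")
print(json.dumps(body, indent=2))
```

Both actions are sent in a single request, so searches against the alias never see a moment where it points at no index.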

  218. CORE FIELD TYPES
    ▫︎String
    ▫︎Integer
    ▫︎Floating-point
    ▫︎Boolean
    ▫︎Date
    ▫︎Inner Objects

  219. GET /{index}/_mapping/{type}
    VIEWING THE MAPPING

  220. VIEWING THE MAPPING
    curl -X GET 'localhost:9200/gb/_mapping/tweet?pretty'

  221. "date": {
    "type": "date",
    "format":
    "strict_date_optional_time…"
    },
    "name": {
    "type": "string"
    },
    "tweet": {
    "type": "string"
    },
    "user_id": {
    "type": "long"
    }
    VIEWING THE MAPPING

  222. CUSTOMIZING FIELD MAPPINGS
    ▫︎Distinguish between:
    ▫︎Full-text string fields
    ▫︎Exact value string fields
    ▫︎Use language-specific analyzers

  223. STRING MAPPING ATTRIBUTES
    ▫︎index:
    ▫︎analyzed (full-text search, default)
    ▫︎not_analyzed (exact value)
    ▫︎analyzer:
    ▫︎standard (default)
    ▫︎english
    ▫︎…

  224. PUT /gb,us/_mapping/tweet
    {
    "properties": {
    "description": {
    "type": "string",
    "index": "analyzed",
    "analyzer": "english"
    }
    }
    }
    ADDING NEW SEARCHABLE FIELD

  225. ADDING NEW SEARCHABLE FIELD
    curl -X PUT -d @part-3/add-new-mapping.json \
    'localhost:9200/gb,us/_mapping/tweet?pretty'

  226. ADDING NEW SEARCHABLE FIELD
    curl -X GET 'localhost:9200/us,gb/_mapping/tweet?pretty'


  227. "description": {
    "type": "string",
    "analyzer": "english"
    }…
    ADDING NEW SEARCHABLE FIELD

  228. 3-4 PROXIMITY
    MATCHING

  229. THE PROBLEM
    ▫︎Sue ate the alligator
    ▫︎The alligator ate Sue
    ▫︎Sue never goes anywhere without her
    alligator-skin purse

  230. THE PROBLEM
    ▫︎Search for “sue alligator” would match
    all three
    ▫︎Sue and alligator may be separated by
    paragraphs of other text

  231. HEURISTIC
    ▫︎Words that appear near each other are
    probably related
    ▫︎Give documents in which the words are
    close together a higher relevance score

  232. GET /_analyze?analyzer=standard
    Quick brown fox.
    TERM POSITIONS

  233. "tokens": [
    { "token": "quick", …
    "position": 1 },
    { "token": "brown", …
    "position": 2 },
    { "token": "fox", …
    "position": 3 }
    ]
    TERM POSITIONS

  234. GET /{index}/{type}/_search
    {
    "query": {
    "match_phrase": {
    "title": "quick brown fox"
    }
    }
    }
    EXACT PHRASE MATCHING

  235. EXACT PHRASE MATCHING
    ▫︎quick, brown and fox must all appear
    ▫︎The position of brown must be 1 greater
    than the position of quick
    ▫︎The position of fox must be 2 greater
    than the position of quick
    235
    quick brown fox
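
The position rule above can be sketched in Python. This is a toy model of exact phrase matching over a list of already-analyzed tokens; positions are 0-based here, and the function names are made up.

```python
def positions(tokens):
    """Map each term to the list of positions where it occurs."""
    table = {}
    for pos, term in enumerate(tokens):
        table.setdefault(term, []).append(pos)
    return table

def matches_phrase(doc_tokens, phrase_tokens):
    """True if the phrase terms occur at consecutive positions in the document."""
    table = positions(doc_tokens)
    first, rest = phrase_tokens[0], phrase_tokens[1:]
    for start in table.get(first, []):
        # Every following term must sit exactly `offset` positions after the first.
        if all(start + offset in table.get(term, [])
               for offset, term in enumerate(rest, start=1)):
            return True
    return False

doc = ["the", "quick", "brown", "fox"]
print(matches_phrase(doc, ["quick", "brown", "fox"]))
print(matches_phrase(doc, ["quick", "fox"]))
```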


  236. FLEXIBLE PHRASE MATCHING
    ▫︎Exact phrase matching is too strict
    ▫︎“quick fox” should also match
    ▫︎Slop matching
    236
    quick brown fox


  237. "query": {
    "match_phrase": {
    "title": {
    "query": "quick fox",
    "slop": 1
    }
    }
    }
    FLEXIBLE PHRASE MATCHING

  238. SLOP MATCHING
    ▫︎How many times are you allowed to
    move a term to make the query and the
    document match?
    ▫︎Slop(n)

  239. SLOP MATCHING
                Pos 1    Pos 2    Pos 3
    Document:   quick    brown    fox
    Query:      quick    fox
    Slop(1):    quick             fox


  240. SLOP MATCHING
                Pos 1        Pos 2    Pos 3
    Document:   quick        brown    fox
    Query:      fox          quick
    Slop(1):    fox|quick
    Slop(2):    quick        fox
    Slop(3):    quick                 fox
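
The two diagrams can be reproduced with a small Python sketch. It is a simplified model, not the engine's actual scorer: it looks only at the first occurrence of each term, and the function name `required_slop` is made up.

```python
def required_slop(doc_tokens, query_tokens):
    """Smallest slop for which the query matches, or None if a term is missing.

    Simplification: only the first occurrence of each term is considered.
    """
    shifts = []
    for query_pos, term in enumerate(query_tokens):
        if term not in doc_tokens:
            return None
        # How far this term must shift from its query position to its document position.
        shifts.append(doc_tokens.index(term) - query_pos)
    # Terms that shift by the same amount keep their relative order and distance.
    return max(shifts) - min(shifts)

doc = ["quick", "brown", "fox"]
print(required_slop(doc, ["quick", "fox"]))
print(required_slop(doc, ["fox", "quick"]))
```

Reversing the order of the terms costs more moves, which is why `fox quick` needs a larger slop than `quick fox`.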


  241. 3-5 FUZZY MATCHING

  242. FUZZY MATCHING
    ▫︎quick brown fox → fast brown foxes
    ▫︎Johnny Walker → Johnnie Walker
    ▫︎Shcwarzenneger → Schwarzenegger

  243. DAMERAU-LEVENSHTEIN EDIT DISTANCE
    ▫︎One-character edits:
    ▫︎Substitution
    ▫︎Insertion
    ▫︎Deletion
    ▫︎Transposition of two adjacent
    characters

  244. DAMERAU-LEVENSHTEIN EDIT DISTANCE
    ▫︎One-character substitution:
    ▫︎ fox → box

  245. DAMERAU-LEVENSHTEIN EDIT DISTANCE
    ▫︎Insertion of a new character:
    ▫︎sic → sick

  246. DAMERAU-LEVENSHTEIN EDIT DISTANCE
    ▫︎Deletion of a character:
    ▫︎black → back

  247. DAMERAU-LEVENSHTEIN EDIT DISTANCE
    ▫︎Transposition of two adjacent
    characters:
    ▫︎star → tsar

  248. DAMERAU-LEVENSHTEIN EDIT DISTANCE
    ▫︎Converting bieber into beaver
    1. Substitute: bieber → biever
    2. Substitute: biever → baever
    3. Transpose: baever → beaver
    ▫︎Edit distance of 3
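
The four edit operations can be checked with a short Python sketch. It implements the optimal string alignment variant (a restricted form of Damerau-Levenshtein), which is enough for the examples in these slides; the function name is made up.

```python
def edit_distance(a, b):
    """Optimal string alignment distance.

    Substitution, insertion, deletion and transposition of two adjacent
    characters each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

for source, target in [("fox", "box"), ("sic", "sick"), ("black", "back"),
                       ("star", "tsar"), ("bieber", "beaver")]:
    print(source, target, edit_distance(source, target))
```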

  249. FUZZINESS
    ▫︎80% of human misspellings have an Edit
    Distance of 1
    ▫︎Elasticsearch supports a maximum Edit
    Distance of 2
    ▫︎fuzziness operator

  250. FUZZINESS EXAMPLE
    ./part-3/load-surprise-data.sh

  251. GET /example/surprise/_search
    {
    "query": {
    "match": {
    "text": {
    "query": "surprize"
    }
    }
    }
    }
    QUERY WITHOUT FUZZINESS

  252. QUERY WITHOUT FUZZINESS
    curl -X GET -d @part-3/surprize-query.json \
    'localhost:9200/example/surprise/_search?pretty'

  253. "hits": {
    "total": 0,
    "max_score": null,
    "hits": [ ]
    }
    QUERY WITHOUT FUZZINESS

  254. GET /example/surprise/_search
    {
    "query": {
    "match": {
    "text": {
    "query": "surprize",
    "fuzziness": "1"
    }
    }
    }
    }
    QUERY WITH FUZZINESS

  255. QUERY WITH FUZZINESS
    curl -X GET -d @part-3/surprize-fuzzy-query.json \
    'localhost:9200/example/surprise/_search?pretty'

  256. "hits": [ {
    "_index": "example",
    "_type": "surprise",
    "_id": "1",
    "_score": 0.19178301,
    "_source":{ "text": "Surprise me!"}
    }]
    QUERY WITH FUZZINESS

  257. AUTO-FUZZINESS
    ▫︎0 for strings of one or two characters
    ▫︎1 for strings of three, four or five
    characters
    ▫︎2 for strings of more than five
    characters
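
The length rule above, written as a small Python function. The name `auto_fuzziness` is made up; the thresholds are exactly the ones listed on this slide.

```python
def auto_fuzziness(term):
    """Maximum edit distance allowed for a term, following the thresholds above."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2

for term in ["at", "fox", "quick", "surprize"]:
    print(term, auto_fuzziness(term))
```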

  258. PART 4
    Data modeling

  259. 4-1 INSIDE A CLUSTER

  260. NODES AND CLUSTERS
    ▫︎A node is a running instance of
    Elasticsearch
    ▫︎A cluster is a set of nodes in the same
    network and with the same cluster name

  261. SHARDS
    ▫︎A node stores data inside its shards
    ▫︎Shards are the smallest unit of scale and
    replication
    ▫︎Each shard is a completely independent
    Lucene index

  262. AN EMPTY CLUSTER

  263. GET /_cluster/health
    CLUSTER HEALTH

  264. "cluster_name": "elasticsearch",
    "status": "green",
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 0,
    "active_shards": 0,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0
    CLUSTER HEALTH

  265. PUT /blogs
    "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
    }
    ADD AN INDEX

  266. ADD AN INDEX

  267. GET /_cluster/health
    CLUSTER HEALTH

  268. "cluster_name": "elasticsearch",
    "status": "yellow",
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 3,
    "active_shards": 3,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 3
    CLUSTER HEALTH

  269. ADD A BACKUP NODE

  270. GET /_cluster/health
    CLUSTER HEALTH

  271. "cluster_name": "elasticsearch",
    "status": "green",
    "number_of_nodes": 2,
    "number_of_data_nodes": 2,
    "active_primary_shards": 3,
    "active_shards": 6,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0
    CLUSTER HEALTH
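
The yellow and green states shown so far follow from simple arithmetic, sketched below in Python. It is a simplified model that assumes a single index and that a node never holds two copies of the same shard; the function name is made up.

```python
def cluster_health(nodes, primaries, replicas):
    """Simplified shard accounting for one index.

    Assumption: each copy of a shard (primary or replica) needs its own node.
    """
    copies_wanted = 1 + replicas
    copies_assigned = min(copies_wanted, nodes)
    active = primaries * copies_assigned
    unassigned = primaries * copies_wanted - active
    if primaries and copies_assigned == 0:
        status = "red"      # some primary has no home
    elif unassigned:
        status = "yellow"   # all primaries active, some replicas are not
    else:
        status = "green"    # every copy is assigned
    return {
        "status": status,
        "active_primary_shards": primaries if copies_assigned else 0,
        "active_shards": active,
        "unassigned_shards": unassigned,
    }

print(cluster_health(nodes=1, primaries=3, replicas=1))
print(cluster_health(nodes=2, primaries=3, replicas=1))
```

With one node the three replicas have nowhere to go; adding a second node lets all six shard copies become active.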

  272. THREE NODES

  273. PUT /blogs/_settings
    {
    "number_of_replicas": 2
    }
    INCREASING THE NUMBER OF REPLICAS

  274. INCREASING THE NUMBER OF REPLICAS

  275. NODE 1 FAILS

  276. CREATING, INDEXING AND DELETING A DOCUMENT

  277. RETRIEVING A DOCUMENT

  278. 4-2 RELATIONSHIPS

  279. RELATIONSHIPS MATTER
    ▫︎Blog Posts ↔ Comments
    ▫︎Bank Accounts ↔ Transactions
    ▫︎Orders ↔ Items
    ▫︎Directories ↔ Files
    ▫︎…

  280. SQL DATABASES
    ▫︎Entities have a unique primary key
    ▫︎Normalization:
    ▫︎Entity data is stored only once
    ▫︎Entities are referenced by primary key
    ▫︎Updates happen in only one place

  281. ▫︎Entities are joined at query time
    SQL DATABASES
    SELECT Customer.name, Order.status
    FROM Order, Customer
    WHERE Order.customer_id = Customer.id

  282. SQL DATABASES
    ▫︎Changes are ACID
    ▫︎Atomicity
    ▫︎Consistency
    ▫︎Isolation
    ▫︎Durability

  283. ATOMICITY
    ▫︎If one part of the transaction fails, the
    entire transaction fails
    ▫︎…even in the event of power failure,
    crashes or errors
    ▫︎“all or nothing”

  284. CONSISTENCY
    ▫︎Any transaction will bring the database
    from one valid state to another
    ▫︎State must be valid according to all
    defined rules:
    ▫︎Constraints
    ▫︎Cascades
    ▫︎Triggers

  285. ISOLATION
    ▫︎The concurrent execution of transactions
    results in the same state that would be
    obtained if transactions were executed
    serially
    ▫︎Concurrency Control

  286. DURABILITY
    ▫︎A transaction will remain committed
    ▫︎…even in the event of power failure,
    crashes or errors
    ▫︎Non-volatile memory

  287. SQL DATABASES
    ▫︎Joining entities at query time is
    expensive
    ▫︎Impractical with multiple nodes

  288. ELASTICSEARCH
    ▫︎Treats the world as flat
    ▫︎An index is a flat collection of
    independent documents
    ▫︎A single document should contain all
    information to match a search request

  289. ELASTICSEARCH
    ▫︎ACID support for changes on single
    documents
    ▫︎No ACID transactions on multiple
    documents

  290. ELASTICSEARCH
    ▫︎Indexing and searching are fast and
    lock-free
    ▫︎Massive amounts of data can be spread
    across multiple nodes

  291. ELASTICSEARCH
    ▫︎But we need relationships!

  292. ELASTICSEARCH
    ▫︎Application-side joins
    ▫︎Data denormalization
    ▫︎Nested objects
    ▫︎Parent/child relationships

  293. 4-3 APPLICATION-SIDE
    JOINS

  294. APPLICATION-SIDE JOINS
    ▫︎Emulates a relational database
    ▫︎Joins at application level
    ▫︎(index, type, id) = primary key

  295. PUT /example/user/1
    {
    "name": "John Smith",
    "email": "[email protected]",
    "born": "1970-10-24"
    }
    EXAMPLE

  296. PUT /example/blogpost/2
    {
    "title": "Relationships",
    "body": "It's complicated",
    "user": 1
    }
    EXAMPLE

  297. EXAMPLE
    ▫︎(example, user, 1) = primary key
    ▫︎Store only the id
    ▫︎Index and type are hard-coded into the
    application logic

  298. GET /example/blogpost/_search
    "query": {
    "filtered": {
    "filter": {
    "term": { "user": 1 }
    }
    }
    }
    EXAMPLE

  299. EXAMPLE
    ▫︎Blogposts written by “John”:
    ▫︎Find ids of users with name “John”
    ▫︎Find blogposts that match the user ids

  300. GET /example/user/_search
    "query": {
    "match": {
    "name": "John"
    }
    }
    EXAMPLE

  301. ▫︎For each user id from the first query:
    GET /example/blogpost/_search
    "query": {
    "filtered": {
    "filter": {
    "term": { "user": }
    }
    }
    }
    EXAMPLE
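
The two-step join can be sketched with plain Python dictionaries standing in for the two types. This is an in-memory illustration only; apart from John Smith and the "Relationships" post, the sample data and function names are made up.

```python
# In-memory stand-ins for the "user" and "blogpost" types (mostly hypothetical data).
users = {
    1: {"name": "John Smith"},
    2: {"name": "Alice White"},
    3: {"name": "John Doe"},
}
blogposts = {
    2: {"title": "Relationships", "user": 1},
    3: {"title": "Hiking", "user": 2},
    4: {"title": "Gardening", "user": 3},
}

def find_user_ids(name):
    """First query: ids of users whose name contains the given word."""
    word = name.lower()
    return [uid for uid, user in users.items() if word in user["name"].lower().split()]

def find_blogposts_by_users(user_ids):
    """Second query: titles of blog posts written by any of the given users."""
    wanted = set(user_ids)
    return [post["title"] for post in blogposts.values() if post["user"] in wanted]

john_ids = find_user_ids("John")
titles = find_blogposts_by_users(john_ids)
print(john_ids, titles)
```

The join happens in the application: the result of the first lookup is fed into the second one.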

  302. ADVANTAGES
    ▫︎Data is normalized
    ▫︎Change user data in just one place

  303. DISADVANTAGES
    ▫︎Run extra queries to join documents
    ▫︎We could have millions of users named
    “John”
    ▫︎Less efficient than SQL joins:
    ▫︎Several API requests
    ▫︎Harder to optimize

  304. WHEN TO USE
    ▫︎First entity has a small number of
    documents and they hardly change
    ▫︎First query results can be cached

  305. 4-4 DATA
    DENORMALIZATION

  306. DATA DENORMALIZATION
    ▫︎No joins
    ▫︎Store redundant copies of the data you
    need to query

  307. PUT /example/user/1
    {
    "name": "John Smith",
    "email": "[email protected]",
    "born": "1970-10-24"
    }
    EXAMPLE

  308. PUT /example/blogpost/2
    {
    "title": "Relationships",
    "body": "It's complicated",
    "user": {
    "id": 1,
    "name": "John Smith"
    }
    }
    EXAMPLE

  309. GET /example/blogpost/_search
    "query": {
    "bool": {
    "must": [
    { "match": {
    "title": "relationships" }},
    { "match": {
    "user.name": "John" }}
    ]}}
    EXAMPLE

  310. ADVANTAGES
    ▫︎Speed
    ▫︎No need for expensive joins

  311. DISADVANTAGES
    ▫︎Uses more disk space (cheap)
    ▫︎Update the same data in several places
    ▫︎scroll and bulk APIs can help
    ▫︎Concurrency issues
    ▫︎Locking can help

  312. WHEN TO USE
    ▫︎Need for fast search
    ▫︎Denormalized data does not change
    very often

  313. 4-5 NESTED OBJECTS

  314. MOTIVATION
    ▫︎Elasticsearch supports ACID when
    updating single documents
    ▫︎Querying related data in the same
    document is faster (no joins)
    ▫︎We want to avoid denormalization

  315. PUT /example/blogpost/1
    {
    "title": "Nest eggs",
    "body": "Making money...",
    "tags": [ "cash", "shares" ],
    "comments": […]
    }
    THE PROBLEM WITH MULTILEVEL OBJECTS

  316. [{
    "name": "John Smith",
    "comment": "Great article",
    "age": 28, "stars": 4,
    "date": "2014-09-01"
    }, {
    "name": "Alice White",
    "comment": "More like this",
    "age": 31,"stars": 5,
    "date": "2014-10-22"
    }]
    THE PROBLEM WITH MULTILEVEL OBJECTS

  317. GET /example/blogpost/_search
    "query": {
    "bool": {
    "must": [
    {"match": {"name": "Alice"}},
    {"match": {"age": "28"}}
    ]}}
    THE PROBLEM WITH MULTILEVEL OBJECTS

  318. [{
    "name": "John Smith",
    "comment": "Great article",
    "age": 28, "stars": 4,
    "date": "2014-09-01"
    }, {
    "name": "Alice White",
    "comment": "More like this",
    "age": 31,"stars": 5,
    "date": "2014-10-22"
    }]
    THE PROBLEM WITH MULTILEVEL OBJECTS

  319. THE PROBLEM WITH MULTILEVEL OBJECTS
    ▫︎Alice is 31, not 28!
    ▫︎It matched the age of John
    ▫︎This is because indexed documents are
    stored as a flattened dictionary
    ▫︎The correlation between Alice and 31 is
    irretrievably lost

  320. {"title": [eggs, nest],
    "body": [making, money],
    "tags": [cash, shares],
    "comments.name":
    [alice, john, smith, white],
    "comments.comment":
    [article, great, like, more, this],
    "comments.age": [28, 31],
    "comments.stars": [4, 5],
    "comments.date":
    [2014-09-01, 2014-10-22]}
    THE PROBLEM WITH MULTILEVEL OBJECTS

  321. NESTED OBJECTS
    ▫︎Nested objects are indexed as hidden
    separate documents
    ▫︎Relationships are preserved
    ▫︎Joining nested documents is very fast

  322. {"comments.name": [john, smith],
    "comments.comment": [article, great],
    "comments.age": [28],
    "comments.stars": [4],
    "comments.date": [2014-09-01]}
    {"comments.name": [alice, white],
    "comments.comment": [like, more, this],
    "comments.age": [31],
    "comments.stars": [5],
    "comments.date": [2014-10-22]}
    NESTED OBJECTS

  323. {
    "title": [eggs, nest],
    "body": [making, money],
    "tags": [cash, shares]
    }
    NESTED OBJECTS
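
The difference between flattened inner objects and nested objects can be sketched in Python. This is a toy model restricted to the `name` and `age` fields; the function names are made up.

```python
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]

def flatten(objects):
    """Merge all objects into one dict of field -> set of values (plain inner objects)."""
    flat = {}
    for obj in objects:
        flat.setdefault("name", set()).update(obj["name"].lower().split())
        flat.setdefault("age", set()).add(obj["age"])
    return flat

def flat_match(objects, name, age):
    """Match against the flattened values: the link between name and age is lost."""
    flat = flatten(objects)
    return name in flat.get("name", set()) and age in flat.get("age", set())

def nested_match(objects, name, age):
    """Match each object on its own, as if it were a separate hidden document."""
    return any(flat_match([obj], name, age) for obj in objects)

print(flat_match(comments, "alice", 28))    # the false positive from the slides
print(nested_match(comments, "alice", 28))
print(nested_match(comments, "alice", 31))
```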

  324. NESTED OBJECTS
    ▫︎Need to be enabled by updating the
    mapping of the index

  325. PUT /example
    "mappings": {
    "blogpost": { "properties": {
    "comments": { "type": "nested",
    "properties": {
    "name": {"type": "string"},
    "comment": {"type": "string"},
    "age": {"type": "short"},
    "stars": {"type":"short"},
    "date": {"type": "date"}
    }}}}}
    MAPPING A NESTED OBJECT

  326. GET /example/blogpost/_search
    "query": {
    "bool": {
    "must": [
    {"match": {"title": "eggs"}},
    {"nested": }
    ]
    }
    }
    QUERYING A NESTED OBJECT

  327. "nested": {
    "path": "comments",
    "query": {
    "bool": {
    "must": [
    {"match":
    {"comments.name": "john"}},
    {"match":
    {"comments.age": 28}}
    ]}}}
    NESTED QUERY

  328. THERE’S MORE
    ▫︎Nested filters
    ▫︎Nested aggregations
    ▫︎Sorting by nested fields

  329. ADVANTAGES
    ▫︎Very fast query-time joins
    ▫︎ACID support (single documents)
    ▫︎Convenient search using nested queries

  330. DISADVANTAGES
    ▫︎To add, change or delete a nested
    object, the whole document must be
    reindexed
    ▫︎Search requests return the whole
    document

  331. WHEN TO USE
    ▫︎When there is one main entity with a
    limited number of closely related entities
    ▫︎Ex: blogposts and comments
    ▫︎Inefficient if there are too many nested
    objects

  332. 4-6 PARENT-CHILD
    RELATIONSHIP

  333. PARENT-CHILD RELATIONSHIP
    ▫︎One-to-many relationship
    ▫︎Similar to the nested model
    ▫︎Nested objects live in the same
    document
    ▫︎Parent and children are completely
    separate documents

  334. EXAMPLE
    ▫︎Company with branches and employees
    ▫︎Branch is the parent
    ▫︎Employees are children

  335. PUT /company
    "mappings": {
    "branch": {},
    "employee": {
    "_parent": {
    "type": "branch"
    }
    }
    }
    EXAMPLE

  336. PUT /company/branch/london
    {
    "name": "London Westminster",
    "city": "London",
    "country": "UK"
    }
    EXAMPLE

  337. PUT /company/employee/1?parent=london
    {
    "name": "Alice Smith",
    "born": "1970-10-24",
    "hobby": "hiking"
    }
    EXAMPLE

  338. GET /company/branch/_search
    "query": {
    "has_child": {
    "type": "employee",
    "query": {
    "range": {
    "born": {
    "gte": "1980-01-01" }
    }}}}
    FINDING PARENTS BY THEIR CHILDREN

  339. GET /company/employee/_search
    "query": {
    "has_parent": {
    "type": "branch",
    "query": {
    "match": {
    "country": "UK" }
    }}}
    FINDING CHILDREN BY THEIR PARENTS

  340. THERE’S MORE
    ▫︎min_children and max_children
    ▫︎Children aggregations
    ▫︎Grandparents and grandchildren

  341. ADVANTAGES
    ▫︎Parent document can be updated
    without reindexing the children
    ▫︎Child documents can be updated
    without affecting the parent
    ▫︎Child documents can be returned in
    search results without the parent

  342. ADVANTAGES
    ▫︎Parent and children live on the same
    shard
    ▫︎Faster than application-side joins

  343. DISADVANTAGES
    ▫︎Parent document and all of its children
    must live on the same shard
    ▫︎5 to 10 times slower than nested queries

  344. WHEN TO USE
    ▫︎One-to-many relationships
    ▫︎When index-time is more important
    than search-time performance
    ▫︎Otherwise, use nested objects

  345. REFERENCES

  346. MAIN REFERENCE
    ▫︎Elasticsearch: The Definitive Guide
    ▫︎Gormley & Tong
    ▫︎O'Reilly

  347. OTHER REFERENCES
    ▫︎"Jepsen: simulating network partitions in DBs", http://github.com/aphyr/jepsen
    ▫︎"Call me maybe: Elasticsearch 1.5.0", http://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
    ▫︎"Call me maybe: MongoDB stale reads", http://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads

  348. OTHER REFERENCES
    ▫︎"Elasticsearch Data Resiliency Status", http://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
    ▫︎"Solr vs. Elasticsearch — How to Decide?", http://blog.sematext.com/2015/01/30/solr-elasticsearch-comparison/

  349. OTHER REFERENCES
    ▫︎"Changing Mapping with Zero Downtime", http://www.elastic.co/blog/changing-mapping-with-zero-downtime

  350. Felipe Dornelas
    felipedornelas.com
    @felipead
    THANK YOU
