
Elasticsearch: The Missing Tutorial


Erik Rose

July 20, 2014


Transcript

  1. housekeeping
     • Make sure ES is installed. If you haven’t installed it yet and you’re on a Mac, just install 1.1.x.
     • Exercise code: clone the git repo at (or just visit) https://github.com/erikrose/oscon-elasticsearch/
     • Make faces.
  4. • Elasticsearch wraps Lucene.
     • Read/write/admin via REST (HTTP on port 9200).
     • Native format is JSON (vs. XML).
  5. CAP
     • CAP: consistency, availability, partition tolerance
     • “pick any two”
     • “When it comes to CAP, in a very high level, elasticsearch gives up on partition tolerance” (2010)
  6. CAP
     • …it’s not that simple.
     • Consistency is mostly eventual.
     • Availability is variable.
     • Partition tolerant it’s not.
     • Read http://aphyr.com/posts/317-call-me-maybe-elasticsearch (and despair).
  7. what it’s good for, redux
     • Generally not suitable as a primary data store.
     • It’s a distributed search engine:
       • Easy to get started
       • Easy to integrate with your existing web app
       • Easy to configure not-too-terribly
       • Enables fast search with cool features
  8. nodes and clusters
     • node — a machine in your cluster
     • cluster — the set of nodes running ES
     • master node — elected by the cluster. If the master fails, another node will take over.
  9. shards and replicas
     • shard — a Lucene index. Each piece of data you store is written to a primary shard. Primary shards are distributed over the cluster.
     • replica — each shard has a set of distributed replicas (copies). Data written to a primary shard is copied to replicas on different nodes.
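     (Not in the deck: a sketch of controlling those counts at index-creation time. The index name "test2" and the numbers are illustrative.)
     # number_of_shards is fixed once the index is created;
     # number_of_replicas can be changed on a live index.
     curl -s -XPUT 'http://127.0.0.1:9200/test2' -d '{
       "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
     }'
     curl -s -XPUT 'http://127.0.0.1:9200/test2/_settings' -d '{
       "index": { "number_of_replicas": 2 }
     }'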
  10. exercise: fix clustering and listening
      # Unicast discovery allows to explicitly control which nodes will be used
      # to discover the cluster. It can be used when multicast is not present,
      # or to restrict the cluster communication-wise.
      #
      # 1. Disable multicast discovery (enabled by default):
      # discovery.zen.ping.multicast.enabled: false

      # Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
      # on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
      # communication. (The range means that if the port is busy, it will automatically
      # try the next port.)
      # Set the bind address specifically (IPv4 or IPv6):
      # network.bind_host: 127.0.0.1
  11. exercise: start up and check
      % cd elasticsearch-1.2.2
      % bin/elasticsearch
      # On the Mac:
      % JAVA_HOME=$(/usr/libexec/java_home -v 1.7) bin/elasticsearch

      % curl -s -XGET 'http://127.0.0.1:9200/_cluster/health?pretty'
      {
        "cluster_name" : "grinchertoo",
        "status" : "yellow",
        "timed_out" : false,
        "number_of_nodes" : 1,
        "number_of_data_nodes" : 1,
        "active_primary_shards" : 19,
        "active_shards" : 19,
        "relocating_shards" : 0,
        "initializing_shards" : 0,
        "unassigned_shards" : 13
      }
  12. IDs

  13. exercise: make a doc
      # Make a doc:
      curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
        "title": "All About Fish",
        "author": "Fishy McFishstein",
        "pages": 3015
      }'

      # Make sure it's there:
      curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
      {
        "_index" : "test",
        "_type" : "book",
        "_id" : "1",
        "_version" : 2,
        "found" : true,
        "_source" : {
          "title": "All About Fish",
          "author": "Fishy McFishstein",
          "pages": 3015
        }
      }
  16. row → 0,1,3
      boat → 0,1
      chicken → 2

      doc 0: row row row your boat
      doc 1: row the row boat
      doc 2: chicken chicken chicken
      doc 3: the front row
  21. positions
      row → 0 [0,1,2]; 1 [0,2]; 3 [2]
      boat → 0 [4]; 1 [3]
      chicken → 2 [0,1,2]

      doc 0: row row row your boat
      doc 1: row the row boat
      doc 2: chicken chicken chicken
      doc 3: the front row
  27. positions
      row 232 → 0 [0,1,2]; 1 [0,2]; 3 [2]
      boat 78 → 0 [4]; 1 [3]
      chicken 91 → 2 [0,1,2]

      doc 0: row row row your boat
      doc 1: row the row boat
      doc 2: chicken chicken chicken
      doc 3: the front row
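      (Not in the deck: you can peek at this structure for a real doc via the term vectors API. A sketch; early 1.x needs "term_vector" enabled in the field's mapping, while later 1.x versions can compute it on the fly.)
      curl -s -XGET 'http://127.0.0.1:9200/test/book/1/_termvector?pretty' -d '{
        "fields": ["title"],
        "positions": true,
        "term_statistics": true
      }'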
  28. indices on properties
      "title": "All About Fish",       "author": "Fishy McFishstein",  "pages": 3015
      "title": "Nothing About Pigs",   "author": "Nopiggy Nopigman",   "pages": 0
      "title": "All About Everything", "author": "Everybody",          "pages": 4294967295
  29. inner objects
      curl -s -XPUT 'http://localhost:9200/test/book/1' -d '{
        "title": "All About Fish",
        "author": {
          "name": "Fisher McFishstein",
          "birthday": "1980-02-22",
          "favorite_color": "green"
        }
      }'

      Internally, the inner object is flattened into dotted field names:
      title: All About Fish
      author.name: Fisher McFishstein
      author.birthday: 1980-02-22
      author.favorite_color: green

      curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
      {
        "_index" : "test",
        "_type" : "book",
        "_id" : "1",
        "_version" : 1,
        "found" : true,
        "_source" : {
          "title": "All About Fish",
          "author": {
            "name": "Fisher McFishstein",
            "birthday": "1980-02-22",
            "favorite_color": "green"
          }
        }
      }
  33. arrays
      # Insert a doc containing an array:
      curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
        "title": "All About Fish",
        "tag": ["one", "two", "red", "blue"]
      }'

      Each element is indexed as a separate term pointing at doc 1:
      one → 1
      two → 1
      red → 1
      blue → 1
  35. exercise: array play
      # Insert a bunch of different docs, varying the tag value:
      # e.g. "red", ["blue"], ["one", "red"], "two"
      % curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
        "title": "All About Fish",
        "tag": ["one", "two", "red", "blue"]
      }'

      # A sample query--try changing the tag values too:
      % curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?pretty' -d '{
        "query": { "match_all": {} },
        "filter": { "term": {"tag": ["two", "three"]} }
      }'
  36. implicit mappings
      # Make a new album doc:
      curl -s -XPUT 'http://127.0.0.1:9200/test/album/1' -d '{
        "title": "Fish Sounds",
        "gapless_playback": true,
        "length_seconds": 210000,
        "weight": 1.22,
        "released": "2013-01-23"
      }'

      # See what kind of mapping ES guessed:
      curl -s -XGET 'http://127.0.0.1:9200/test/album/_mapping?pretty'
      {
        "test" : {
          "mappings" : {
            "album" : {
              "properties" : {
                "title" : { "type" : "string" },
                "gapless_playback" : { "type" : "boolean" },
                "length_seconds" : { "type" : "long" },
                "weight" : { "type" : "double" },
                "released" : { "type" : "date", "format" : "dateOptionalTime" }
              }
            }
          }
        }
      }
  43. explicit mappings
      # Delete the doctype first, so the mapping can be redefined:
      curl -s -XDELETE 'http://127.0.0.1:9200/test/album'

      # Then set the mapping explicitly:
      curl -s -XPUT 'http://127.0.0.1:9200/test/_mapping/album' -d '{
        "properties" : {
          "title" : { "type" : "string" },
          "gapless_playback" : { "type" : "boolean" },
          "length_seconds" : { "type" : "long" },
          "weight" : { "type" : "double" },
          "released" : { "type" : "date", "format" : "dateOptionalTime" }
        }
      }'
  47. exercise: use explicit mappings
      1. Delete the “album” doctype, if you’ve made one by following along.
      2. Think of an album which would prompt ES to guess a wrong type.
      3. Insert it, and GET the _mapping to show the wrong guess.
      4. Delete all “album” docs again so you can change the mapping.
      5. Set a mapping explicitly so you can’t fool ES anymore.
  48. queries
      • Query ES via HTTP/REST
      • Possible to do with the query string
      • The DSL is better
      • Let’s write some queries.
      • But first, let’s get some data in our cluster to query.
  49. exercise 1
      • Bulk load a small test data set to use for querying.
      • This is exercise_1 in the queries/ directory of the git repo, so you can cut and paste, or execute it directly.

      % curl -XPOST localhost:9200/_bulk --data-binary @data.bulk
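      (Not in the deck: the shape of a bulk file, sketched with docs from earlier slides. Each action line is followed by a source line, and the file must end with a newline.)
      {"index": {"_index": "test", "_type": "book", "_id": "1"}}
      {"title": "All About Fish", "author": "Fishy McFishstein", "pages": 3015}
      {"index": {"_index": "test", "_type": "book", "_id": "2"}}
      {"title": "Nothing About Pigs", "author": "Nopiggy Nopigman", "pages": 0}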
  50. exercise 2
      • Let’s check we can pull that data, by grabbing a single document.
      • This is exercise_2 in the queries/ directory of the repo, so you can cut and paste.

      % curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
  51. exercise 3
      • We’ll begin by using a URI search (sometimes called, a little fuzzily, a query string query).
      • (This is exercise_3.)

      % curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?q=title:Python'
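      (Not in the deck: a couple more standard URI-search parameters, as a sketch.)
      # Page the results:
      % curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?q=title:Python&size=5&from=0'
      # Sort by a field instead of relevance:
      % curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?q=title:Python&sort=pages:desc'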
  52. limited appeal
      • Passes searches via GET in the query string.
      • This is fine for running simple queries, basic “is it working” type tests, and so on.
      • Once you have any level of complexity in your query, you’re going to need the query DSL.
  53. query DSL
      • DSL == Domain Specific Language
      • The DSL is an AST (abstract syntax tree) of queries.
      • What does that actually mean?
      • Write your queries in JSON, which can be arbitrarily complex.
  54. { "query" : { "match" : { "title" : "Python"

    } } } simple DSL term query
  55. exercise 4
      • Run this query (exercise 4).
      % curl -XGET 'http://localhost:9200/test/book/_search' -d '{
        "query" : { "match" : { "title" : "Python" } }
      }'
      (What do you notice about the results?)
  56. queries vs. filters
      • Filters:
        • Boolean: a document matches or it does not
        • An order of magnitude faster than queries
        • Use for exact values
        • Cacheable
  57. queries vs. filters
      • Queries:
        • Use for full-text searches
        • Relevance-scored
      Filter when you can; query when you must.
  58. use them together!
      curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d '{
        "query": {
          "filtered": {
            "filter": { "term": { "category": "Web Development" } },
            "query": {
              "bool": {
                "should": [
                  { "match": { "title": "Python" } },
                  { "match": { "summary": "Python" } }
                ]
              }
            }
          }
        }
      }'
  59. exercise 6
      • Similar to many relational databases, Elasticsearch supports an explainer. Let’s run it on this query.
      • (This is exercise_6.)

      curl -XGET -s 'http://localhost:9200/test/book/4/_explain?pretty=true' -d '{
        "query": {
          "filtered": {
            "filter": { "term": { "category": "Web Development" } },
            "query": { …
  60. exercise 6 results
      {
        "_index" : "test",
        "_type" : "book",
        "_id" : "4",
        "matched" : false,
        "explanation" : {
          "value" : 0.0,
          "description" : "failure to match filter: cache(category:Web Development)",
          "details" : [ { …
  61. analyze that!
      • This is a classic beginner gotcha.
      • Using the standard analyzer, applied to all fields by default, “Web Development” will be broken into the terms “web” and “development”, and those will be indexed.
      • The term “Web Development” is not indexed anywhere.
  62. but match works!
      • term queries or filters look for an exact match, so they find nothing.
      • But {"match" : "Web Development"} does work. Why?
      • match queries or filters use analysis: they break this down into searches for “web” or “development”.
  63. exercise 7
      • Let’s make it work.
      • One solution is in exercise_7.
      • Take a couple minutes before peeking.
      • TMTOWTDI
  64. summary: term vs. match
      • Term queries look for the whole term and are not analyzed.
      • Match queries are analyzed, and look for matches to the analyzed parts of the query.
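      (Not in the deck: the distinction, sketched against the indexed tokens.)
      # Finds nothing: no single term "Web Development" exists in the index.
      curl -s -XGET 'http://localhost:9200/test/book/_search' -d '{
        "query": { "term": { "category": "Web Development" } }
      }'
      # Works: matches the lowercased token the standard analyzer actually indexed.
      curl -s -XGET 'http://localhost:9200/test/book/_search' -d '{
        "query": { "term": { "category": "web" } }
      }'
      # Also works: match analyzes its input the same way the field was analyzed.
      curl -s -XGET 'http://localhost:9200/test/book/_search' -d '{
        "query": { "match": { "category": "Web Development" } }
      }'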
  65. match_phrase
      curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d '{
        "query": {
          "match_phrase": {
            "summary": {
              "query": "old versions of browsers",
              "slop": 2
            }
          }
        }
      }'
  66. boolean queries
      • Where are my favorites, AND, OR, and NOT?
      • Tortured syntax of the bool query:
        • must: everything in the must clause is ANDed
        • should: everything in the should clause is ORed
        • must_not: you guessed it.
      • Nest them as much as you like.
  67. "query": { "bool": { "must": { "bool": { "should": [

    { "match": { "category": "development" } }, { "match": { "category": "programming" } } ] } }, "should": [ { "match": {…
  68. exercise 8
      • Run this query; it’s in exercise_8.
      • Can you modify it to find books for intermediate or above level programmers?
  69. aggregations
      • Aggregations let you put returned documents into buckets and run metrics over those buckets.
      • Useful for drill-down navigation of data.
  70. exercise 9
      • Run a sample aggregation (exercise_9).
      curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d '{
        "size" : 0,
        "aggs" : {
          "category" : {
            "terms" : { "field" : "category" }
          }
        }
      }'
  71. scoring
      • You can affect the way ES calculates relevance scores for results. For example:
        • Boost: weigh one part of a query more heavily than others
        • Custom function-scoring queries: e.g. weighting more complete user profiles
        • Constant-score queries: pre-set a score for part of a query (useful for filters!)
  72. boosting
      "query": {
        "bool": {
          "should": [
            { "term": { "title": { "value": "python", "boost": 2.0 } } },
            { "term": { "summary": "python" } }
          ]
        }
      }
  73. function scoring
      curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d '{
        "query": {
          "function_score": {
            "query": { "match": { "title": "Python" } },
            "script_score": {
              "script": "_score * doc[\"rating\"].value"
            }
          }
        }
      }'
  74. scripting languages
      • You have various options for writing your functions:
        • The default used to be MVEL but is now Groovy.
        • Plugins for: JS, Python, Clojure, MVEL
  75. stock analyzers
      original:   Red-orange gerbils live at #43A Franklin St.
      whitespace: Red-orange gerbils live at #43A Franklin St.
      standard:   red orange gerbils live 43a franklin st
      simple:     red orange gerbils live at a franklin st
      stop:       red orange gerbils live franklin st
      snowball:   red orang gerbil live 43a franklin st

      • stopwords
      • stemming
      • punctuation
      • case-folding
  80. curl -XGET -s 'http://localhost:9200/_analyze?analyzer=whitespace&pretty=true' -d 'Red-orange gerbils live at #43A Franklin St.'
      {
        "tokens" : [ {
          "token" : "Red-orange",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "word",
          "position" : 1
        }, {
          "token" : "gerbils",
          "start_offset" : 11,
          "end_offset" : 18,
          "type" : "word",
          "position" : 2
        }, ...
  81. exercise: find 10 stopwords
      curl -XGET -s 'http://localhost:9200/_analyze?analyzer=stop&pretty=true' -d 'The word "an" is a stopword.'
      Hint: Run the above and see what happens.
  82. solution: find 10 stopwords
      curl -XGET -s 'http://localhost:9200/_analyze?analyzer=stop&pretty=true' -d 'The an is a with that be for to and snookums'
      {
        "tokens" : [ {
          "token" : "snookums",
          "start_offset" : 36,
          "end_offset" : 44,
          "type" : "word",
          "position" : 11
        } ]
      }
  84. applying mappings to properties
      curl -s -XPUT 'http://127.0.0.1:9200/test/_mapping/album' -d '{
        "properties": {
          "title": { "type": "string" },
          "description": { "type": "string", "analyzer": "snowball" },
          ...
        }
      }'
  85. "analysis": { "analyzer": { "name_analyzer": { "type": "custom", "tokenizer": "name_tokenizer",

    "filter": ["lowercase"] } }, "tokenizer": { "name_tokenizer": { "type": "pattern", "pattern": "[^a-zA-Z']+" } } } name_analyzer CharFilter Tokenizer Token Filter terms
  86. "analysis": { "analyzer": { "name_analyzer": { "type": "custom", "tokenizer": "name_tokenizer",

    "filter": ["lowercase"] } }, "tokenizer": { "name_tokenizer": { "type": "pattern", "pattern": "[^a-zA-Z']+" } } } name_analyzer CharFilter Tokenizer Token Filter terms x
  87. "analysis": { "analyzer": { "name_analyzer": { "type": "custom", "tokenizer": "name_tokenizer",

    "filter": ["lowercase"] } }, "tokenizer": { "name_tokenizer": { "type": "pattern", "pattern": "[^a-zA-Z']+" } } } name_analyzer CharFilter Tokenizer Token Filter terms x
  88. "analysis": { "analyzer": { "name_analyzer": { "type": "custom", "tokenizer": "name_tokenizer",

    "filter": ["lowercase"] } }, "tokenizer": { "name_tokenizer": { "type": "pattern", "pattern": "[^a-zA-Z']+" } } } name_analyzer CharFilter Tokenizer Token Filter terms x O’Brien
  89. "analysis": { "analyzer": { "name_analyzer": { "type": "custom", "tokenizer": "name_tokenizer",

    "filter": ["lowercase"] } }, "tokenizer": { "name_tokenizer": { "type": "pattern", "pattern": "[^a-zA-Z']+" } } } name_analyzer CharFilter Tokenizer Token Filter terms x O’Brien
  90. exercise: write a custom analyzer
      tags: "red, two-headed, striped, really dangerous"

      curl -XGET -s 'http://localhost:9200/_analyze?analyzer=whitespace&pretty=true' -d 'red, two-headed, striped, really dangerous'
      red two-headed striped really dangerous

      curl -s -XGET 'http://127.0.0.1:9200/test/monster/_search?pretty' -d '{
        "query": { "match_all": {} },
        "filter": { "term": {"tags": "dangerous"} }
      }'
      {
        "took" : 3,
        "timed_out" : false,
        "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
        "hits" : {
          "total" : 1,
          "max_score" : 1.0,
          "hits" : [ {
            "_index" : "test",
            "_type" : "monster",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "title": "Scarlet Klackinblax",
              "tags": "red, two-headed, striped, really dangerous"
            }
          } ]
        }
      }
  94. exercise: write a custom analyzer
      # How to update the "test" index's analyzers:
      curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
        "analysis": {
          "analyzer": {
            "whitespace_analyzer": {
              "filter": ["lowercase"],
              "tokenizer": "whitespace_tokenizer"
            }
          },
          "tokenizer": {
            "whitespace_tokenizer": {
              "type": "pattern",
              "pattern": " +"
            }
          }
        }
      }'

      # Try it out:
      curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=whitespace_analyzer&pretty=true' -d 'all your base are belong to us, dude'

      # But the settings update fails on an open index:
      {
        "error" : "ElasticsearchIllegalArgumentException[Can't update non dynamic settings[[index.analysis.analyzer.comma_delim.filter.0, index.analysis.tokenizer.comma_delim_tokenizer.type, index.analysis.tokenizer.comma_delim_tokenizer.pattern, index.analysis.analyzer.comma_delim.tokenizer]] for open indices[[test]]]",
        "status" : 400
      }

      # Close the index, apply the settings change, then reopen:
      curl -s -XPOST 'http://localhost:9200/test/_close'
      curl -s -XPOST 'http://localhost:9200/test/_open'
  100. solution: write a custom analyzer
       curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
         "analysis": {
           "analyzer": {
             "comma_delim": {
               "filter": ["lowercase"],
               "tokenizer": "comma_delim_tokenizer"
             }
           },
           "tokenizer": {
             "comma_delim_tokenizer": {
               "type": "pattern",
               "pattern": ", +"
             }
           }
         }
       }'

       curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=comma_delim&pretty=true' -d 'red, two-headed, striped, really dangerous'
       "token": "red" ... "token": "two-headed" ... "token": "striped" ... "token": "really dangerous"
  104. ngrams
       'analyzer': {
           # A lowercase trigram analyzer
           'trigramalyzer': {
               'filter': ['lowercase'],
               'tokenizer': 'trigram_tokenizer'
           }
       },
       'tokenizer': {
           'trigram_tokenizer': {
               'type': 'nGram',
               'min_gram': 3,
               'max_gram': 3
               # Keeps all kinds of chars by default.
           }
       }
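       (Not in the deck: what trigrams look like, assuming trigramalyzer has been installed in the test index's settings like the analyzers above.)
       curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=trigramalyzer&pretty=true' -d 'fish'
       # Expect the overlapping tokens "fis" and "ish", which is what makes
       # substring and typo-tolerant matching work.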
  105. exercise: provisioning
       How would you provision a cluster if we were doing lots of CPU-expensive queries on a large corpus, but only a small subset of the corpus was “hot”?
  106. recommendations
       • At least 1 replica
       • Plenty of shards—but not a million
       • At least 3 nodes
       • Avoid split-brain: discovery.zen.minimum_master_nodes: 2
       • Get unlucky anyway? Set fire to the data center and walk away. Or continually repopulate.
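       (Not in the deck: the usual quorum rule behind that setting.)
       # elasticsearch.yml: require a strict majority of master-eligible nodes
       # before electing a master. With N master-eligible nodes:
       #   minimum_master_nodes = floor(N / 2) + 1
       # e.g. N=3 gives 2; N=5 gives 3.
       discovery.zen.minimum_master_nodes: 2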
  111. too friendly
       • Protect with a firewall, or try elasticsearch-jetty.
       • discovery.zen.ping.multicast.enabled: false
       • discovery.zen.ping.unicast.hosts: ["master1", "master2"]
       • cluster.name: something_weird
  113. adding nodes without downtime
       • Puppet out a new config file:
         discovery.zen.ping.unicast.hosts: ["old.example.com", ..., "new.example.com"]
       • Bring up the new node.
  115. monitoring
       curl -XGET -s 'http://localhost:9200/_cluster/health?pretty'
       {
         "cluster_name" : "grinchertoo",
         "status" : "yellow",
         "timed_out" : false,
         "number_of_nodes" : 1,
         "number_of_data_nodes" : 1,
         "active_primary_shards" : 29,
         "active_shards" : 29,
         "relocating_shards" : 0,
         "initializing_shards" : 0,
         "unassigned_shards" : 26
       }

       curl -XGET -s 'http://localhost:9200/_cluster/state?pretty'
       {
         "cluster_name" : "elasticsearch",
         "version" : 3,
         "master_node" : "ACuIytIIQ7G7b_Rg_G7wnA", ...
  117. exercise: monitoring
       Why is just checking for cluster color insufficient? What could we check in addition?
       "cluster_name" : "grinchertoo",
       "status" : "yellow",
       "timed_out" : false,
       "number_of_nodes" : 1,
       "number_of_data_nodes" : 1,
       "active_primary_shards" : 29,
       "active_shards" : 29,
       "relocating_shards" : 0,
       "initializing_shards" : 0,
       "unassigned_shards" : 26
  118. open files
       /etc/security/limits.conf:
         es_user soft nofile 65535
         es_user hard nofile 65535
       /etc/init.d/elasticsearch:
         ulimit -n 65535
         ulimit -l unlimited
  119. shrinking indices
       % vmstat -S m -a 2
       procs -----------memory---------- ---swap-- -----io----
        r  b  swpd  free  inact  active   si   so   bi   bo
        1  0     4    37     54      55    0    0    0    1
        0  0     4    37     54      55    0    0    0    0
        0  0     4    37     54      55    0    0    0    0

       "some_doctype" : { "_source" : {"enabled" : false} }
       "some_doctype" : { "_all" : {"enabled" : false} }
       "some_doctype" : { "some_field" : {"include_in_all" : false} }
  123. filter caching
       "filter": { "terms": { "tags": ["red", "green"], "execution": "plain" } }
       "filter": { "terms": { "tags": ["red", "green"], "execution": "bool" } }
  124. how to reindex
       • Use the Bulk API.
       • Turn off auto-refresh:
         curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "-1" } }'
       • index.merge.policy.merge_factor: 1000
       • Remove replicas if you can.
       • Use multiple feeder processes.
       • Put everything back. (See the sketch just below.)
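       (Not in the deck: a sketch of “putting everything back”; the values shown are the common defaults and are assumptions about your setup.)
       curl -XPUT localhost:9200/test/_settings -d '{
         "index" : {
           "refresh_interval" : "1s",
           "merge.policy.merge_factor" : 10,
           "number_of_replicas" : 1
         }
       }'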
  125. backups
       • Backups used to be fairly cumbersome, but now there’s an API for that!
       • Set it up:
         curl -XPUT 'http://localhost:9200/_snapshot/backups' -d '{
           "type": "fs",
           "settings": {
             "location": "/somewhere/backups",
             "compress": true
           }
         }'
       • Run a backup:
         curl -XPUT "localhost:9200/_snapshot/backups/july20"
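       (Not in the deck: the matching restore and status calls. A sketch; restore won’t overwrite indices that are open, so close or delete them first.)
       # Restore everything in the snapshot:
       curl -XPOST 'localhost:9200/_snapshot/backups/july20/_restore'
       # Inspect the snapshot:
       curl -XGET 'localhost:9200/_snapshot/backups/july20?pretty'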
  126. synonyms
       "filter": {
         "synonym": {
           "type": "synonym",
           "synonyms": [
             "albert => albert, al",
             "allan => allan, al"
           ]
         }
       }

       original query: Allan Smith → after synonyms: [allan, al] smith
       original query: Albert Smith → after synonyms: [albert, al] smith
  127. synonym gotchas
       • You can set up synonyms at indexing or at query time.
       • For all that’s beautiful in this world, do it at query time.
       • Doing it at indexing explodes your data size.
       • You can store synonyms in a file, and reference that file in your mapping.
       • Many gotchas:
         • Undocumented limits on the file.
         • Needs to be uploaded to the config dir on each node.
  128. suggesters
       • Use to suggest possible search terms, or complete queries.
       • Types:
         • Term and Phrase: will do spelling corrections
         • Completion: for autocomplete
         • Context: limit suggestions to a subset
  129. autocompletion
       • Why? Hook your query up to JS and query-as-they-type.
       • Completion suggester (faster, newer, slightly cumbersome)
       • Prefix queries (slower, older, more reliable)
       • Both require mapping changes to work; a sketch of the completion mapping follows.
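       (Not in the deck: a minimal sketch of the mapping the completion suggester needs. The field name "suggest" matches the query on the next slide; the doctype and analyzers are assumptions.)
       curl -s -XPUT 'http://localhost:9200/test/_mapping/books' -d '{
         "properties": {
           "title": { "type": "string" },
           "suggest": {
             "type": "completion",
             "index_analyzer": "simple",
             "search_analyzer": "simple"
           }
         }
       }'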
  130. suggester autocompletion
       curl -XPOST 'localhost:9200/test/books/_suggest?pretty' -d '{
         "title-suggest" : {
           "text" : "p",
           "completion" : { "field" : "suggest" }
         }
       }'