Slide 1

Slide 1 text

Elasticsearch: the missing tutorial
Erik Rose (@ErikRose)
Laura Thomson (@lxt)
Mozilla

Slide 3

Slide 3 text

housekeeping

Slide 6

Slide 6 text

housekeeping
• Make sure ES is installed. If you haven’t installed it yet and you’re on a Mac, just install 1.1.x.
• Exercise code: clone the git repo at (or just visit) https://github.com/erikrose/oscon-elasticsearch/
• Make faces.

Slide 7

Slide 7 text

what it’s good for

Slide 11

Slide 11 text

what it’s good for
• Full-text search
• Big data
• Faceting
• Geographical queries

Slide 12

Slide 12 text

Shay Banon, Heavy Lifter

Slide 13

Slide 13 text

the rest of us ?

Slide 15

Slide 15 text

characteristics

Slide 16

Slide 16 text

• Elasticsearch wraps Lucene ("lucene++").
• Read/write/admin via REST: HTTP on port 9200.
• Native format is JSON (vs. XML).

Slide 17

Slide 17 text

CAP
• CAP: consistency, availability, partition tolerance
• “Pick any two”
• “When it comes to CAP, in a very high level, elasticsearch gives up on partition tolerance” (2010)

Slide 18

Slide 18 text

CAP
• …it’s not that simple
• Consistency is mostly eventual.
• Availability is variable.
• Partition tolerant it’s not.
• Read http://aphyr.com/posts/317-call-me-maybe-elasticsearch (and despair).

Slide 19

Slide 19 text

what it’s good for, redux
• Generally not suitable as a primary data store.
• It’s a distributed search engine.
• Easy to get started
• Easy to integrate with your existing web app
• Easy to configure it not-too-terribly
• Enables fast search with cool features

Slide 20

Slide 20 text

definitions

Slide 21

Slide 21 text

nodes and clusters
• node: a machine in your cluster
• cluster: the set of nodes running ES
• master node: elected by the cluster. If the master fails, another node will take over.

Slide 22

Slide 22 text

shards and replicas
• shard: a Lucene index. Each piece of data you store is written to a primary shard. Primary shards are distributed over the cluster.
• replica: each shard has a set of distributed replicas (copies). Data written to a primary shard is copied to replicas on different nodes.
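The shard and replica definitions above can be sketched in a few lines of Python. This is only an illustration: the CRC32 hash, the shard count of 5, and the node names are assumptions made for the example, not what Elasticsearch does internally.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed shard count, for illustration only


def pick_shard(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Map a document ID to a primary shard number.

    CRC32 is a stand-in hash picked for this sketch; it is not
    the hash function Elasticsearch itself uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def place_copies(shard, nodes, replicas=1):
    """Put the primary on one node and each replica on a different node."""
    primary = nodes[shard % len(nodes)]
    others = [n for n in nodes if n != primary]
    return {"primary": primary, "replicas": others[:replicas]}


nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
for doc_id in ["1", "2", "6a8ca01c-7896-48e9-81cc-9f70661fcb32"]:
    shard = pick_shard(doc_id)
    print(doc_id, "-> shard", shard, place_copies(shard, nodes))
```

The point to take away is that the same ID always lands on the same primary shard, and a replica never shares a node with its primary.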

Slide 23

Slide 23 text

self-defense

Slide 24

Slide 24 text

exercise: fix clustering and listening

# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
#
# 1. Disable multicast discovery (enabled by default):
#
discovery.zen.ping.multicast.enabled: false

# Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
# on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
# communication. (the range means that if the port is busy, it will automatically
# try the next port).

# Set the bind address specifically (IPv4 or IPv6):
#
network.bind_host: 127.0.0.1

Slide 25

Slide 25 text

exercise: start up and check
% cd elasticsearch-1.2.2
% bin/elasticsearch
# On the Mac:
% JAVA_HOME=$(/usr/libexec/java_home -v 1.7) bin/elasticsearch

% curl -s -XGET 'http://127.0.0.1:9200/_cluster/health?pretty'
{
  "cluster_name" : "grinchertoo",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 19,
  "active_shards" : 19,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 13
}
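To read the health response at a glance, a small script can help. The sketch below parses the sample response shown on this slide (pasted in as a string rather than fetched from a live node) and prints what the status color means. The `summarize` helper is a name made up for this example.

```python
import json

# Response body copied from the health check on this slide.
HEALTH_JSON = """
{ "cluster_name": "grinchertoo", "status": "yellow", "timed_out": false,
  "number_of_nodes": 1, "number_of_data_nodes": 1,
  "active_primary_shards": 19, "active_shards": 19,
  "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 13 }
"""

STATUS_MEANINGS = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, but some replicas are not",
    "red": "at least one primary shard is not allocated",
}


def summarize(health):
    """Return a one-line reading of a _cluster/health response."""
    return "{}: {}; {} unassigned shard(s) on {} node(s)".format(
        health["status"],
        STATUS_MEANINGS[health["status"]],
        health["unassigned_shards"],
        health["number_of_nodes"],
    )


health = json.loads(HEALTH_JSON)
print(summarize(health))
```

With a single node, yellow is expected: replicas have no second node to live on, so they stay unassigned.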

Slide 26

Slide 26 text

exercise: tool up curl

Slide 27

Slide 27 text

exercise: tool up BBEdit’s shell worksheets: http://pine.barebones.com/files/BBEdit_10.5.11.dmg

Slide 28

Slide 28 text

exercise: tool up Marvel/Sense: http://www.elasticsearch.org/overview/marvel/download/

Slide 30

Slide 30 text

data structure basics

Slide 31

Slide 31 text

index

Slide 32

Slide 32 text

index doctype

Slide 34

Slide 34 text

index doctype {… }

Slide 35

Slide 35 text

index doctype another doctype {… }

Slide 37

Slide 37 text

exercise: make an index
curl -s -XPUT 'http://localhost:9200/test/'

Slide 38

Slide 38 text

IDs

Slide 39

Slide 39 text

IDs
6a8ca01c-7896-48e9-81cc-9f70661fcb32

Slide 40

Slide 40 text

exercise: make a doc

Slide 43

Slide 43 text

exercise: make a doc
# Make a doc:
curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
  "title": "All About Fish",
  "author": "Fishy McFishstein",
  "pages": 3015
}'

# Make sure it's there:
curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
{
  "_index" : "test",
  "_type" : "book",
  "_id" : "1",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "title": "All About Fish",
    "author": "Fishy McFishstein",
    "pages": 3015
  }
}

Slide 44

Slide 44 text

exercise: make a doc
# Delete the doc:
curl -s -XDELETE 'http://localhost:9200/test/book/1'

Slide 45

Slide 45 text

diplodocus …………………………… 333 duodenum …………………………… 201 dwaal …………………………… 500, 119

Slide 46

Slide 46 text

Documents:
0: row row row your boat
1: row the row boat
2: chicken chicken chicken
3: the front row

Inverted index:
row → 0,1,3
boat → 0,1
chicken → 2
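The term-to-documents table on this slide can be rebuilt with a few lines of Python. This is a minimal sketch of the idea only, splitting on whitespace; Lucene's real data structures are far more elaborate.

```python
from collections import defaultdict

# The four documents from the slide, numbered 0-3.
docs = [
    "row row row your boat",
    "row the row boat",
    "chicken chicken chicken",
    "the front row",
]


def build_inverted_index(docs):
    """Map each term to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(docs)
for term in ("row", "boat", "chicken"):
    print(term, "→", index[term])
```

Looking up a word is now a dictionary access instead of a scan over every document.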

Slide 51

Slide 51 text

Documents:
0: row row row your boat
1: row the row boat
2: chicken chicken chicken
3: the front row

term → doc [positions]
row → 0 [0,1,2]  1 [0,2]  3 [2]
boat → 0 [4]  1 [3]
chicken → 2 [0,1,2]
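Storing positions is what makes phrase searches possible. The sketch below rebuilds the positional table from this slide and uses it to answer an exact-phrase lookup. The `phrase_match` helper is a hypothetical name for this example and only handles adjacent words (no slop).

```python
docs = [
    "row row row your boat",
    "row the row boat",
    "chicken chicken chicken",
    "the front row",
]


def build_positional_index(docs):
    """Map term -> {doc number -> list of word positions}."""
    index = {}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return docs where the phrase's words appear at consecutive positions."""
    terms = phrase.split()
    if any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + i in index[t].get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


index = build_positional_index(docs)
print(index["row"])
print(phrase_match(index, "row boat"))
```

Both doc 0 and doc 1 contain "row" and "boat", but only doc 1 has them next to each other, which is something the plain term-to-document table cannot tell you.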

Slide 57

Slide 57 text

Documents:
0: row row row your boat
1: row the row boat
2: chicken chicken chicken
3: the front row

term → doc [positions]
row 232 → 0 [0,1,2]  1 [0,2]  3 [2]
boat 78 → 0 [4]  1 [3]
chicken 91 → 2 [0,1,2]

Slide 58

Slide 58 text

indices on properties
"title": "All About Fish", "author": "Fishy McFishstein", "pages": 3015
"title": "Nothing About Pigs", "author": "Nopiggy Nopigman", "pages": 0
"title": "All About Everything", "author": "Everybody", "pages": 4294967295

Slide 60

Slide 60 text

inner objects
curl -s -XPUT 'http://localhost:9200/test/book/1' -d '{
  "title": "All About Fish",
  "author": {
    "name": "Fisher McFishstein",
    "birthday": "1980-02-22",
    "favorite_color": "green"
  }
}'

title: All About Fish
author.name: Fisher McFishstein
author.birthday: 1980-02-22
author.favorite_color: green
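The dotted field names on this slide come from flattening the nested JSON. A minimal Python sketch of that flattening follows. It is an illustration of the naming scheme, not Elasticsearch's own code, and `flatten` is a name invented here.

```python
def flatten(obj, prefix=""):
    """Turn nested dicts into a flat dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


book = {
    "title": "All About Fish",
    "author": {
        "name": "Fisher McFishstein",
        "birthday": "1980-02-22",
        "favorite_color": "green",
    },
}

for path, value in flatten(book).items():
    print(path + ":", value)
```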

Slide 61

Slide 61 text

inner objects
curl -s -XPUT 'http://localhost:9200/test/book/1' -d '{
  "title": "All About Fish",
  "author": {
    "name": "Fisher McFishstein",
    "birthday": "1980-02-22",
    "favorite_color": "green"
  }
}'

curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
{
  "_index" : "test",
  "_type" : "book",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title": "All About Fish",
    "author": {
      "name": "Fisher McFishstein",
      "birthday": "1980-02-22",
      "favorite_color": "green"
    }
  }
}

Slide 64

Slide 64 text

arrays
# Insert a doc containing an array:
curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
  "title": "All About Fish",
  "tag": ["one", "two", "red", "blue"]
}'

doc 1: ["one", "two", "red", "blue"]

term → doc
one → 1
two → 1
red → 1
blue → 1

Slide 65

Slide 65 text

exercise: array play
# Insert a bunch of different docs by changing the things in bold:
% curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
  "title": "All About Fish",
  "tag": ["one", "two", "red", "blue"]
}'

# A sample query--try changing the bold things:
% curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?pretty' -d '{
  "query": { "match_all": {} },
  "filter": { "term": {"tag": ["two", "three"]} }
}'

# Values to try: "red"  ["blue"]  ["one", "red"]  "two"

Slide 66

Slide 66 text

mappings

Slide 67

Slide 67 text

implicit mappings

Slide 75

Slide 75 text

implicit mappings
# Make a new album doc:
curl -s -XPUT 'http://127.0.0.1:9200/test/album/1' -d '{
  "title": "Fish Sounds",
  "gapless_playback": true,
  "length_seconds": 210000,
  "weight": 1.22,
  "released": "2013-01-23"
}'

# See what kind of mapping ES guessed:
curl -s -XGET 'http://127.0.0.1:9200/test/album/_mapping?pretty'
{
  "test" : {
    "mappings" : {
      "album" : {
        "properties" : {
          "title" : { "type" : "string" },
          "gapless_playback" : { "type" : "boolean" },
          "length_seconds" : { "type" : "long" },
          "weight" : { "type" : "double" },
          "released" : { "type" : "date", "format" : "dateOptionalTime" }
        }
      }
    }
  }
}
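The guessing shown in this mapping can be imitated in Python to get a feel for where it goes wrong. The sketch below is a deliberately simplified stand-in: `guess_type` is a made-up helper, and it only recognizes plain `YYYY-MM-DD` dates, whereas the real detection rules are broader.

```python
from datetime import datetime


def guess_type(value):
    """Guess a mapping type from a JSON value (simplified imitation)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    raise TypeError("unsupported value: {!r}".format(value))


album = {
    "title": "Fish Sounds",
    "gapless_playback": True,
    "length_seconds": 210000,
    "weight": 1.22,
    "released": "2013-01-23",
}

mapping = {field: {"type": guess_type(v)} for field, v in album.items()}
print(mapping)

# A title that merely looks like a date fools the guesser:
print(guess_type("1999-12-31"))
```

Because the guess is made from the first value seen, one unusual document can lock a field into the wrong type. That is the motivation for setting mappings explicitly.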

Slide 79

Slide 79 text

explicit mappings
# The mapping ES guessed:
{
  "test" : {
    "mappings" : {
      "album" : {
        "properties" : {
          "title" : { "type" : "string" },
          "gapless_playback" : { "type" : "boolean" },
          "length_seconds" : { "type" : "long" },
          "weight" : { "type" : "double" },
          "released" : { "type" : "date", "format" : "dateOptionalTime" }
        }
      }
    }
  }
}

# Delete the doctype:
curl -s -XDELETE 'http://127.0.0.1:9200/test/album'

# Set the mapping explicitly:
curl -s -XPUT 'http://127.0.0.1:9200/test/_mapping/album' -d '{
  "properties" : {
    "title" : { "type" : "string" },
    "gapless_playback" : { "type" : "boolean" },
    "length_seconds" : { "type" : "long" },
    "weight" : { "type" : "double" },
    "released" : { "type" : "date", "format" : "dateOptionalTime" }
  }
}'

Slide 80

Slide 80 text

exercise: use explicit mappings
1. Delete the “album” doctype, if you’ve made one by following along.
2. Think of an album which would prompt ES to guess a wrong type.
3. Insert it, and GET the _mapping to show the wrong guess.
4. Delete all “album” docs again so you can change the mapping.
5. Set a mapping explicitly so you can’t fool ES anymore.

Slide 81

Slide 81 text

Lurking Horrors

Slide 82

Slide 82 text

queries

Slide 83

Slide 83 text

queries
• Query ES via HTTP/REST
• Possible to do with query string
• DSL is better
• Let’s write some queries.
• But first, let’s get some data in our cluster to query.

Slide 84

Slide 84 text

exercise 1 • Bulk load a small test data set to use for querying. • This is exercise_1 in the queries/ directory of the git repo, so you can cut and paste, or execute it directly. % curl -XPOST localhost:9200/_bulk --data-binary @data.bulk

Slide 85

Slide 85 text

% curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty' exercise 2 • Let’s check we can pull that data, by grabbing a single document. • This is exercise_2 in the queries/ directory of the repo, so you can cut and paste.

Slide 86

Slide 86 text

exercise 3 • We’ll begin by using a URI search (sometimes called, a little fuzzily, a query string query). • (This is exercise_3) % curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?q=title:Python'

Slide 87

Slide 87 text

limited appeal
• Passes searches via GET in the query string
• This is fine for running simple queries, basic “is it working” type tests and so on.
• Once you have any level of complexity in your query, you’re going to need the query DSL.

Slide 88

Slide 88 text

query DSL
• DSL == Domain Specific Language
• DSL is an AST (abstract syntax tree) of queries.
• What does that actually mean?
• Write your queries in JSON, which can be arbitrarily complex.

Slide 89

Slide 89 text

simple DSL term query
{
  "query" : {
    "match" : { "title" : "Python" }
  }
}

Slide 90

Slide 90 text

exercise 4
• Run this query (exercise 4).
% curl -XGET 'http://localhost:9200/test/book/_search' -d '{
  "query" : {
    "match" : { "title" : "Python" }
  }
}'
(What do you notice about the results?)

Slide 91

Slide 91 text

queries vs. filters
• Filters:
  • Boolean: document matches or it does not
  • Order of magnitude faster than queries
  • Use for exact values
  • Cacheable

Slide 92

Slide 92 text

queries vs. filters
• Queries:
  • Use for full text searches
  • Relevance scored
Filter when you can; query when you must.

Slide 93

Slide 93 text

use them together!
curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
'{
  "query": {
    "filtered": {
      "filter": {
        "term": { "category": "Web Development" }
      },
      "query": {
        "bool": {
          "should": [
            { "match": { "title": "Python" } },
            { "match": { "summary": "Python" } }
          ]
        }
      }
    }
  }
}'

Slide 94

Slide 94 text

exercise 5 • Let’s run that query. • (This is exercise_5)

Slide 95

Slide 95 text

exercise 5 results • Where are my results???

Slide 96

Slide 96 text

exercise 6
• Like many relational databases, Elasticsearch supports an explainer. Let’s run it on this query.
• (This is exercise_6)
curl -XGET -s 'http://localhost:9200/test/book/4/_explain?pretty=true' -d \
'{
  "query": {
    "filtered": {
      "filter": {
        "term": { "category": "Web Development" }
      },
      "query": { …

Slide 97

Slide 97 text

exercise 6 results
{
  "_index" : "test",
  "_type" : "book",
  "_id" : "4",
  "matched" : false,
  "explanation" : {
    "value" : 0.0,
    "description" : "failure to match filter: cache(category:Web Development)",
    "details" : [ { …

Slide 98

Slide 98 text

analyze that!
• This is a classic beginner gotcha.
• With the standard analyzer, which is applied to all fields by default, “Web Development” is broken into the terms “web” and “development”, and those are what get indexed.
• The term “Web Development” is not indexed anywhere.

Slide 99

Slide 99 text

but match works!
• term queries or filters look for an exact match, so they find nothing.
• But {"match" : "Web Development"} does work. Why?
• match queries use analysis: they break this down into searches for “web” or “development”.

Slide 100

Slide 100 text

exercise 7 • Let’s make it work. • One solution is in exercise_7. • Take a couple minutes before peeking. • TMTOWTDI

Slide 101

Slide 101 text

summary: term vs. match
• Term queries look for the whole term and are not analyzed.
• Match queries are analyzed, and look for matches to the analyzed parts of the query.
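The term-versus-match gotcha can be reproduced without a cluster. In the sketch below, `standard_like_analyze` is a rough stand-in for the standard analyzer (lowercase, split on anything that is not a letter or digit), and the two lookup functions are hypothetical names that mimic how each query type treats its input.

```python
import re


def standard_like_analyze(text):
    """Rough imitation of the standard analyzer, for illustration only."""
    return re.findall(r"[a-z0-9]+", text.lower())


# What ends up in the index for the category "Web Development":
indexed_terms = set(standard_like_analyze("Web Development"))


def term_lookup(value):
    """Like a term query: the input is NOT analyzed, so it must match exactly."""
    return value in indexed_terms


def match_lookup(text):
    """Like a match query: the input is analyzed, then any resulting term may hit."""
    return any(t in indexed_terms for t in standard_like_analyze(text))


print(sorted(indexed_terms))
print(term_lookup("Web Development"))
print(match_lookup("Web Development"))
print(term_lookup("web"))
```

The index holds only "web" and "development", so the unanalyzed lookup for "Web Development" misses while the analyzed one hits.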

Slide 102

Slide 102 text

match_phrase
curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
'{
  "query": {
    "match_phrase": {
      "summary": {
        "query": "old versions of browsers",
        "slop": 2
      }
    }
  }
}'

Slide 103

Slide 103 text

boolean queries
• Where are my favorites, AND, OR, and NOT?
• Tortured syntax of the bool query:
  • must: everything in the must clause is AND
  • should: everything in the should clause is OR
  • must_not: you guessed it.
• Nest them as much as you like

Slide 104

Slide 104 text

boolean bonuses
• minimum_should_match is the number of should clauses that have to match.
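Writing nested bool queries by hand gets fiddly, so it can help to build them programmatically. The sketch below assembles a bool query body as a Python dict and prints it as JSON. The `bool_query` helper is a name invented for this example, and the "level" field in the must_not clause is a hypothetical field, not one known to be in the exercise data.

```python
import json


def bool_query(must=None, should=None, must_not=None, minimum_should_match=None):
    """Assemble a bool query body, leaving out clauses that are not given."""
    body = {}
    if must:
        body["must"] = must
    if should:
        body["should"] = should
    if must_not:
        body["must_not"] = must_not
    if minimum_should_match is not None:
        body["minimum_should_match"] = minimum_should_match
    return {"query": {"bool": body}}


query = bool_query(
    must=[{"match": {"title": "Python"}}],
    should=[
        {"match": {"category": "development"}},
        {"match": {"category": "programming"}},
    ],
    must_not=[{"match": {"level": "beginner"}}],  # "level" is a hypothetical field
    minimum_should_match=1,
)

print(json.dumps(query, indent=2))
```

The printed JSON can be pasted after `-d` in a curl search request.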

Slide 105

Slide 105 text

"query": {
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "match": { "category": "development" } },
          { "match": { "category": "programming" } }
        ]
      }
    },
    "should": [
      { "match": {…

Slide 106

Slide 106 text

exercise 8 • Run this query - it’s in exercise_8 • Can you modify it to find books for intermediate or above level programmers?

Slide 107

Slide 107 text

faceting
• We’re not going to cover faceting: it is deprecated in favor of aggregations.

Slide 108

Slide 108 text

aggregations
• Aggregations let you put returned documents into buckets and run metrics over those buckets.
• Useful for drill-down navigation of data.

Slide 109

Slide 109 text

exercise 9
• Run a sample aggregation - exercise_9
curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
'{
  "size" : 0,
  "aggs" : {
    "category" : {
      "terms" : { "field" : "category" }
    }
  }
}'
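Conceptually, a terms aggregation is a count of documents per distinct field value. The sketch below imitates that bucketing in plain Python. The book list is made-up sample data, and the sketch assumes the field is compared as a whole value (an analyzed field would be bucketed by its individual terms instead).

```python
from collections import Counter

# Hypothetical sample documents, not the exercise data set.
books = [
    {"title": "Book A", "category": "Web Development"},
    {"title": "Book B", "category": "Programming"},
    {"title": "Book C", "category": "Web Development"},
]


def terms_aggregation(docs, field):
    """Bucket docs by the value of one field, biggest bucket first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


for bucket in terms_aggregation(books, "category"):
    print(bucket)
```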

Slide 110

Slide 110 text

scoring
• You can affect the way ES calculates relevance scores for results. For example:
  • Boost: weigh one part of a query more heavily than others
  • Custom function-scoring queries: e.g. weighting more complete user profiles
  • Constant score queries: pre-set a score for part of a query (useful for filters!)

Slide 111

Slide 111 text

boosting
"query": {
  "bool": {
    "should": [
      { "term": { "title": { "value": "python", "boost": 2.0 } } },
      { "term": { "summary": "python" } }
    ]
  }
}

Slide 112

Slide 112 text

function scoring
curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
'{
  "query": {
    "function_score": {
      "query": {
        "match": { "title": "Python" }
      },
      "script_score": {
        "script": "_score * doc[\"rating\"].value"
      }
    }
  }
}'

Slide 113

Slide 113 text

scripting languages
• You have various options for writing your functions:
• Default has been mvel but is now Groovy
• Plugins for:
  • JS
  • Python
  • Clojure
  • mvel

Slide 114

Slide 114 text

analysis

Slide 119

Slide 119 text

stock analyzers
original:   Red-orange gerbils live at #43A Franklin St.
whitespace: Red-orange gerbils live at #43A Franklin St.
standard:   red orange gerbils live 43a franklin st
simple:     red orange gerbils live at a franklin st
stop:       red orange gerbils live franklin st
snowball:   red orang gerbil live 43a franklin st
• stopwords
• stemming
• punctuation
• case-folding
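Three of the rows above are simple enough to imitate in Python, which makes the differences concrete. The functions below are rough stand-ins written for this example, not the real analyzers, and the stop list is modelled on Lucene's default English set.

```python
import re

# Stop list modelled on Lucene's default English stop words (an assumption here).
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def whitespace(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()


def simple(text):
    """Lowercase, then keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def stop(text):
    """Like simple, but drop stopwords."""
    return [t for t in simple(text) if t not in STOPWORDS]


original = "Red-orange gerbils live at #43A Franklin St."
for analyzer in (whitespace, simple, stop):
    print(analyzer.__name__ + ":", " ".join(analyzer(original)))
```

Note how "#43A" survives whole under whitespace, shrinks to "a" under simple, and vanishes under stop because "a" is a stopword.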

Slide 120

Slide 120 text

curl -XGET -s 'http://localhost:9200/_analyze?analyzer=whitespace&pretty=true' -d 'Red-orange gerbils live at #43A Franklin St.'
{
  "tokens" : [ {
    "token" : "Red-orange",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "gerbils",
    "start_offset" : 11,
    "end_offset" : 18,
    "type" : "word",
    "position" : 2
  }, ...

Slide 121

Slide 121 text

exercise: find 10 stopwords
curl -XGET -s 'http://localhost:9200/_analyze?analyzer=stop&pretty=true' -d 'The word "an" is a stopword.'
Hint: Run the above and see what happens.

Slide 124

Slide 124 text

solution: find 10 stopwords
curl -XGET -s 'http://localhost:9200/_analyze?analyzer=stop&pretty=true' -d 'The an is a with that be for to and snookums'
{
  "tokens" : [ {
    "token" : "snookums",
    "start_offset" : 36,
    "end_offset" : 44,
    "type" : "word",
    "position" : 11
  } ]
}

positions: [0,1,2] [0,2] [2] [4] [3] [0,1,2]

Slide 125

Slide 125 text

applying mappings to properties
curl -s -XPUT 'http://127.0.0.1:9200/test/_mapping/album' -d '{
  "properties": {
    "title": { "type": "string" },
    "description": { "type": "string", "analyzer": "snowball" },
    ...
  }
}'

Slide 126

Slide 126 text

analyzer internals name_analyzer CharFilter Tokenizer Token Filter terms

Slide 127

Slide 127 text

analyzer internals name_analyzer CharFilter Tokenizer Token Filter terms O Brien ’

Slide 128

Slide 128 text

analyzer internals name_analyzer CharFilter Tokenizer Token Filter terms O Brien
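The three stages in the diagram (char filters, then one tokenizer, then token filters) can be sketched as a small Python pipeline. Everything here is an illustration: `analyze` and the two tokenizer functions are made-up names, the first tokenizer is a crude stand-in for a default one, and the second uses the `[^a-zA-Z']+` pattern from the name_analyzer definition in this section. The sample name is arbitrary.

```python
import re


def default_like_tokenize(text):
    """Crude stand-in: split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def name_tokenize(text):
    """Pattern tokenizer that keeps apostrophes inside tokens."""
    return [t for t in re.split(r"[^a-zA-Z']+", text) if t]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through char filters, one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = [token_filter(t) for t in tokens]
    return tokens


name = "Conan O'Brien"  # arbitrary sample input
print(analyze(name, [], default_like_tokenize, [str.lower]))
print(analyze(name, [], name_tokenize, [str.lower]))
```

The first call breaks the surname into "o" and "brien"; the second keeps "o'brien" as a single term, which is the whole reason for writing a custom tokenizer.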

Slide 133

Slide 133 text

"analysis": {
  "analyzer": {
    "name_analyzer": {
      "type": "custom",
      "tokenizer": "name_tokenizer",
      "filter": ["lowercase"]
    }
  },
  "tokenizer": {
    "name_tokenizer": {
      "type": "pattern",
      "pattern": "[^a-zA-Z']+"
    }
  }
}

name_analyzer: CharFilter (x) → Tokenizer → Token Filter → terms
O’Brien

Slide 134

Slide 134 text

exercise: write a custom analyzer

Slide 139

Slide 139 text

exercise: write a custom analyzer
tags: "red, two-headed, striped, really dangerous"

curl -XGET -s 'http://localhost:9200/_analyze?analyzer=whitespace&pretty=true' -d 'red, two-headed, striped, really dangerous'
red two-headed striped really dangerous

curl -s -XGET 'http://127.0.0.1:9200/test/monster/_search?pretty' -d '{
  "query": { "match_all": {} },
  "filter": { "term": {"tags": "dangerous"} }
}'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "monster",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title": "Scarlet Klackinblax",
        "tags": "red, two-headed, striped, really dangerous"
      }
    } ]
  }
}

Slide 142

Slide 142 text

exercise: write a custom analyzer
# How to update the "test" index's analyzers:
curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
  "analysis": {
    "analyzer": {
      "whitespace_analyzer": {
        "filter": ["lowercase"],
        "tokenizer": "whitespace_tokenizer"
      }
    },
    "tokenizer": {
      "whitespace_tokenizer": {
        "type": "pattern",
        "pattern": " +"
      }
    }
  }
}'

# Updating analysis settings on an open index fails with an error like this:
{
  "error" : "ElasticsearchIllegalArgumentException[Can't update non dynamic settings[[index.analysis.analyzer.comma_delim.filter.0, index.analysis.tokenizer.comma_delim_tokenizer.type, index.analysis.tokenizer.comma_delim_tokenizer.pattern, index.analysis.analyzer.comma_delim.tokenizer]] for open indices[[test]]]",
  "status" : 400
}

curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=whitespace_analyzer&pretty=true' -d 'all your base are belong to us, dude'

Slide 145

Slide 145 text

exercise: write a custom analyzer

# How to update the "test" index's analyzers:
# Analysis settings are not dynamic, so close the index first...
curl -s -XPOST 'http://localhost:9200/test/_close'

curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
  "analysis": {
    "analyzer": {
      "whitespace_analyzer": {
        "filter": ["lowercase"],
        "tokenizer": "whitespace_tokenizer"
      }
    },
    "tokenizer": {
      "whitespace_tokenizer": {
        "type": "pattern",
        "pattern": " +"
      }
    }
  }
}'

# ...and reopen it before analyzing.
curl -s -XPOST 'http://localhost:9200/test/_open'

curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=whitespace_analyzer&pretty=true' -d 'all your base are belong to us, dude'

Slide 146

Slide 146 text

solution: write a custom analyzer

Slide 147

Slide 147 text

solution: write a custom analyzer

curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
  "analysis": {
    "analyzer": {
      "comma_delim": {
        "filter": ["lowercase"],
        "tokenizer": "comma_delim_tokenizer"
      }
    },
    "tokenizer": {
      "comma_delim_tokenizer": {
        "type": "pattern",
        "pattern": ", +"
      }
    }
  }
}'

Slide 148

Slide 148 text

solution: write a custom analyzer

curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
  "analysis": {
    "analyzer": {
      "comma_delim": {
        "filter": ["lowercase"],
        "tokenizer": "comma_delim_tokenizer"
      }
    },
    "tokenizer": {
      "comma_delim_tokenizer": {
        "type": "pattern",
        "pattern": ", +"
      }
    }
  }
}'

Slide 149

Slide 149 text

solution: write a custom analyzer

curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
  "analysis": {
    "analyzer": {
      "comma_delim": {
        "filter": ["lowercase"],
        "tokenizer": "comma_delim_tokenizer"
      }
    },
    "tokenizer": {
      "comma_delim_tokenizer": {
        "type": "pattern",
        "pattern": ", +"
      }
    }
  }
}'

curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=comma_delim&pretty=true' -d 'red, two-headed, striped, really dangerous'

  "token": "red"
  ...
  "token": "two-headed"
  ...
  "token": "striped"
  ...
  "token": "really dangerous"

Slide 150

Slide 150 text

solution: write a custom analyzer

curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
  "analysis": {
    "analyzer": {
      "comma_delim": {
        "filter": ["lowercase"],
        "tokenizer": "comma_delim_tokenizer"
      }
    },
    "tokenizer": {
      "comma_delim_tokenizer": {
        "type": "pattern",
        "pattern": ", +"
      }
    }
  }
}'

curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=comma_delim&pretty=true' -d 'red, two-headed, striped, really dangerous'

  "token": "red"
  ...
  "token": "two-headed"
  ...
  "token": "striped"
  ...
  "token": "really dangerous"
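To see why the comma_delim analyzer yields those four tokens, here is a rough offline approximation in Python. It is only a sketch: the function name comma_delim_analyze is made up for illustration, and it assumes the "pattern" tokenizer simply splits on the regex and the "lowercase" filter lowercases each token. It does not talk to Elasticsearch.

```python
import re


def comma_delim_analyze(text):
    """Approximate the comma_delim analyzer: split on ", +" and lowercase."""
    # The "pattern" tokenizer treats the regex as the separator between tokens.
    tokens = re.split(r", +", text)
    # Drop empty strings, then apply the equivalent of the "lowercase" filter.
    return [token.lower() for token in tokens if token]


tokens = comma_delim_analyze("red, two-headed, striped, really dangerous")
print(tokens)
```

Note that "really dangerous" stays one token because the separator requires a comma, which is the point of the exercise.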

Slide 151

Slide 151 text

ngrams

'analyzer': {
    # A lowercase trigram analyzer
    'trigramalyzer': {
        'filter': ['lowercase'],
        'tokenizer': 'trigram_tokenizer'
    }
},
'tokenizer': {
    'trigram_tokenizer': {
        'type': 'nGram',
        'min_gram': 3,
        'max_gram': 3
        # Keeps all kinds of chars by default.
    }
}

Slide 152

Slide 152 text

ngrams

“Chemieingenieurwesen”

Slide 153

Slide 153 text

ngrams

“Chemieingenieurwesen”

…ing nge gen eni nie ieu eur…

Slide 154

Slide 154 text

ngrams

“Chemieingenieurwesen”

…ing nge gen eni nie ieu eur…
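The trigram list above can be reproduced offline with a few lines of Python. This is a sketch of what a 3-gram tokenizer plus lowercase filter produces, not the Lucene implementation; the helper name trigrams is hypothetical.

```python
def trigrams(text, n=3):
    """Return every run of n consecutive characters, lowercased."""
    lowered = text.lower()
    # Slide a window of width n across the string, one character at a time.
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


grams = trigrams("Chemieingenieurwesen")
print(" ".join(grams))
```

Because a query for "ingenieur" breaks into the same trigrams (ing, nge, gen, ...), it can match inside the compound word even though no whitespace separates its parts.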

Slide 155

Slide 155 text

clustering

Slide 156

Slide 156 text

shards

Slide 157

Slide 157 text

shards

curl -XPUT 'http://localhost:9200/twitter/' -d '
index:
    number_of_shards: 3
'

Slide 158

Slide 158 text

replicas

curl -XPUT 'http://localhost:9200/twitter/' -d '
index:
    number_of_shards: 3
    number_of_replicas: 2
'
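A quick back-of-the-envelope check of what those two settings imply. This is a sketch; the function names are made up for illustration. It relies on the fact that a replica is never allocated on the same node as its primary, so a fully allocated index needs at least one node per copy of a shard.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Primaries plus all of their replicas."""
    return number_of_shards * (1 + number_of_replicas)


def min_nodes_for_full_allocation(number_of_replicas):
    """Each copy of a given shard must live on a different node."""
    return 1 + number_of_replicas


copies = total_shard_copies(3, 2)
nodes = min_nodes_for_full_allocation(2)
print(copies, nodes)
```

With the settings on this slide that is 9 shard copies spread over at least 3 nodes; on fewer nodes some replicas stay unassigned.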

Slide 159

Slide 159 text

exercise: provisioning

How would you provision a cluster if we were doing lots of CPU-expensive queries on a large corpus, but only a small subset of the corpus was “hot”?

Slide 160

Slide 160 text

extremer extremes

Slide 161

Slide 161 text

recommendations

Slide 162

Slide 162 text

recommendations

• At least 1 replica

Slide 163

Slide 163 text

recommendations

• At least 1 replica
• Plenty of shards, but not a million

Slide 164

Slide 164 text

recommendations

• At least 1 replica
• Plenty of shards, but not a million
• At least 3 nodes.

Slide 165

Slide 165 text

recommendations

• At least 1 replica
• Plenty of shards, but not a million
• At least 3 nodes.
  Avoid split-brain: discovery.zen.minimum_master_nodes: 2

Slide 166

Slide 166 text

recommendations

• At least 1 replica
• Plenty of shards, but not a million
• At least 3 nodes.
  Avoid split-brain: discovery.zen.minimum_master_nodes: 2
• Get unlucky?

Slide 167

Slide 167 text

recommendations

• At least 1 replica
• Plenty of shards, but not a million
• At least 3 nodes.
  Avoid split-brain: discovery.zen.minimum_master_nodes: 2
• Get unlucky? Set fire to the data center and walk away.

Slide 168

Slide 168 text

recommendations

• At least 1 replica
• Plenty of shards, but not a million
• At least 3 nodes.
  Avoid split-brain: discovery.zen.minimum_master_nodes: 2
• Get unlucky? Set fire to the data center and walk away. Or continually repopulate.
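The value 2 for discovery.zen.minimum_master_nodes is a majority (quorum) of the 3 master-eligible nodes. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible_nodes):
    """Smallest group that is still a strict majority of the cluster."""
    return master_eligible_nodes // 2 + 1


quorum = minimum_master_nodes(3)
print(quorum)
```

With a majority required, two halves of a partitioned cluster cannot both elect a master. It also shows why 2 nodes are awkward: the quorum of 2 is 2, so losing either node stops master election.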

Slide 169

Slide 169 text

real-life examples

Slide 170

Slide 170 text

real-life examples

Slide 171

Slide 171 text

real-life examples

Slide 172

Slide 172 text

too friendly

Slide 173

Slide 173 text

too friendly

• Protect with a firewall, or try elasticsearch-jetty.

Slide 174

Slide 174 text

too friendly

• Protect with a firewall, or try elasticsearch-jetty.
• discovery.zen.ping.multicast.enabled: false

Slide 175

Slide 175 text

too friendly

• Protect with a firewall, or try elasticsearch-jetty.
• discovery.zen.ping.multicast.enabled: false
• discovery.zen.ping.unicast.hosts: ["master1", "master2"]

Slide 176

Slide 176 text

too friendly

• Protect with a firewall, or try elasticsearch-jetty.
• discovery.zen.ping.multicast.enabled: false
• discovery.zen.ping.unicast.hosts: ["master1", "master2"]
• cluster.name: something_weird

Slide 177

Slide 177 text

adding nodes without downtime

Slide 178

Slide 178 text

adding nodes without downtime

• Puppet out new config file:
  discovery.zen.ping.unicast.hosts: ["old.example.com", ..., "new.example.com"]

Slide 179

Slide 179 text

adding nodes without downtime

• Puppet out new config file:
  discovery.zen.ping.unicast.hosts: ["old.example.com", ..., "new.example.com"]
• Bring up the new node.

Slide 180

Slide 180 text

beware inconsistent config

Slide 181

Slide 181 text

be wary of upgrades

Slide 182

Slide 182 text

monitoring

curl -XGET -s 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "grinchertoo",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 29,
  "active_shards" : 29,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 26
}

Slide 183

Slide 183 text

monitoring

curl -XGET -s 'http://localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "grinchertoo",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 29,
  "active_shards" : 29,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 26
}

curl -XGET -s 'http://localhost:9200/_cluster/state?pretty'
{
  "cluster_name" : "elasticsearch",
  "version" : 3,
  "master_node" : "ACuIytIIQ7G7b_Rg_G7wnA",

Slide 184

Slide 184 text

exercise: monitoring

Why is just checking for cluster color insufficient? What could we check in addition?

  "cluster_name" : "grinchertoo",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 29,
  "active_shards" : 29,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 26
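One possible answer, sketched in Python: look at more than the color, for example whether the node count matches what you provisioned and whether any shards are unassigned. The function health_problems and the expected_nodes parameter are assumptions made for illustration; the JSON is the health response shown on the slide.

```python
import json

# The _cluster/health response from the slide.
HEALTH = json.loads("""
{
  "cluster_name" : "grinchertoo",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 29,
  "active_shards" : 29,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 26
}
""")


def health_problems(health, expected_nodes):
    """Return a list of human-readable problems found in a health response."""
    problems = []
    if health["status"] != "green":
        problems.append("status is %s" % health["status"])
    if health["timed_out"]:
        problems.append("health request timed out")
    # A partitioned cluster can report a healthy color while nodes are missing.
    if health["number_of_nodes"] < expected_nodes:
        problems.append("only %d of %d nodes present"
                        % (health["number_of_nodes"], expected_nodes))
    if health["unassigned_shards"] > 0:
        problems.append("%d unassigned shards" % health["unassigned_shards"])
    return problems


problems = health_problems(HEALTH, expected_nodes=3)
for problem in problems:
    print(problem)
```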

Slide 185

Slide 185 text

monitoring: elasticsearch-paramedic http://karmi.github.com/elasticsearch-paramedic/

Slide 186

Slide 186 text

monitoring: elasticsearch-paramedic http://karmi.github.com/elasticsearch-paramedic/

Slide 187

Slide 187 text

monitoring: marvel http://www.elasticsearch.org/overview/marvel/

Slide 188

Slide 188 text

monitoring: marvel http://www.elasticsearch.org/overview/marvel/

Slide 189

Slide 189 text

optimization

Slide 190

Slide 190 text

bootstrap.mlockall: true

Slide 191

Slide 191 text

ES_HEAP_SIZE: half of RAM

Slide 192

Slide 192 text

open files

Slide 193

Slide 193 text

open files

/etc/security/limits.conf:
  es_user soft nofile 65535
  es_user hard nofile 65535

Slide 194

Slide 194 text

open files

/etc/security/limits.conf:
  es_user soft nofile 65535
  es_user hard nofile 65535

✚

/etc/init.d/elasticsearch:
  ulimit -n 65535
  ulimit -l unlimited

Slide 195

Slide 195 text

Use default stores.

Slide 196

Slide 196 text

RAM & JVM tuning

Slide 197

Slide 197 text

MySQL

Slide 198

Slide 198 text

shrinking indices

Slide 199

Slide 199 text

shrinking indices

% vmstat -S m -a 2
procs -----------memory---------- ---swap-- -----io----
 r  b   swpd   free  inact active   si   so    bi    bo
 1  0      4     37     54     55    0    0     0     1
 0  0      4     37     54     55    0    0     0     0
 0  0      4     37     54     55    0    0     0     0

Slide 200

Slide 200 text

shrinking indices

% vmstat -S m -a 2
procs -----------memory---------- ---swap-- -----io----
 r  b   swpd   free  inact active   si   so    bi    bo
 1  0      4     37     54     55    0    0     0     1
 0  0      4     37     54     55    0    0     0     0
 0  0      4     37     54     55    0    0     0     0

"some_doctype" : {
    "_source" : {"enabled" : false}
}

Slide 201

Slide 201 text

shrinking indices

% vmstat -S m -a 2
procs -----------memory---------- ---swap-- -----io----
 r  b   swpd   free  inact active   si   so    bi    bo
 1  0      4     37     54     55    0    0     0     1
 0  0      4     37     54     55    0    0     0     0
 0  0      4     37     54     55    0    0     0     0

"some_doctype" : {
    "_all" : {"enabled" : false}
}

Slide 202

Slide 202 text

shrinking indices

% vmstat -S m -a 2
procs -----------memory---------- ---swap-- -----io----
 r  b   swpd   free  inact active   si   so    bi    bo
 1  0      4     37     54     55    0    0     0     1
 0  0      4     37     54     55    0    0     0     0
 0  0      4     37     54     55    0    0     0     0

"some_doctype" : {
    "some_field" : {"include_in_all" : false}
}

Slide 203

Slide 203 text

filter caching

Slide 204

Slide 204 text

filter caching

"filter": {
    "terms": {
        "tags": ["red", "green"],
        "execution": "plain"
    }
}

Slide 205

Slide 205 text

filter caching

"filter": {
    "terms": {
        "tags": ["red", "green"],
        "execution": "plain"
    }
}

"filter": {
    "terms": {
        "tags": ["red", "green"],
        "execution": "bool"
    }
}

Slide 206

Slide 206 text

dealing with the future

Slide 207

Slide 207 text

mappings

Slide 208

Slide 208 text

expensive updates

Slide 209

Slide 209 text

how to reindex

• Use Bulk API.
• Turn off auto-refresh:
  curl -XPUT localhost:9200/test/_settings -d '{
    "index" : {
      "refresh_interval" : "-1"
    }
  }'
• index.merge.policy.merge_factor: 1000
• Remove replicas if you can.
• Use multiple feeder processes.
• Put everything back.
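For the first bullet, the Bulk API takes newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. A minimal sketch of building such a body in Python; the helper name bulk_body and the sample documents are made up for illustration, and actually POSTing the body to /_bulk is left out.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a Bulk API request body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line: says what to do and where the document goes.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body("test", "book", [
    (1, {"title": "All Quiet on the Western Front"}),
    (2, {"title": "Animal Farm"}),
])
print(body)
```

Batching many documents per request is what saves the per-request overhead; several feeder processes can each send their own batches.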

Slide 210

Slide 210 text

backups

• Backups used to be fairly cumbersome, but now there’s an API for that!
• Set it up:
  curl -XPUT 'http://localhost:9200/_snapshot/backups' -d '{
    "type": "fs",
    "settings": {
      "location": "/somewhere/backups",
      "compress": true
    }
  }'
• Run a backup:
  curl -XPUT "localhost:9200/_snapshot/backups/july20"

Slide 211

Slide 211 text

fancy & advanced features

Slide 212

Slide 212 text

synonyms

"filter": {
    "synonym": {
        "type": "synonym",
        "synonyms": [
            "albert => albert, al",
            "allan => allan, al"
        ]
    }
}

original query: Allan Smith
after synonyms: [allan, al] smith

original query: Albert Smith
after synonyms: [albert, al] smith
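The expansion shown above can be mimicked offline. This is only a sketch of how the "=>" rules rewrite query tokens, not the synonym filter itself; parse_rules and expand are hypothetical names, and the sketch assumes the query has already been lowercased and split on whitespace.

```python
SYNONYM_RULES = [
    "albert => albert, al",
    "allan => allan, al",
]


def parse_rules(rules):
    """Turn 'left => a, b' rules into a {left: [a, b]} lookup table."""
    table = {}
    for rule in rules:
        left, right = rule.split("=>")
        table[left.strip()] = [term.strip() for term in right.split(",")]
    return table


def expand(query, table):
    """Replace each token by its synonym list, or keep it as is."""
    return [table.get(token, [token]) for token in query.lower().split()]


table = parse_rules(SYNONYM_RULES)
print(expand("Allan Smith", table))
print(expand("Albert Smith", table))
```

Both names expand to include "al", so a search for either one also finds documents that only say "Al Smith".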

Slide 213

Slide 213 text

synonym gotchas

• You can set up synonyms at indexing or at query time.
• For all that’s beautiful in this world, do it at query time.
• Doing it at indexing time explodes your data size.
• You can store synonyms in a file, and reference that file in your mapping.
• Many gotchas.
  • Undocumented limits on the file.
  • Needs to be uploaded to the config dir on each node.

Slide 214

Slide 214 text

suggesters

• Use to suggest possible search terms, or complete queries
• Types:
  • Term and Phrase - will do spelling corrections
  • Completion - for autocomplete
  • Context - limit suggestions to a subset

Slide 215

Slide 215 text

autocompletion

• Why? Hook your query up to JS and query as users type.
• Completion suggester (faster, newer, slightly cumbersome)
• Prefix queries (slower, older, more reliable)
• Both require mapping changes to work

Slide 216

Slide 216 text

suggester autocompletion

curl -X POST 'localhost:9200/test/books/_suggest?pretty' -d '{
  "title-suggest" : {
    "text" : "p",
    "completion" : {
      "field" : "suggest"
    }
  }
}'

Slide 217

Slide 217 text

prefix autocompletion

curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
'{
  "query": {
    "prefix": {
      "title": "P"
    }
  }
}'
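When wiring a prefix query to a search box, keep in mind that the prefix text is not analyzed. The sketch below builds the request body in Python and lowercases what the user typed; that step is an assumption that the title field is indexed with a lowercasing analyzer (such as the default standard analyzer), and it should be dropped for a not_analyzed field. The function name prefix_query is hypothetical.

```python
import json


def prefix_query(field, typed):
    """Build a prefix query body for what the user has typed so far."""
    # Assumption: indexed terms are lowercase, so lowercase the prefix too.
    return {"query": {"prefix": {field: typed.lower()}}}


request_body = json.dumps(prefix_query("title", "P"))
print(request_body)
```

The resulting string can be sent as the -d payload of the _search request shown above.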

Slide 218

Slide 218 text

thank you

Slide 219

Slide 219 text

thank you @ErikRose [email protected] @lxt [email protected]