ElasticSearch: The Missing Intro

ElasticSearch tutorial for OSCON 2014.

Laura Thomson

July 20, 2014

Transcript

  1. Elasticsearch: the missing tutorial
    Erik Rose & Laura Thomson
    Mozilla


  3. housekeeping
    • Make sure ES is installed. If you haven’t installed it yet and you’re
    on a Mac, just install 1.1.x.
    • Exercise code: clone the git repo at (or just visit)

    https://github.com/erikrose/oscon-elasticsearch/
    • Make faces.


  4. • Full-text search
    • Big data
    • Faceting
    • Geographical queries
    what it’s good for


  5. Shay Banon,
    Heavy Lifter


  6. the rest of us
    ?


  7. characteristics


  8. • Elasticsearch wraps Lucene.
    • Read/write/admin via REST
    • Native format is JSON (vs XML).
    lucene++
    JSON
    HTTP
    on port 9200


  9. • CAP: consistency, availability, partition tolerance
    • “pick any two”
    • “When it comes to CAP, in a very high level, elasticsearch gives up
    on partition tolerance” (2010)
    CAP


  10. • …it’s not that simple
    • Consistency is mostly eventual.
    • Availability is variable.
    • Partition tolerant it’s not.
    • Read http://aphyr.com/posts/317-call-me-maybe-elasticsearch
    (and despair).
    CAP


  11. • Generally not suitable as a primary data store.
    • It’s a distributed search engine
    • Easy to get started
    • Easy to integrate with your existing web app
    • Easy to configure it not-too-terribly
    • Enables fast search with cool features
    what it’s good for, redux


  12. definitions


  13. • node — a machine in your cluster
    • cluster — the set of nodes running ES
    • master node — Elected by the cluster. If the master fails, another
    node will take over.
    nodes and clusters


  14. • shard — A Lucene index. Each piece of data you store is written to a
    primary shard. Primary shards are distributed over the cluster.
    • replica — Each shard has a set of distributed replicas (copies). Data
    written to a primary shard is copied to replicas on different nodes.
    shards and replicas


  15. self-defense


  16. exercise: fix clustering and listening
    # Unicast discovery allows to explicitly control which nodes will be used
    # to discover the cluster. It can be used when multicast is not present,
    # or to restrict the cluster communication-wise.
    #
    # 1. Disable multicast discovery (enabled by default):
    #
    discovery.zen.ping.multicast.enabled: false

    # Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
    # on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
    # communication. (the range means that if the port is busy, it will automatically
    # try the next port).
    #
    # Set the bind address specifically (IPv4 or IPv6):
    #
    network.bind_host: 127.0.0.1


  17. exercise: start up and check
    % cd elasticsearch-1.2.2
    % bin/elasticsearch

    # On the Mac:
    % JAVA_HOME=$(/usr/libexec/java_home -v 1.7) bin/elasticsearch

    % curl -s -XGET 'http://127.0.0.1:9200/_cluster/health?pretty'
    {
      "cluster_name" : "grinchertoo",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 19,
      "active_shards" : 19,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 13
    }


  18. exercise: tool up
    curl


  19. exercise: tool up
    BBEdit’s shell worksheets:

    http://pine.barebones.com/files/BBEdit_10.5.11.dmg


  20. exercise: tool up
    Marvel/Sense: http://www.elasticsearch.org/overview/marvel/download/


  21. data structure
    basics


  22. index
    doctype
    another doctype
    {…
    }


  23. curl -s -XPUT 'http://localhost:9200/test/'
    exercise: make an index


  24. IDs
    6a8ca01c-7896-48e9-81cc-9f70661fcb32


  25. exercise: make a doc
    # Make a doc:
    curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
      "title": "All About Fish",
      "author": "Fishy McFishstein",
      "pages": 3015
    }'

    # Make sure it's there:
    curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
    {
      "_index" : "test",
      "_type" : "book",
      "_id" : "1",
      "_version" : 2,
      "found" : true,
      "_source" : {
        "title": "All About Fish",
        "author": "Fishy McFishstein",
        "pages": 3015
      }
    }


  26. # Delete the doc:
    curl -s -XDELETE 'http://localhost:9200/test/book/1'
    exercise: make a doc


  27. diplodocus …………………………… 333
    duodenum …………………………… 201
    dwaal …………………………… 500, 119


  28. 0: row row row your boat
    1: row the row boat
    2: chicken chicken chicken
    3: the front row

    term → docs
    row → 0,1,3
    boat → 0,1
    chicken → 2
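
The term-to-documents table above is an inverted index. A minimal Python sketch of building one for the same four toy documents follows; it is an illustration only, not how Lucene stores its index, and `build_inverted_index` is a name made up for this example.

```python
from collections import defaultdict

# The four toy documents from the slide; the list position is the doc ID.
docs = [
    "row row row your boat",
    "row the row boat",
    "chicken chicken chicken",
    "the front row",
]

def build_inverted_index(documents):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
print(index["row"])      # [0, 1, 3]
print(index["boat"])     # [0, 1]
print(index["chicken"])  # [2]
```

Looking up a term is then a single dictionary access rather than a scan over every document.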


  33. 0: row row row your boat
    1: row the row boat
    2: chicken chicken chicken
    3: the front row

    term → doc [positions]
    row → 0 [0,1,2]
          1 [0,2]
          3 [2]
    boat → 0 [4]
           1 [3]
    chicken → 2 [0,1,2]
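
Storing positions alongside doc IDs is what makes phrase matching possible. The sketch below rebuilds the positional table from the slide and uses it to find documents where words occur consecutively. It is a simplified illustration with hypothetical helper names (`build_positional_index`, `phrase_docs`), not Lucene's actual algorithm.

```python
from collections import defaultdict

docs = [
    "row row row your boat",
    "row the row boat",
    "chicken chicken chicken",
    "the front row",
]

def build_positional_index(documents):
    """Map term -> {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_docs(index, phrase):
    """Return the doc IDs in which the words of `phrase` appear back to back."""
    words = phrase.split()
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index.get(word, {}).get(doc_id, [])
                   for i, word in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

pos_index = build_positional_index(docs)
print(pos_index["row"])                    # {0: [0, 1, 2], 1: [0, 2], 3: [2]}
print(phrase_docs(pos_index, "row boat"))  # [1]
```

Doc 0 contains both "row" and "boat" but never adjacent, so only doc 1 matches the phrase.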


  34. 0: row row row your boat      1: row the row boat
    2: chicken chicken chicken    3: the front row

    term → doc [positions]
    row → 0 [0,1,2]
          1 [0,2]
          3 [2]
    boat → 0 [4]
           1 [3]
    chicken → 2 [0,1,2]


  35. 0: row row row your boat      1: row the row boat
    2: chicken chicken chicken    3: the front row

    term → doc [positions]
    row → 0 [0,1,2]
          1 [0,2]
          3 [2]
    boat → 0 [4]
           1 [3]
    chicken → 2 [0,1,2]
    ?


  37. 0: row row row your boat
    1: row the row boat
    2: chicken chicken chicken
    3: the front row

    term → doc [positions]
    row 232 → 0 [0,1,2]
              1 [0,2]
              3 [2]
    boat 78 → 0 [4]
              1 [3]
    chicken 91 → 2 [0,1,2]


  38. indices on properties
    "title": "All About Fish",
    "author": "Fishy McFishstein",
    "pages": 3015
    "title": "Nothing About Pigs",
    "author": "Nopiggy Nopigman",
    "pages": 0
    "title": "All About Everything",
    "author": "Everybody",
    "pages": 4294967295


  39. inner objects
    curl -s -XPUT 'http://localhost:9200/test/book/1' -d '{
      "title": "All About Fish",
      "author": {
        "name": "Fisher McFishstein",
        "birthday": "1980-02-22",
        "favorite_color": "green"
      }
    }'

    # Indexed as flattened properties:
    title: All About Fish
    author.name: Fisher McFishstein
    author.birthday: 1980-02-22
    author.favorite_color: green

    curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
    {
      "_index" : "test",
      "_type" : "book",
      "_id" : "1",
      "_version" : 1,
      "found" : true,
      "_source" : {
        "title": "All About Fish",
        "author": {
          "name": "Fisher McFishstein",
          "birthday": "1980-02-22",
          "favorite_color": "green"
        }
      }
    }


  40. arrays
    # Insert a doc containing an array:
    curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
      "title": "All About Fish",
      "tag": ["one", "two", "red", "blue"]
    }'

    doc 1: ["one", "two", "red", "blue"]

    term → doc
    one → 1
    two → 1
    red → 1
    blue → 1


  41. exercise: array play
    # Insert a bunch of different docs by changing the things in bold:
    % curl -s -XPUT 'http://127.0.0.1:9200/test/book/1' -d '{
      "title": "All About Fish",
      "tag": ["one", "two", "red", "blue"]
    }'

    # A sample query--try changing the bold things:
    % curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?pretty' -d '{
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {"tag": ["two", "three"]}
      }
    }'

    # Other values shown on the slide:
    "red"
    ["blue"]
    ["one", "red"]
    "two"


  42. mappings


  43. implicit mappings
    # Make a new album doc:
    curl -s -XPUT 'http://127.0.0.1:9200/test/album/1' -d '{
      "title": "Fish Sounds",
      "gapless_playback": true,
      "length_seconds": 210000,
      "weight": 1.22,
      "released": "2013-01-23"
    }'

    # See what kind of mapping ES guessed:
    curl -s -XGET 'http://127.0.0.1:9200/test/album/_mapping?pretty'
    {
      "test" : {
        "mappings" : {
          "album" : {
            "properties" : {
              "title" : {
                "type" : "string"
              },
              "gapless_playback" : {
                "type" : "boolean"
              },
              "length_seconds" : {
                "type" : "long"
              },
              "weight" : {
                "type" : "double"
              },
              "released" : {
                "type" : "date",
                "format" : "dateOptionalTime"
              }
            }
          }
        }
      }
    }


  44. explicit mappings
    {
    "test" : {
    "mappings" : {
    "album" : {
    "properties" : {
    "title" : {
    "type" : "string"
    },
    "gapless_playback" : {
    "type" : "boolean"
    },
    "length_seconds" : {
    "type" : "long"
    },
    "weight" : {
    "type" : "double"
    },
    "released" : {
    "type" : "date",
    "format" : "dateOptionalTime"
    }
    }
    }
    }
    }
    }
    curl -s -XPUT 'http://127.0.0.1:9200/test/_mapping/album' -d '{
    "properties" : {
    "title" : {
    "type" : "string"
    },
    "gapless_playback" : {
    "type" : "boolean"
    },
    "length_seconds" : {
    "type" : "long"
    },
    "weight" : {
    "type" : "double"
    },
    "released" : {
    "type" : "date",
    "format" : "dateOptionalTime"
    }
    }
    }'
    curl -s -XDELETE 'http://127.0.0.1:9200/test/album'


  45. 1. Delete the “album” doctype, if you’ve made one by following
    along.
    2. Think of an album which would prompt ES to guess a wrong type.
    3. Insert it, and GET the _mapping to show the wrong guess.
    4. Delete all “album” docs again so you can change the mapping.
    5. Set a mapping explicitly so you can’t fool ES anymore.
    exercise: use explicit mappings


  46. Lurking Horrors


  47. queries


  48. • Query ES via HTTP/REST
    • Possible to do with query string
    • DSL is better
    • Let’s write some queries.
    • But first, let’s get some data in our cluster to query.
    queries


  49. exercise 1
    • Bulk load a small test data set to use for querying.
    • This is exercise_1 in the queries/ directory of the git repo, so you
    can cut and paste, or execute it directly.
    % curl -XPOST localhost:9200/_bulk --data-binary @data.bulk
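
For readers who want to build their own bulk file: the Bulk API takes newline-delimited JSON, with an action line followed by a source line for each document, and the body must end with a newline. The sketch below generates such a body in Python. The sample documents and the `to_bulk` helper are made up for illustration; the contents of the repo's `data.bulk` file are not reproduced here.

```python
import json

def to_bulk(index, doctype, docs):
    """Render (doc_id, source) pairs as a Bulk API request body."""
    lines = []
    for doc_id, source in docs:
        # Action line: says what to do and where.
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doctype, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The Bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

# Hypothetical sample data, not the tutorial's data set.
sample = [
    ("1", {"title": "All About Fish", "pages": 3015}),
    ("2", {"title": "Nothing About Pigs", "pages": 0}),
]
body = to_bulk("test", "book", sample)
print(body, end="")
```

Writing `body` to a file and posting it with `--data-binary` (as in the curl command above) preserves the newlines, which plain `-d` would strip.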


  50. exercise 2
    • Let’s check we can pull that data, by grabbing a single document.
    • This is exercise_2 in the queries/ directory of the repo, so you can
    cut and paste.
    % curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'


  51. exercise 3
    • We’ll begin by using a URI search (sometimes called, a little fuzzily, a
    query string query).
    • (This is exercise_3)
    % curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?q=title:Python'


  52. • Passes searches via GET in the query string
    • This is fine for running simple queries, basic “is it working” type
    tests and so on.
    • Once you have any level of complexity in your query, you’re going
    to need the query DSL.
    limited appeal


  53. • DSL == Domain Specific Language
    • DSL is an AST (abstract syntax tree) of queries.
    • What does that actually mean?
    • Write your queries in JSON, which can be arbitrarily complex.
    query DSL


  54. simple DSL match query
    {
      "query" : {
        "match" : { "title" : "Python" }
      }
    }


  55. exercise 4
    • Run this query (exercise 4).
    % curl -XGET 'http://localhost:9200/test/book/_search' -d '{
      "query" : {
        "match" : { "title" : "Python" }
      }
    }'
    (What do you notice about the results?)


  56. • Filters:
    • Boolean: document matches or it does not
    • Order of magnitude faster than queries
    • Use for exact values
    • Cacheable
    queries vs. filters


  57. • Queries:
    • Use for full text searches
    • Relevance scored
    Filter when you can; query when you must.
    queries vs. filters


  58. use them together!
    curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
    '{
      "query": {
        "filtered": {
          "filter": {
            "term": {
              "category": "Web Development"
            }
          },
          "query": {
            "bool": {
              "should": [
                {
                  "match": {
                    "title": "Python"
                  }
                },
                {
                  "match": {
                    "summary": "Python"
                  }
                }
              ]
            }
          }
        }
      }
    }'


  59. exercise 5
    • Let’s run that query.
    • (This is exercise_5)


  60. exercise 5 results
    • Where are my results???


  61. exercise 6
    • Similar to many relational databases, Elasticsearch supports an
    explainer. Let’s run it on this query.
    • (This is exercise_6)
    curl -XGET -s 'http://localhost:9200/test/book/4/_explain?pretty=true' -d \
    '{
      "query": {
        "filtered": {
          "filter": {
            "term": {
              "category": "Web Development"
            }
          },
          "query": { …


  62. exercise 6 results
    {
      "_index" : "test",
      "_type" : "book",
      "_id" : "4",
      "matched" : false,
      "explanation" : {
        "value" : 0.0,
        "description" : "failure to match filter: cache(category:Web Development)",
        "details" : [ {
    …


  63. • This is a classic beginner gotcha.
    • Using the standard analyzer, applied to all fields (by default)
    “Web Development” will be broken into the terms “web” and
    “development” and those will be indexed.
    • The term “Web Development” is not indexed anywhere.
    analyze that!


  64. but match works!
    • term queries or filters look for an exact match, so find nothing
    • But {"match": {"category": "Web Development"}} does work. Why?
    • match queries use analysis: they break this down into
    searches for “web” or “development”


  65. exercise 7
    • Let’s make it work.
    • One solution is in exercise_7.
    • Take a couple minutes before peeking.
    • TMTOWTDI


  66. • Term queries look for the whole term and are not analyzed.
    • Match queries are analyzed, and look for matches to the analyzed
    parts of the query.
    summary: term vs. match
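
The term-versus-match distinction can be simulated in a few lines of Python. This is a toy model, assuming a rough stand-in for the standard analyzer (split on non-alphanumerics, lowercase); the function names are hypothetical and none of this calls Elasticsearch.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: split and lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

# What ends up in the index for a doc whose category is "Web Development".
indexed_terms = set(analyze("Web Development"))

def term_matches(value):
    """term: the value is looked up as-is, with no analysis."""
    return value in indexed_terms

def match_matches(value):
    """match: the value is analyzed first; any resulting term may hit."""
    return any(token in indexed_terms for token in analyze(value))

print(sorted(indexed_terms))             # ['development', 'web']
print(term_matches("Web Development"))   # False
print(term_matches("web"))               # True
print(match_matches("Web Development"))  # True
```

The exact string "Web Development" was never stored as a term, which is why the term filter in exercise 5 came back empty.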


  67. curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
    '{
    "query": {
    "match_phrase": {
    "summary": {
    "query": "old versions of browsers",
    "slop": 2
    }
    }
    }
    }'
    match_phrase


  68. boolean queries
    • Where are my favorites, AND, OR, and NOT?
    • Tortured syntax of the bool query:
      • must: everything in the must clause is AND
      • should: everything in the should clause is OR
      • must_not: you guessed it (NOT).
    • Nest them as much as you like


  69. • minimum_should_match is the number of should clauses that
    have to match.
    boolean bonuses
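
As a rough mental model of how the bool clauses and `minimum_should_match` combine, here is a small Python sketch that evaluates them against a document's set of terms. It is a simplification for illustration (the `bool_matches` helper is hypothetical, and real Elasticsearch also scores the matches and has its own default for `minimum_should_match`).

```python
def bool_matches(doc_terms, must=(), should=(), must_not=(),
                 minimum_should_match=0):
    """Return True if a document's terms satisfy the bool clauses."""
    # must: every clause has to match (AND).
    if any(term not in doc_terms for term in must):
        return False
    # must_not: no clause may match (NOT).
    if any(term in doc_terms for term in must_not):
        return False
    # should: at least minimum_should_match clauses have to match (OR-ish).
    matched = sum(term in doc_terms for term in should)
    return matched >= minimum_should_match

doc = {"python", "web", "development"}
print(bool_matches(doc, must=["python"], must_not=["java"]))   # True
print(bool_matches(doc, should=["web", "ruby", "perl"],
                   minimum_should_match=2))                    # False
print(bool_matches(doc, should=["web", "python", "perl"],
                   minimum_should_match=2))                    # True
```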


  70. "query": {
      "bool": {
        "must": {
          "bool": {
            "should": [
              {
                "match": {
                  "category": "development"
                }
              },
              {
                "match": {
                  "category": "programming"
                }
              }
            ]
          }
        },
        "should": [
          {
            "match": {…


  71. exercise 8
    • Run this query - it’s in exercise_8
    • Can you modify it to find books for intermediate or above level
    programmers?


  72. • We’re actually not going to cover faceting - deprecated in favor of
    aggregations.
    faceting


  73. • Aggregations let you put returned documents into buckets and
    run metrics over those buckets.
    • Useful for drill down navigation of data.
    aggregations


  74. exercise 9
    • Run a sample aggregation - exercise_9
    curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
    '{
      "size" : 0,
      "aggs" : {
        "category" : {
          "terms" : {
            "field" : "category"
          }
        }
      }
    }'
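
Conceptually, a terms aggregation puts documents into one bucket per distinct value of a field and counts each bucket. The Python sketch below mimics that with a `Counter`. The book list is made-up sample data, and the sketch treats each category as a single unanalyzed value; on an analyzed field the buckets would be the individual terms instead.

```python
from collections import Counter

# Hypothetical sample documents, not the tutorial's data set.
books = [
    {"title": "Fish in Python", "category": "programming"},
    {"title": "Pigs on the Web", "category": "web development"},
    {"title": "More Python", "category": "programming"},
]

def terms_aggregation(docs, field):
    """Bucket docs by the value of `field`, largest bucket first."""
    counts = Counter(doc[field] for doc in docs)
    return [{"key": key, "doc_count": count}
            for key, count in counts.most_common()]

buckets = terms_aggregation(books, "category")
print(buckets)
# [{'key': 'programming', 'doc_count': 2},
#  {'key': 'web development', 'doc_count': 1}]
```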


  75. • You can affect the way ES calculates relevance scores for results.
    For example:
    • Boost: weigh one part of a query more heavily than others
    • Custom function-scoring queries: e.g. weighting more
    complete user profiles
    • Constant score queries: pre-set a score for part of a query
    (useful for filters!)
    scoring


  76. boosting
    "query": {
    "bool": {
    "should": [
    {
    "term": {
    "title": {
    "value": "python",
    "boost": 2.0
    }
    }
    },
    {
    "term": {
    "summary": "python"
    }
    }
    ]
    }
    }


  77. function scoring
    curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
    '{
    "query": {
    "function_score": {
    "query": {
    "match": { "title": "Python" }
    },
    "script_score": {
    "script": "_score * doc[\"rating\"].value"
    }
    }
    }
    }'


  78. • You have various options for writing your functions:
    • Default has been mvel but is now Groovy
    • Plugins for:
    • JS
    • Python
    • Clojure
    • mvel
    scripting languages


  79. analysis


  80. stock analyzers
    original: Red-orange gerbils live at #43A Franklin St.
    whitespace: Red-orange gerbils live at #43A Franklin St.
    standard: red orange gerbils live 43a franklin st
    simple: red orange gerbils live at a franklin st
    stop: red orange gerbils live franklin st
    snowball: red orang gerbil live 43a franklin st
    • stopwords
    • stemming
    • punctuation
    • case-folding
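
Three of the stock analyzers above are simple enough to approximate in Python, which makes the differences concrete. These are approximations for illustration only: the function names are hypothetical and the stopword set is a small subset chosen for this example, not Lucene's full English list.

```python
import re

TEXT = "Red-orange gerbils live at #43A Franklin St."

# Illustrative subset of English stopwords (an assumption, not the full list).
STOPWORDS = {"a", "an", "and", "at", "be", "for", "is", "that", "the", "to", "with"}

def whitespace_analyzer(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()

def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]

def stop_analyzer(text):
    """Like the simple analyzer, but drop stopwords."""
    return [token for token in simple_analyzer(text) if token not in STOPWORDS]

print(" ".join(whitespace_analyzer(TEXT)))  # Red-orange gerbils live at #43A Franklin St.
print(" ".join(simple_analyzer(TEXT)))      # red orange gerbils live at a franklin st
print(" ".join(stop_analyzer(TEXT)))        # red orange gerbils live franklin st
```

Note how "#43A" collapses to the stopword "a" under the simple analyzer, which the stop analyzer then removes.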


  81. curl -XGET -s 'http://localhost:9200/_analyze?analyzer=whitespace&pretty=true' -d 'Red-orange gerbils live at #43A Franklin St.'
    {
      "tokens" : [ {
        "token" : "Red-orange",
        "start_offset" : 0,
        "end_offset" : 10,
        "type" : "word",
        "position" : 1
      }, {
        "token" : "gerbils",
        "start_offset" : 11,
        "end_offset" : 18,
        "type" : "word",
        "position" : 2
      }, ...


  82. exercise: find 10 stopwords
    curl -XGET -s 'http://localhost:9200/_analyze?analyzer=stop&pretty=true' -d 'The word "an" is a stopword.'
    Hint: Run the above and see what happens.


  83. solution: find 10 stopwords
    curl -XGET -s 'http://localhost:9200/_analyze?analyzer=stop&pretty=true' -d 'The an is a with that be for to and snookums'
    {
      "tokens" : [ {
        "token" : "snookums",
        "start_offset" : 36,
        "end_offset" : 44,
        "type" : "word",
        "position" : 11
      } ]
    }
    positions: [0,1,2] [0,2] [2] [4] [3] [0,1,2]


  84. applying mappings to properties
    curl -s -XPUT 'http://127.0.0.1:9200/test/_mapping/album' -d '{
    "properties": {
    "title": {
    "type": "string"
    },
    "description": {
    "type": "string",
    "analyzer": "snowball"
    },
    ...
    }
    }'


  85. analyzer internals
    name_analyzer: CharFilter → Tokenizer → Token Filter → terms
    O Brien


  86. "analysis": {
    "analyzer": {
    "name_analyzer": {
    "type": "custom",
    "tokenizer": "name_tokenizer",
    "filter": ["lowercase"]
    }
    },
    "tokenizer": {
    "name_tokenizer": {
    "type": "pattern",
    "pattern": "[^a-zA-Z']+"
    }
    }
    }
    name_analyzer: CharFilter → Tokenizer → Token Filter → terms
    O’Brien
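
To see what the `name_analyzer` settings above do to a name, here is a Python approximation: a pattern tokenizer that splits on runs of characters matching `[^a-zA-Z']+`, followed by a lowercase token filter. It is a sketch of the idea, not Elasticsearch's implementation, and it uses a plain ASCII apostrophe, since a typographic one would be treated as a separator by this pattern.

```python
import re

def name_analyzer(text):
    """Approximate the custom analyzer: pattern tokenizer + lowercase filter."""
    # Tokenizer: the pattern describes the separators, not the tokens.
    tokens = re.split(r"[^a-zA-Z']+", text)
    # Token filter: lowercase, dropping empty strings left by the split.
    return [token.lower() for token in tokens if token]

print(name_analyzer("Conan O'Brien"))   # ['conan', "o'brien"]
print(name_analyzer("Smith-Jones, Al")) # ['smith', 'jones', 'al']
```

Because the apostrophe is excluded from the separator class, "O'Brien" survives as a single term.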


  87. exercise: write a custom analyzer
    tags: "red, two-headed, striped, really dangerous"

    curl -XGET -s 'http://localhost:9200/_analyze?analyzer=whitespace&pretty=true' -d 'red, two-headed, striped, really dangerous'
    red two-headed striped really dangerous

    curl -s -XGET 'http://127.0.0.1:9200/test/monster/_search?pretty' -d '{
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {"tags": "dangerous"}
      }
    }'
    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 1.0,
        "hits" : [ {
          "_index" : "test",
          "_type" : "monster",
          "_id" : "1",
          "_score" : 1.0, "_source" : {
            "title": "Scarlet Klackinblax",
            "tags": "red, two-headed, striped, really dangerous"
          }
        } ]
      }
    }


  88. exercise: write a custom analyzer
    # How to update the "test" index's analyzers:
    curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
      "analysis": {
        "analyzer": {
          "whitespace_analyzer": {
            "filter": ["lowercase"],
            "tokenizer": "whitespace_tokenizer"
          }
        },
        "tokenizer": {
          "whitespace_tokenizer": {
            "type": "pattern",
            "pattern": " +"
          }
        }
      }
    }'

    # Try it out:
    curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=whitespace_analyzer&pretty=true' -d 'all your base are belong to us, dude'

    # Updating analyzers on an open index fails:
    {
      "error" : "ElasticsearchIllegalArgumentException[Can't update non dynamic settings[[index.analysis.analyzer.comma_delim.filter.0, index.analysis.tokenizer.comma_delim_tokenizer.type, index.analysis.tokenizer.comma_delim_tokenizer.pattern, index.analysis.analyzer.comma_delim.tokenizer]] for open indices[[test]]]",
      "status" : 400
    }

    # So close the index before updating its settings, and reopen it afterward:
    curl -s -XPOST 'http://localhost:9200/test/_close'
    curl -s -XPOST 'http://localhost:9200/test/_open'


  89. solution: write a custom analyzer
    curl -s -XPUT 'http://localhost:9200/test/_settings?pretty' -d '{
      "analysis": {
        "analyzer": {
          "comma_delim": {
            "filter": ["lowercase"],
            "tokenizer": "comma_delim_tokenizer"
          }
        },
        "tokenizer": {
          "comma_delim_tokenizer": {
            "type": "pattern",
            "pattern": ", +"
          }
        }
      }
    }'
    curl -XGET -s 'http://localhost:9200/test/_analyze?analyzer=comma_delim&pretty=true' -d 'red, two-headed, striped, really dangerous'
    "token": "red" ... "token": "two-headed" ... "token": "striped" ... "token": "really dangerous"


  90. ngrams
    'analyzer': {
        # A lowercase trigram analyzer
        'trigramalyzer': {
            'filter': ['lowercase'],
            'tokenizer': 'trigram_tokenizer'
        }
    },
    'tokenizer': {
        'trigram_tokenizer': {
            'type': 'nGram',
            'min_gram': 3,
            'max_gram': 3
            # Keeps all kinds of chars by default.
        }
    }

    “Chemieingenieurwesen” → …ing nge gen eni nie ieu eur…
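
A trigram tokenizer followed by a lowercase filter amounts to a sliding window of three characters over the lowercased text. The short Python sketch below reproduces the slide's example; `trigrams` is a hypothetical helper, not part of any Elasticsearch client.

```python
def trigrams(text, n=3):
    """Return every run of n consecutive characters in the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = trigrams("Chemieingenieurwesen")
print(len(grams))    # 18
print(grams[6:13])   # ['ing', 'nge', 'gen', 'eni', 'nie', 'ieu', 'eur']
```

A search for "ingenieur" produces trigrams that all occur inside the compound word, which is how ngrams help with languages that glue words together.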


  91. clustering


  92. shards
    curl -XPUT 'http://localhost:9200/twitter/' -d '
    index:
    number_of_shards: 3
    '


  93. replicas
    curl -XPUT 'http://localhost:9200/twitter/' -d '
    index:
    number_of_shards: 3
    number_of_replicas: 2
    '


  94. exercise: provisioning
    How would you provision a cluster if we were doing lots of CPU-
    expensive queries on a large corpus, but only a small subset of the
    corpus was “hot”?


  95. extremer extremes


  96. recommendations
    • At least 1 replica
    • Plenty of shards, but not a million
    • At least 3 nodes. Avoid split-brain:
    discovery.zen.minimum_master_nodes: 2
    • Get unlucky?
    Set fire to the data center and walk away.
    Or continually repopulate.
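
The value 2 for a 3-node cluster follows the usual quorum rule: a master may only be elected when a strict majority of master-eligible nodes can see each other. A tiny Python sketch of that arithmetic follows; the function simply shares its name with the setting, for illustration.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

for nodes in (3, 5, 7):
    print(nodes, "->", minimum_master_nodes(nodes))
# 3 -> 2
# 5 -> 3
# 7 -> 4
```

With 2 nodes the quorum is also 2, so losing either node stops master election. That is one reason to run at least 3.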


  97. real-life examples


  98. too friendly
    • Protect with a firewall, or try elasticsearch-jetty.
    • discovery.zen.ping.multicast.enabled: false
    • discovery.zen.ping.unicast.hosts: ["master1", "master2"]
    • cluster.name: something_weird


  99. adding nodes without downtime
    • Puppet out new config file:
    discovery.zen.ping.unicast.hosts: ["old.example.com", ..., "new.example.com"]
    • Bring up the new node.


  100. beware inconsistent config


  101. be wary of upgrades


  102. monitoring
    curl -XGET -s 'http://localhost:9200/_cluster/health?pretty'
    {
    "cluster_name" : "grinchertoo",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 1,
    "active_primary_shards" : 29,
    "active_shards" : 29,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 26
    }
    curl -XGET -s 'http://localhost:9200/_cluster/state?pretty'
    {
    "cluster_name" : "elasticsearch",
    "version" : 3,
    "master_node" : "ACuIytIIQ7G7b_Rg_G7wnA",


  103. exercise: monitoring
    Why is just checking for cluster color insufficient?
    What could we check in addition?
    "cluster_name" : "grinchertoo",
    "status" : "yellow",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 1,
    "active_primary_shards" : 29,
    "active_shards" : 29,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 26
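
One possible answer to the exercise, sketched in Python: look at node count and unassigned shards as well as the color. This is only a suggestion, not the presenters' solution. The `health_problems` helper and the `expected_nodes` parameter are hypothetical, and the dict below just copies the fields from the health output above.

```python
def health_problems(health, expected_nodes):
    """Return a list of human-readable problems found in a cluster health dict."""
    problems = []
    if health["status"] != "green":
        problems.append("status is %s" % health["status"])
    if health["number_of_nodes"] < expected_nodes:
        problems.append("only %d of %d nodes present"
                        % (health["number_of_nodes"], expected_nodes))
    if health["unassigned_shards"] > 0:
        problems.append("%d unassigned shards" % health["unassigned_shards"])
    if health["timed_out"]:
        problems.append("health request timed out")
    return problems

# Fields copied from the sample health response on the slide.
sample_health = {
    "cluster_name": "grinchertoo",
    "status": "yellow",
    "timed_out": False,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 29,
    "active_shards": 29,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 26,
}
for problem in health_problems(sample_health, expected_nodes=3):
    print(problem)
```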


  104. monitoring: elasticsearch-paramedic
    http://karmi.github.com/elasticsearch-paramedic/


  105. monitoring: marvel
    http://www.elasticsearch.org/overview/marvel/


  106. optimization


  107. bootstrap.mlockall: true


  108. ES_HEAP_SIZE:
    half of RAM

    View Slide

  109. open files
    /etc/security/limits.conf:
    es_user soft nofile 65535
    es_user hard nofile 65535
    /etc/init.d/elasticsearch:
    ulimit -n 65535
    ulimit -l unlimited


  110. Use default stores.


  111. RAM & JVM tuning


  112. MySQL


  113. shrinking indices
    % vmstat -S m -a 2
    procs -----------memory---------- ---swap-- -----io----
     r  b   swpd   free  inact active   si   so    bi    bo
     1  0      4     37     54     55    0    0     0     1
     0  0      4     37     54     55    0    0     0     0
     0  0      4     37     54     55    0    0     0     0

    "some_doctype" : {
      "_source" : {"enabled" : false}
    }
    "some_doctype" : {
      "_all" : {"enabled" : false}
    }
    "some_doctype" : {
      "some_field" : {"include_in_all" : false}
    }


  114. filter caching
    "filter": {
    "terms": {
    "tags": ["red", "green"],
    "execution": "plain"
    }
    }
    "filter": {
    "terms": {
    "tags": ["red", "green"],
    "execution": "bool"
    }
    }


  115. dealing with the future


  116. mappings


  117. expensive updates


  118. how to reindex
    • Use Bulk API.
    • Turn off auto-refresh:
    curl -XPUT localhost:9200/test/_settings -d '{
      "index" : {
        "refresh_interval" : "-1"
      }
    }'
    • index.merge.policy.merge_factor: 1000
    • Remove replicas if you can.
    • Use multiple feeder processes.
    • Put everything back.


  119. backups
    • Backups used to be fairly cumbersome but now there’s an API for that!
    • Set it up:
    curl -XPUT 'http://localhost:9200/_snapshot/backups' -d '{
      "type": "fs",
      "settings": {
        "location": "/somewhere/backups",
        "compress": true
      }
    }'
    • Run a backup:
    curl -XPUT "localhost:9200/_snapshot/backups/july20"


  120. fancy & advanced features


  121. synonyms
    "filter": {
    "synonym": {
    "type": "synonym",
    "synonyms": [
    "albert => albert, al",
    "allan => allan, al"
    ]
    }
    }
    original query: Allan Smith
    after synonyms: [allan, al] smith
    original query: Albert Smith
    after synonyms: [albert, al] smith
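
The expansion shown above can be mimicked in Python to make the behavior concrete. This is a toy model of applying the two synonym rules to a query; the `SYNONYMS` table mirrors the rules on the slide and `expand_query` is a hypothetical helper, not an Elasticsearch API.

```python
# Mirrors the rules "albert => albert, al" and "allan => allan, al".
SYNONYMS = {
    "albert": ["albert", "al"],
    "allan": ["allan", "al"],
}

def expand_query(query):
    """Lowercase each query word and replace it with its synonym list, if any."""
    return [SYNONYMS.get(word, [word]) for word in query.lower().split()]

print(expand_query("Allan Smith"))   # [['allan', 'al'], ['smith']]
print(expand_query("Albert Smith"))  # [['albert', 'al'], ['smith']]
```

Both queries end up sharing the term "al", so a document indexed under "Al Smith" can match either.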


  122. • You can set up synonyms at indexing or at query time.
    • For all that’s beautiful in this world, do it at query time.
    • At indexing explodes your data size.
    • You can store synonyms in a file, and reference that file in your
    mapping.
    • Many gotchas.
    • Undocumented limits on the file.
    • Needs to be uploaded to the config dir on each node.
    synonym gotchas


  123. • Use to suggest possible search terms, or complete queries
    • Types:
    • Term and Phrase - will do spelling corrections
    • Completion - for autocomplete
    • Context - limit suggestions to a subset
    suggesters


  124. • Why? Hook your query up to JS and query-as-they-type
    • Completion suggester (faster, newer, slightly cumbersome)
    • Prefix queries (slower, older, more reliable)
    • Both require mapping changes to work
    autocompletion


  125. suggester autocompletion
    curl -X POST 'localhost:9200/test/books/_suggest?pretty' -d '{
      "title-suggest" : {
        "text" : "p",
        "completion" : {
          "field" : "suggest"
        }
      }
    }'


  126. curl -XGET -s 'http://localhost:9200/test/book/_search?pretty=true' -d \
    '{
    "query": {
    "prefix": { "title": "P" }
    }
    }'
    prefix autocompletion


  127. thank you
    @ErikRose
    [email protected]
    @lxt
    [email protected]
