housekeeping
• Make sure ES is installed. If you haven’t installed it yet and you’re on a Mac, just install 1.1.x.
• Exercise code: clone the git repo at (or just visit) https://github.com/erikrose/oscon-elasticsearch/
• Make faces.
CAP
• CAP: consistency, availability, partition tolerance
• “pick any two”
• “When it comes to CAP, in a very high level, elasticsearch gives up on partition tolerance” (2010)
what it’s good for, redux
• Generally not suitable as a primary data store.
• It’s a distributed search engine
• Easy to get started
• Easy to integrate with your existing web app
• Easy to configure it not-too-terribly
• Enables fast search with cool features
nodes and clusters
• node: a machine in your cluster
• cluster: the set of nodes running ES
• master node: elected by the cluster. If the master fails, another node will take over.
shards and replicas
• shard: a Lucene index. Each piece of data you store is written to a primary shard. Primary shards are distributed over the cluster.
• replica: each shard has a set of distributed replicas (copies). Data written to a primary shard is copied to replicas on different nodes.
exercise: fix clustering and listening

# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
#
# 1. Disable multicast discovery (enabled by default):
# discovery.zen.ping.multicast.enabled: false

# Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
# on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
# communication. (the range means that if the port is busy, it will automatically
# try the next port).

# Set the bind address specifically (IPv4 or IPv6):
# network.bind_host: 127.0.0.1
indices on properties
"title": "All About Fish", "author": "Fishy McFishstein", "pages": 3015
"title": "Nothing About Pigs", "author": "Nopiggy Nopigman", "pages": 0
"title": "All About Everything", "author": "Everybody", "pages": 4294967295
exercise: use explicit mappings
1. Delete the “album” doctype, if you’ve made one by following along.
2. Think of an album which would prompt ES to guess a wrong type.
3. Insert it, and GET the _mapping to show the wrong guess.
4. Delete all “album” docs again so you can change the mapping.
5. Set a mapping explicitly so you can’t fool ES anymore.
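A minimal sketch of what step 5 might send, built as plain Python data so it runs without a cluster. The index name `music` and the fields `title` and `year` are assumptions for illustration; the slides only name the “album” doctype. The mapping uses the 1.x `string` type.

```python
import json

# Hypothetical index name; the slides only name the "album" doctype.
INDEX = "music"

# Explicit mapping for the "album" type. Field names are illustrative assumptions.
album_mapping = {
    "album": {
        "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
        }
    }
}

# An album whose title is sent as a JSON number. With dynamic mapping,
# ES would guess a numeric type for "title" from a document like this.
tricky_album = {"title": 1984, "year": 1984}


def mapping_request(index, doctype, mapping):
    """Return (method, path, body) for a put-mapping call."""
    return "PUT", "/%s/_mapping/%s" % (index, doctype), json.dumps(mapping)


method, path, body = mapping_request(INDEX, "album", album_mapping)
print(method, path)
print(body)
```

The printed path and body can be pasted into a curl call against your own cluster once the old “album” docs are gone.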
queries
• Query ES via HTTP/REST
• Possible to do with query string
• DSL is better
• Let’s write some queries.
• But first, let’s get some data in our cluster to query.
exercise 1
• Bulk load a small test data set to use for querying.
• This is exercise_1 in the queries/ directory of the git repo, so you can cut and paste, or execute it directly.

% curl -XPOST localhost:9200/_bulk --data-binary @data.bulk
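For reference, a sketch of how a bulk file like data.bulk is laid out: one action line, then one source line per document, newline-delimited, with a trailing newline. The documents below are borrowed from the earlier book examples and are not the contents of the repo's data.bulk; the `test` index and `book` type match the exercises.

```python
import json


def bulk_body(index, doctype, docs):
    """Build a newline-delimited bulk body from (id, document) pairs."""
    lines = []
    for doc_id, doc in docs:
        # Action line: tells ES where the following document goes.
        action = {"index": {"_index": index, "_type": doctype, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Illustrative documents, not the real exercise data.
docs = [
    (1, {"title": "All About Fish", "author": "Fishy McFishstein", "pages": 3015}),
    (2, {"title": "Nothing About Pigs", "author": "Nopiggy Nopigman", "pages": 0}),
]

body = bulk_body("test", "book", docs)
print(body, end="")
```

This is also why the curl command uses --data-binary: it sends the file as-is, newlines included.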
exercise 2
• Let’s check we can pull that data, by grabbing a single document.
• This is exercise_2 in the queries/ directory of the repo, so you can cut and paste.

% curl -s -XGET 'http://127.0.0.1:9200/test/book/1?pretty'
exercise 3
• We’ll begin by using a URI search (sometimes called, a little fuzzily, a query string query).
• (This is exercise_3)

% curl -s -XGET 'http://127.0.0.1:9200/test/book/_search?q=title:Python'
limited appeal
• Passes searches via GET in the query string
• This is fine for running simple queries, basic “is it working” type tests and so on.
• Once you have any level of complexity in your query, you’re going to need the query DSL.
query DSL
• DSL == Domain Specific Language
• DSL is an AST (abstract syntax tree) of queries.
• What does that actually mean?
• Write your queries in JSON, which can be arbitrarily complex.
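To make “queries in JSON” concrete, here is a sketch of a DSL body that asks roughly the same question as the earlier URI search for title:Python. It uses a match query as the nearest simple equivalent; the body would be sent to the same `_search` endpoint.

```python
import json


def match(field, text):
    """Build a match query node: one leaf of the query tree."""
    return {"match": {field: text}}


# The whole request body is a tree: "query" at the root, a match node below it.
request_body = {"query": match("title", "Python")}

print(json.dumps(request_body, indent=2))
```

Because each node is just nested JSON, more complex queries are built by putting nodes like this inside other nodes.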
analyze that!
• This is a classic beginner gotcha.
• With the standard analyzer, which is applied to all fields by default, “Web Development” is broken into the terms “web” and “development”, and those are what get indexed.
• The term “Web Development” is not indexed anywhere.
but match works!
• term queries or filters look for an exact match, so they find nothing
• But a match query for “Web Development” does work. Why?
• match queries use analysis: they break this down into searches for “web” or “development”
summary: term vs. match
• Term queries look for the whole term and are not analyzed.
• Match queries are analyzed, and look for matches to the analyzed parts of the query.
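The difference can be seen in a toy model. This is not how ES is implemented: the `analyze` function below is a crude stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and the documents are made up for illustration.

```python
import re


def analyze(text):
    """Toy stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_search(index, term):
    """Term lookup: the input is NOT analyzed, it must equal an indexed term."""
    return index.get(term, set())


def match_search(index, text):
    """Match lookup: analyze the input, then OR together the per-term hits."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


# Hypothetical documents (id -> category field).
docs = {1: "Web Development", 2: "Fish Development", 3: "Gerbil Care"}
index = build_index(docs)

print(term_search(index, "Web Development"))   # set(): no such term was indexed
print(match_search(index, "Web Development"))  # {1, 2}: "web" OR "development"
print(term_search(index, "web"))               # {1}: the analyzed term does exist
```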
boolean queries
• Where are my favorites, AND, OR, and NOT?
• Tortured syntax of the bool query:
  • must: everything in the must clause is AND
  • should: everything in the should clause is OR
  • must_not: you guessed it (NOT).
• Nest them as much as you like
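A sketch of a bool query body using all three clauses. The field names come from the book examples earlier in the deck; the search terms are made up.

```python
import json

bool_query = {
    "query": {
        "bool": {
            # AND: every clause here has to match.
            "must": [{"match": {"title": "fish"}}],
            # OR: matching any of these raises the score. When there is no
            # must clause, at least one should clause has to match.
            "should": [
                {"match": {"author": "McFishstein"}},
                {"match": {"author": "Everybody"}},
            ],
            # NOT: documents matching any of these are excluded.
            "must_not": [{"match": {"title": "pigs"}}],
        }
    }
}

print(json.dumps(bool_query, indent=2))
```

Nesting works by placing another `{"bool": {...}}` node anywhere a match node appears above.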
aggregations
• Aggregations let you put returned documents into buckets and run metrics over those buckets.
• Useful for drill-down navigation of data.
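As a sketch of “buckets plus metrics”, the body below buckets books by author and computes the average page count per bucket. The aggregation names `by_author` and `avg_pages` are arbitrary labels chosen here; the fields come from the book examples.

```python
import json

agg_request = {
    # Skip the hits themselves; only the aggregation results are wanted.
    "size": 0,
    "aggs": {
        "by_author": {
            # Bucketing: one bucket per distinct term in the author field.
            # On an analyzed string field the buckets are per token, not per full name.
            "terms": {"field": "author"},
            "aggs": {
                # Metric: computed separately inside each bucket.
                "avg_pages": {"avg": {"field": "pages"}}
            },
        }
    },
}

print(json.dumps(agg_request, indent=2))
```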
scoring
• You can affect the way ES calculates relevance scores for results. For example:
  • Boost: weigh one part of a query more heavily than others
  • Custom function-scoring queries: e.g. weighting more complete user profiles
  • Constant score queries: pre-set a score for part of a query (useful for filters!)
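Two of these options, sketched as request bodies. The fields are from the book examples; the search term and the boost values are arbitrary numbers picked for illustration.

```python
import json

# Boost: a hit in the title is weighted more heavily than a hit in the author.
boosted = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "fish", "boost": 2}}},
                {"match": {"author": "fish"}},
            ]
        }
    }
}

# Constant score: every document passing the filter gets the same preset score.
constant = {
    "query": {
        "constant_score": {
            "filter": {"term": {"pages": 0}},
            "boost": 1.5,
        }
    }
}

for body in (boosted, constant):
    print(json.dumps(body, indent=2))
```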
scripting languages
• You have various options for writing your functions:
  • The default was mvel but is now Groovy
  • Plugins for:
    • JS
    • Python
    • Clojure
    • mvel
stock analyzers
original:    Red-orange gerbils live at #43A Franklin St.

whitespace:  Red-orange gerbils live at #43A Franklin St.
standard:    red orange gerbils live 43a franklin st
simple:      red orange gerbils live at a franklin st
stop:        red orange gerbils live franklin st
snowball:    red orang gerbil live 43a franklin st

• stopwords
• stemming
• punctuation
• case-folding
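Three of these rows can be imitated in a few lines. These are toy approximations, not the real Lucene analyzers, and `STOPWORDS` holds only the two words visibly dropped in the table, not the real list.

```python
import re

TEXT = "Red-orange gerbils live at #43A Franklin St."

# Tiny illustrative subset: just the words the table shows being removed.
STOPWORDS = {"a", "at"}


def whitespace(text):
    """Split on whitespace only; case and punctuation are left alone."""
    return text.split()


def simple(text):
    """Keep runs of letters and lowercase them; digits and punctuation vanish."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def stop(text):
    """Same as simple, then drop stopwords."""
    return [token for token in simple(text) if token not in STOPWORDS]


for analyzer in (whitespace, simple, stop):
    print("%-11s %s" % (analyzer.__name__ + ":", " ".join(analyzer(TEXT))))
```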
exercise: find 10 stopwords

curl -XGET -s 'http://localhost:9200/_analyze?analyzer=stop&pretty=true' -d 'The word "an" is a stopword.'

Hint: Run the above and see what happens.
exercise: provisioning
How would you provision a cluster if we were doing lots of CPU-expensive queries on a large corpus, but only a small subset of the corpus was “hot”?
recommendations
• At least 1 replica
• Plenty of shards, but not a million
• At least 3 nodes. Avoid split-brain: discovery.zen.minimum_master_nodes: 2
• Get unlucky? Set fire to the data center and walk away. Or continually repopulate.
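The value 2 for three nodes follows the usual quorum rule: a strict majority of the master-eligible nodes. A small sketch, assuming every node in the cluster is master-eligible:

```python
def minimum_master_nodes(master_eligible):
    """Strict majority of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


for nodes in (3, 4, 5):
    print(nodes, "nodes ->", minimum_master_nodes(nodes))
```

With a majority required, two halves of a partitioned cluster cannot both elect a master.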
too friendly
• Protect with a firewall, or try elasticsearch-jetty.
• discovery.zen.ping.multicast.enabled: false
• discovery.zen.ping.unicast.hosts: ["master1", "master2"]
• cluster.name: something_weird
adding nodes without downtime
• Puppet out new config file:
  discovery.zen.ping.unicast.hosts: ["old.example.com", ..., "new.example.com"]
• Bring up the new node.
backups
• Backups used to be fairly cumbersome but now there’s an API for that!
• Set it up:
  curl -XPUT 'http://localhost:9200/_snapshot/backups' -d '{
    "type": "fs",
    "settings": {
      "location": "/somewhere/backups",
      "compress": true
    }
  }'
• Run a backup:
  curl -XPUT "localhost:9200/_snapshot/backups/july20"
synonyms
"filter": {
  "synonym": {
    "type": "synonym",
    "synonyms": [
      "albert => albert, al",
      "allan => allan, al"
    ]
  }
}

original query: Allan Smith
after synonyms: [allan, al] smith

original query: Albert Smith
after synonyms: [albert, al] smith
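A toy model of what those two rules do to a query. It only handles the single-word `left => a, b` form shown above and lowercases by splitting on spaces; the real synonym filter does considerably more.

```python
def parse_rules(rules):
    """Turn 'left => a, b' strings into a {left: [a, b]} lookup table."""
    table = {}
    for rule in rules:
        left, right = rule.split("=>")
        table[left.strip()] = [word.strip() for word in right.split(",")]
    return table


def expand(query, table):
    """Replace each lowercased token with its synonym list, or itself if no rule applies."""
    return [table.get(token, [token]) for token in query.lower().split()]


rules = ["albert => albert, al", "allan => allan, al"]
table = parse_rules(rules)

print(expand("Allan Smith", table))   # [['allan', 'al'], ['smith']]
print(expand("Albert Smith", table))  # [['albert', 'al'], ['smith']]
```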
synonym gotchas
• You can set up synonyms at indexing or at query time.
• For all that’s beautiful in this world, do it at query time.
• Doing it at indexing time explodes your data size.
• You can store synonyms in a file, and reference that file in your mapping.
  • Many gotchas.
  • Undocumented limits on the file.
  • Needs to be uploaded to the config dir on each node.
suggesters
• Use to suggest possible search terms, or complete queries
• Types:
  • Term and Phrase: will do spelling corrections
  • Completion: for autocomplete
  • Context: limit suggestions to a subset
autocompletion
• Why? Hook your query up to JS and query-as-they-type
• Completion suggester (faster, newer, slightly cumbersome)
• Prefix queries (slower, older, more reliable)
• Both require mapping changes to work