• schemaless document store (JSON in, JSON out)
• RESTful HTTP interface on :9200
• full-text search engine
• geographical search of many kinds
• faceting engine
• NoSQL database
• developed in Java, Apache License 2.0
Elasticsearch          RDBMS
Index**                Database
Document type          Schema for a table
Document               Row in table, like a hash table
Field                  Column
Document ID / _id      Primary key
Filter, Query          SQL Select
Shard, Replica         Partitioned table, Replication?

** "index" is also used as a verb: to index a document. This is equivalent to an INSERT OR UPDATE statement in an RDBMS.
*** Strings, numerics, geographical coordinates, attachments, arrays, subdocuments, nested docs
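A minimal sketch of "index" as a verb, assuming a local node on :9200 and the twitter/tweet names used elsewhere in these slides. The document ID, the message field, and the helper names doc_url and index_doc are made up for illustration.

```shell
# Assumption: a node listening locally on the HTTP port the slides mention.
ES_URL="http://localhost:9200"

# Address one document as /<index>/<type>/<id>.
doc_url() {
  printf '%s/%s/%s/%s\n' "$ES_URL" "$1" "$2" "$3"
}

# PUT a JSON body to that address. Repeating the call with the same ID
# replaces the stored document, hence "INSERT OR UPDATE".
index_doc() {
  curl -s -XPUT "$(doc_url "$1" "$2" "$3")" -d "$4"
}

# Dry run: print the address without contacting the cluster.
doc_url twitter tweet 1
```

With a node running, index_doc twitter tweet 1 '{"user": "kimchy", "message": "hello"}' would store the document; running it again with a different body overwrites it.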
• RAM (not all as JVM heap, either)
• Pin pages in memory to avoid swapping with mlockall
• Create separate nodes for roles like master, data, query (HTTP load balancing)
• Tune settings for quorum & recovery (e.g. minimum master nodes)
• Pay attention to document locality (route to same shard)
• Monitor memory usage or face catastrophic failure
• Secure access to the HTTP interface (and port :9300, the transport client!)
• Scale by common patterns like per user or per time period vs
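Document locality can be sketched with the routing request parameter, which sends every request carrying the same value to the same shard. The helper name routed_url and the choice of the user name as the routing value are assumptions for illustration.

```shell
# Assumption: a node listening locally on :9200.
ES_URL="http://localhost:9200"

# Append a routing value to a request path.
# $1 = path without a query string, e.g. twitter/tweet/1
# $2 = routing value shared by documents that should live together
routed_url() {
  printf '%s/%s?routing=%s\n' "$ES_URL" "$1" "$2"
}

# Dry run: print the address without contacting the cluster.
routed_url twitter/tweet/1 kimchy
```

The same routing value has to be supplied again when fetching or searching, for example curl -XGET "$(routed_url twitter/_search kimchy)&q=tag:wow".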
to make it more searchable
• Faceting: representing the data in more than one way (_source, exact, tokenized)
• Two major types of search (usually combined for speed):
  • Filters (fast, cacheable) with boolean results - ["type": "string", "index": "not_analyzed"]
  • Queries (slow, not cacheable) with fuzzy scoring
Example of the ‘snowball’ analyzer:
GET /_analyze?analyzer=snowball&text=gator%20linux%20users%20group%20is%20awesome
Output: gator, linux, user, group, awesom
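Combining the two search types might look like the sketch below. It assumes the 1.x-era filtered query, which matches the not_analyzed mapping syntax on this slide; later releases replaced it with the bool query. The field names user and message and the helper name filtered_query are illustrative.

```shell
# Build a search body that pairs an exact-match filter with a scored query.
# Values are pasted in unescaped, so keep them free of quotes and backslashes.
# $1/$2 = filter field and exact value, $3/$4 = query field and text
filtered_query() {
  printf '{"query":{"filtered":{"filter":{"term":{"%s":"%s"}},"query":{"match":{"%s":"%s"}}}}}\n' \
    "$1" "$2" "$3" "$4"
}

# Print the body: tweets by kimchy, ranked by how well they match "linux".
filtered_query user kimchy message linux
```

The body could then be sent with curl -XGET 'http://localhost:9200/twitter/_search' -d "$(filtered_query user kimchy message linux)". The filter narrows the candidate set cheaply and only the survivors are scored.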
$ curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
    number_of_shards : 3
    number_of_replicas : 2
'
(That’s right, YAML. This is also where mappings might go.)
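A hedged sketch of the same request with a JSON body and a mapping included. It assumes 1.x-era mapping syntax; the tweet type and the user field are borrowed from the search examples in these slides, and the helper name index_body is made up.

```shell
# Print a create-index body with shard settings and one mapped field.
# $1 = number of shards, $2 = number of replicas
index_body() {
  cat <<EOF
{
  "settings": { "number_of_shards": $1, "number_of_replicas": $2 },
  "mappings": {
    "tweet": {
      "properties": {
        "user": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
EOF
}

# Dry run: show the body for 3 shards and 2 replicas.
index_body 3 2
```

It would be sent with curl -XPUT 'http://localhost:9200/twitter/' -d "$(index_body 3 2)". Marking user as not_analyzed keeps it as one exact value, which is what term filters expect.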
We can search on all documents across all types within the twitter index:
$ curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy'
We can also search within specific types:
$ curl -XGET 'http://localhost:9200/twitter/tweet,user/_search?q=user:kimchy'
We can also search all tweets with a certain tag across several indices (for example, when each user has his own index):
$ curl -XGET 'http://localhost:9200/kimchy,elasticsearch/tweet/_search?q=tag:wow'
Or we can search all tweets across all available indices using the _all placeholder:
$ curl -XGET 'http://localhost:9200/_all/tweet/_search?q=tag:wow'
Or even search across all indices and all types:
$ curl -XGET 'http://localhost:9200/_search?q=tag:wow'
www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html
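The URL patterns above follow one rule, sketched here as a small helper. The function name search_url is hypothetical; the index, type, and query values come from the examples.

```shell
# Assumption: a node listening locally on :9200.
ES_URL="http://localhost:9200"

# Build a URI search address from optional index and type lists.
# $1 = comma-separated indices (may be empty)
# $2 = comma-separated types (may be empty)
# $3 = query string such as user:kimchy
search_url() {
  path=""
  if [ -n "$1" ]; then
    path="/$1"
  elif [ -n "$2" ]; then
    # Types need an index segment in front of them; _all means every index.
    path="/_all"
  fi
  if [ -n "$2" ]; then
    path="$path/$2"
  fi
  printf '%s%s/_search?q=%s\n' "$ES_URL" "$path" "$3"
}

# Dry run: print the address for tweets tagged wow in two indices.
search_url "kimchy,elasticsearch" tweet tag:wow
```

Passing the result to curl -XGET runs the search; leaving both lists empty searches every index and type.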
if you can deal without a few features
• Logs and statistics: projects like Logstash and Kibana; GitHub uses it for exception tracking
• Fast visualizations: really anything where you want realtime filtering
• Make anything searchable…
The ‘unique holy triangle,’ in a single product: “data exploration capabilities, unstructured search, structured search, and aggregations or analytics.”
It can even accept a query first, and notify you of new search results once new documents are indexed.
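The "query first" feature is the percolator. A rough sketch of the two addresses involved, assuming the 1.x-era percolator API; the query ID wow-alert and the helper names are made up, and the twitter/tweet names come from the search examples in these slides.

```shell
# Assumption: a node listening locally on :9200.
ES_URL="http://localhost:9200"

# Where a stored query lives: /<index>/.percolator/<query id>.
percolator_url() {
  printf '%s/%s/.percolator/%s\n' "$ES_URL" "$1" "$2"
}

# Where a candidate document is matched against the stored queries.
percolate_url() {
  printf '%s/%s/%s/_percolate\n' "$ES_URL" "$1" "$2"
}

# Dry run: print both addresses without contacting the cluster.
percolator_url twitter wow-alert
percolate_url twitter tweet
```

Registering would be curl -XPUT "$(percolator_url twitter wow-alert)" -d '{"query": {"term": {"tag": "wow"}}}', and checking a new document would be curl -XGET "$(percolate_url twitter tweet)" -d '{"doc": {"tag": "wow"}}'. The response lists the stored queries the document matches; sending the actual notification is left to the application.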
has been some criticism of the documentation’s ‘findability’, and new tools and libraries are still emerging every week
• Schemas still matter: eventually, customers will want custom ‘mappings’, custom routing to drive data to particular shards, and aliases for those shards
• Typical Java tuning: garbage collection, locking/threading, I/O, and algorithm issues
• Make good choices: 1 index with 50 shards should perform the same as 50 indices with 1 shard, but the common patterns are index per user or index per time unit.
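One way routing and aliases combine for the per-user pattern is a filtered alias, sketched below as a body for the _aliases endpoint. The alias naming scheme user_<name>, the user field, and the helper name user_alias_body are assumptions for illustration.

```shell
# Body for the _aliases endpoint: point an alias at one index, restricted
# to one user's documents and routed to that user's shard.
# $1 = index, $2 = user; values are pasted in unescaped.
user_alias_body() {
  printf '{"actions":[{"add":{"index":"%s","alias":"user_%s","filter":{"term":{"user":"%s"}},"routing":"%s"}}]}\n' \
    "$1" "$2" "$2" "$2"
}

# Dry run: print the body for user kimchy in the twitter index.
user_alias_body twitter kimchy
```

Sent with curl -XPOST 'http://localhost:9200/_aliases' -d "$(user_alias_body twitter kimchy)", this lets clients treat user_kimchy as if it were a dedicated index while the data stays in one shared index.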
how to index, defaults to ‘logstash-<date>’
- Combine with curator to tend time series data, clean up by date or size
- Alternatives: logstash-forwarder with lumberjack
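The daily index naming can be sketched as below, assuming the default logstash-YYYY.MM.dd pattern. The helper name logstash_index and the sample date are illustrative, and this is a manual stand-in for what curator automates, not curator itself.

```shell
# Turn an ISO date (YYYY-MM-DD) into the default daily index name,
# logstash-YYYY.MM.DD, by swapping the dashes in the date for dots.
logstash_index() {
  printf 'logstash-%s\n' "$(printf '%s' "$1" | sed 's/-/./g')"
}

# Dry run: print the index name for one day.
logstash_index 2014-06-01
```

Dropping one day of logs is then a single index deletion, for example curl -XDELETE "http://localhost:9200/$(logstash_index 2014-06-01)", which is much cheaper than deleting documents one by one.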
for Elasticsearch that was developed primarily to view Logstash event data
• Understands time series and compares time series
• Dashboards and charts with drill-down functionality
• Kibana 3 talks directly to ES, Kibana 4 (Beta) proxies