Elasticsearch for Centralized Logging, Fulltext Search, and NoSQL

Martin Smith
November 19, 2014

Transcript

  1. What is Elasticsearch?

    • search server based on Lucene
    • schemaless document store (JSON in, JSON out)
    • RESTful HTTP interface on :9200
    • full-text search engine
    • geographical search of many kinds
    • faceting engine
    • NoSQL database
    • developed in Java, Apache License 2.0
  2. Starting with concepts and terms, modeling data***

    Elasticsearch        Relational model
    Index**              Database
    Document type        Schema for a table
    Document             Row in table, like a hash table
    Field                Column
    Document ID / _id    Primary key
    Filter, Query        SQL Select
    Shard, Replica       Partitioned table, Replication?

    ** "Index" is also used as a verb: to index a document. This is equivalent to an INSERT OR UPDATE statement in an RDBMS.
    *** Strings, numerics, geographical coordinates, attachments, arrays, subdocuments, nested docs
  3. Recommended architecture for ES

    • Reserve memory, about 50% of available RAM (not all as JVM heap, either)
    • Pin pages in memory with mlockall to avoid swapping
    • Create separate nodes for roles like master, data, query (HTTP load balancing)
    • Tune settings for quorum & recovery (e.g. minimum master nodes)
    • Pay attention to document locality (route to same shard)
    • Monitor memory usage or face catastrophic failure
    • Secure access to the HTTP interface (and port :9300, the transport client!)
    • Scale by common patterns like per user or per time period vs
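The quorum tuning mentioned on this slide usually comes down to a majority calculation. The sketch below assumes the Elasticsearch 1.x setting `discovery.zen.minimum_master_nodes` and the usual majority rule of half the master-eligible nodes plus one; the function name is made up for illustration.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the smallest majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible // 2 + 1


# Print a value to put into discovery.zen.minimum_master_nodes
# for a few example cluster sizes.
for nodes in (1, 3, 5):
    print(nodes, "master-eligible nodes ->", minimum_master_nodes(nodes))
```

With three dedicated master nodes this gives 2, so a single isolated node cannot elect itself master during a network partition.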
  4. Starting with full text search

    • Analysis: breaking apart data to make it more searchable
    • Faceting: representing the data in more than one way (_source, exact, tokenized)
    • Two major types of search (usually combined for speed):
        • Filters (fast, cacheable) with boolean results - ["type": "string", "index": "not_analyzed"]
        • Queries (slow, not cacheable) with fuzzy scoring

    Example of the ‘snowball’ analyzer:

        GET /_analyze?analyzer=snowball&text=gator%20linux%20users%20group%20is%20awesome

    Output: gator, linux, user, group, awesom
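Combining the two search types might look like the sketch below. It builds a request body in the `filtered` query syntax of Elasticsearch 1.x, which was current when this talk was given; later releases use a `bool` query with a `filter` clause instead. The field names `message` and `user` are assumptions for illustration, and `user` is assumed to be mapped as `not_analyzed` so the term filter matches the exact value.

```python
import json


def filtered_search(text, user):
    """Body for a scored match query narrowed by an unscored term filter."""
    return {
        "query": {
            "filtered": {
                # Query part: analyzed, scored by relevance.
                "query": {"match": {"message": text}},
                # Filter part: yes/no per document, no scoring.
                "filter": {"term": {"user": user}},
            }
        }
    }


body = filtered_search("elasticsearch", "kimchy")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `_search` request; only the documents that pass the filter are scored by the query.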
  5. Example: Creating an index

        $ curl -XPUT 'http://localhost:9200/twitter/'

        $ curl -XPUT 'http://localhost:9200/twitter/' -d '
        index :
            number_of_shards : 3
            number_of_replicas : 2
        '

    (That’s right, YAML. This is also where mappings might go.)
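The same settings can also be written as JSON, and it helps to know how many shard copies they produce. The sketch below only builds the request body and does the arithmetic; the helper names are hypothetical, and the `settings` wrapper is the JSON form of the create-index body as I understand it.

```python
import json


def index_settings(shards: int, replicas: int) -> str:
    """JSON body for a create-index request with the given shard layout."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return json.dumps(body, indent=2)


def total_shard_copies(shards: int, replicas: int) -> int:
    """Each primary shard exists once plus once per replica."""
    return shards * (1 + replicas)


print(index_settings(3, 2))
print("shard copies in the cluster:", total_shard_copies(3, 2))
```

For the slide's values (3 shards, 2 replicas) the cluster holds 9 shard copies, and it needs at least 3 data nodes before every replica can be assigned, since a replica is never placed on the same node as its primary.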
  6. Example: Indexing a document (could be create or update)

        $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
            "user" : "kimchy",
            "post_date" : "2009-11-15T14:12:12",
            "message" : "trying out Elasticsearch"
        }'

    Output:

        {
            "_index" : "twitter",
            "_type" : "tweet",
            "_id" : "1",
            "_version" : 1,
            "created" : true
        }
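The curl call above translates fairly directly to Python's standard library. This is a minimal sketch that assumes a node on `localhost:9200`; the helper names are made up, and the `Content-Type` header is an addition that the curl example does not send. Building the request is kept separate from sending it, so the first part runs without a cluster.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:9200"  # assumption: a local single-node cluster


def index_request(index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running node)."""
    with urlopen(request) as response:
        return json.load(response)


tweet = {
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elasticsearch",
}
request = index_request("twitter", "tweet", 1, tweet)
print(request.get_method(), request.full_url)
```

Calling `send(request)` against a running node should return a reply shaped like the output on the slide.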
  7. Example: Fetch a document, delete is almost identical (-XDELETE)

        $ curl -XGET 'http://localhost:9200/twitter/tweet/1'

        {
            "_index" : "twitter",
            "_type" : "tweet",
            "_id" : "1",
            "_version" : 1,
            "found": true,
            "_source" : {
                "user" : "kimchy",
                "post_date" : "2009-11-15T14:12:12",
                "message" : "trying out Elasticsearch"
            }
        }
  8. Example: mapping

        {
          "product": {
            "properties": {
              "ProductId": { "type": "string", "index": "not_analyzed" },
              "ProductEnabled": { "type": "boolean" },
              "PiecesIncluded": { "type": "long" },
              "LastModified": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS" },
              "AvailableInventory": { "type": "float" },
              "Price": { "type": "float" },
              "LongDescription": { "type": "string", "include_in_all" : true },
              "ProductName" : {
                "type" : "multi_field",
                "include_in_all" : true,
                "fields" : {
                  "ProductName": { "type": "string", "index": "not_analyzed" },
                  "lowercase": { "type": "string", "analyzer": "lowercase_analyzer" },
                  "suggest" : { "type": "string", "analyzer": "suggest_analyzer" }
                }
              }
            }
          }
        }
  9. Example: Search with Query as query string

    For example, we can search on all documents across all types within the twitter index:

        $ curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy'

    We can also search within specific types:

        $ curl -XGET 'http://localhost:9200/twitter/tweet,user/_search?q=user:kimchy'

    We can also search all tweets with a certain tag across several indices (for example, when each user has their own index):

        $ curl -XGET 'http://localhost:9200/kimchy,elasticsearch/tweet/_search?q=tag:wow'

    Or we can search all tweets across all available indices using the _all placeholder:

        $ curl -XGET 'http://localhost:9200/_all/tweet/_search?q=tag:wow'

    Or even search across all indices and all types:

        $ curl -XGET 'http://localhost:9200/_search?q=tag:wow'

    www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html
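The URL patterns on this slide follow one rule: optional comma-separated indices, then optional comma-separated types, then `_search`. The helper below is a hypothetical sketch of that rule, not part of any client library, and it assumes a node on `localhost:9200`.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:9200"  # assumption: a local node


def search_url(query, indices=(), types=()):
    """Build a query-string search URL for the given indices and types."""
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # Types without an explicit index need the _all placeholder.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    path = "/".join(parts)
    # Keep ':' readable, as in the slide's field:value examples.
    return f"{BASE_URL}/{path}?q={quote(query, safe=':')}"


print(search_url("user:kimchy", indices=["twitter"]))
print(search_url("user:kimchy", indices=["twitter"], types=["tweet", "user"]))
print(search_url("tag:wow", indices=["kimchy", "elasticsearch"], types=["tweet"]))
print(search_url("tag:wow", types=["tweet"]))
print(search_url("tag:wow"))
```

The five printed URLs correspond, in order, to the five curl examples on the slide.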
  10. What else does it get used for?

    • NoSQL databases: if you can do without a few features
    • Logs and statistics: projects like Logstash and Kibana; GitHub uses it for exception tracking
    • Fast visualizations: really anything where you want realtime filtering
    • Make anything searchable…

    The ‘unique holy triangle,’ in a single product: “data exploration capabilities, unstructured search, structured search, and aggregations or analytics.”

    It can even accept a query first, and notify you of new search results once new documents are indexed.
  11. Living in the trenches with ES

    • Project maturity: there has been some criticism of the documentation’s ‘findability’, and new tools and libraries are still emerging every week
    • Schemas still matter: eventually, customers will want to define custom ‘mappings’, use custom routing to drive data to particular shards, and use aliases for those shards
    • Typical Java tuning: garbage collection, locking/threading, I/O, and algorithm issues
    • Make good choices: 1 index with 50 shards should perform the same as 50 indices with 1 shard, but the common patterns are index per user or index per time unit
  12. What is Logstash?

    • collect, process, and forward events (logs)
    • JSON-like configuration
    • Inputs > Codecs > Filters > Outputs
    • Inputs for files, sockets, syslog, irc, gelf, twitter, graphite, heroku, imap, jmx
    • Codecs for json, collectd, graphite, plain
    • Filters for dates, location, urldecode, mutate, grok, geoip
    • Outputs for Elasticsearch, MQs, Nagios, databases, and many, many more
  13. Most basic logstash config

        input { stdin { } }

        input {
          tcp {
            host => "localhost"
            port => 1234
          }
        }

        filter {
          grok {
            match => { "message" => "%{COMBINEDAPACHELOG}" }
          }
          date {
            match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
          }
        }

        output {
          elasticsearch { host => "192.168.10.1" }
          stdout { codec => rubydebug }
        }
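To see roughly what the grok and date filters do to one access-log line, here is a plain-Python sketch. The regular expression is a much simplified stand-in for `%{COMBINEDAPACHELOG}`, not the real grok pattern, and the sample line is invented for illustration. The `strptime` format mirrors the `dd/MMM/yyyy:HH:mm:ss Z` pattern in the config and assumes an English locale for the month abbreviation.

```python
import re
from datetime import datetime

# Simplified stand-in for part of %{COMBINEDAPACHELOG}; the real pattern
# also captures ident, auth, HTTP version, referrer and user agent.
LOG_LINE = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+)[^"]*" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Split a log line into named fields and parse its timestamp."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # Logstash would tag such an event as a grok parse failure
    event = match.groupdict()
    # Equivalent of the date filter: turn the text into a real timestamp.
    event["@timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return event


# Invented sample line in Apache access-log layout.
sample = '127.0.0.1 - - [19/Nov/2014:10:15:30 -0500] "GET /index.html HTTP/1.1" 200 2326'
event = parse_line(sample)
print(event["clientip"], event["verb"], event["request"], event["response"])
print(event["@timestamp"].isoformat())
```

Unlike this sketch, Logstash stores `@timestamp` in UTC; the parsed value here keeps the original offset.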
  14. Logstash magic

    - Elasticsearch output, transport proto.
    - Template for how to index, defaults to ‘logstash-<date>’
    - Combine with curator to tend time series data, clean up by date or size
    - Alternatives: logstash-forwarder with lumberjack
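The daily index naming and date-based cleanup can be sketched in a few lines. This assumes the default Logstash index pattern `logstash-%{+YYYY.MM.dd}` (for example `logstash-2014.11.19`). The function names are hypothetical, and the cleanup is only an illustration of the idea behind curator, not how curator is implemented.

```python
from datetime import date, datetime, timedelta

PREFIX = "logstash-"  # assumption: the default Logstash index prefix


def logstash_index(day: date) -> str:
    """Name of the daily index that events from `day` are written to."""
    return f"{PREFIX}{day:%Y.%m.%d}"


def expired_indices(names, today: date, keep_days: int):
    """Return the daily indices older than `keep_days`, skipping other names."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # not a Logstash index, leave it alone
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y.%m.%d").date()
        except ValueError:
            continue  # prefix matches but the suffix is not a date
        if day < cutoff:
            expired.append(name)
    return expired


today = date(2014, 11, 19)
existing = ["logstash-2014.11.01", "logstash-2014.11.18", "twitter"]
print(logstash_index(today))
print(expired_indices(existing, today, keep_days=7))
```

Because each day lives in its own index, dropping old data means deleting whole indices, which is much cheaper than deleting individual documents.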
  15. What is Kibana?

    • browser-based analytics and search interface for Elasticsearch that was developed primarily to view Logstash event data
    • Understands time series and compares time series
    • Dashboards and charts with drill-down functionality
    • Kibana 3 talks directly to ES, Kibana 4 (Beta) proxies

    Examples here, here, and here
  16. How can you learn more?

    • Try tutorials: see links below.
    • Plug it in: with projects like Drupal, Magento, WordPress
    • Make anything searchable… libraries for Python, Ruby, Java, Scala, Node...

    http://joelabrahamsson.com/elasticsearch-101/
    http://people.mozilla.org/~wkahngreene/elastic/index.html
    http://www.elasticsearch.org/guide/
    https://www.found.no/foundation/elasticsearch-from-the-bottom-up/
    https://www.youtube.com/watch?v=lWKEphKIG8U
    http://www.slideshare.net/aszegedi/everything-i-ever-learned-about-jvm-performance-tuning-twitter

    Questions?