Elasticsearch for Centralized Logging, Fulltext Search, and NoSQL

Martin Smith
November 19, 2014

Transcript

  1. Elasticsearch for Centralized
    Logging, Fulltext Search,
    and NoSQL

  2. What is Elasticsearch?
    • search server based on Lucene
    • schemaless document store (JSON in, JSON out)
    • RESTful HTTP interface on :9200
    • full-text search engine
    • geographical search of many kinds
    • faceting engine
    • NoSQL database
    • developed in Java, Apache License 2.0

  3. Starting with concepts and terms, modeling data***
    Elasticsearch          Relational model
    ---------------------  ---------------------------------
    Index**                Database
    Document type          Schema for a table
    Document               Row in table, like a hash table
    Field                  Column
    Document ID / _id      Primary key
    Filter, Query          SQL SELECT
    Shard, Replica         Partitioned table, Replication?
    ** index is also used as a verb, to index a document. This is equivalent to an INSERT OR UPDATE statement in an RDBMS.
    *** Strings, numerics, geographical coordinates, attachments, arrays, subdocuments, nested docs

  4. Recommended architecture for ES
    Reserve memory: give the JVM heap about 50% of available RAM (not all of it; the OS file cache needs the rest)
    Pin pages in memory with mlockall to avoid swapping
    Create separate nodes for roles like master, data, query (HTTP load balancing)
    Tune settings for quorum & recovery (e.g. minimum master nodes)
    Pay attention to document locality (route to same shard)
    Monitor memory usage or face catastrophic failure
    Secure access to the HTTP interface (and port :9300, used by the transport client!)
    Scale by common patterns like per user or per time period
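
The quorum setting mentioned above (minimum master nodes) is commonly sized as a strict majority of the master-eligible nodes. A minimal sketch of that arithmetic, assuming the usual majority rule; the function name is hypothetical and not part of any Elasticsearch API:

```python
def minimum_master_nodes(master_eligible):
    """Strict majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Show the quorum for a few small cluster sizes.
for nodes in (1, 2, 3, 5):
    print(nodes, "master-eligible ->", minimum_master_nodes(nodes))
```

Note that a two-node cluster needs both nodes for a majority, which is one reason an odd number of master-eligible nodes is the usual choice.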

  5. Starting with full text search
    • Analysis: Breaking apart data to make it more searchable
    • Faceting: representing the data in more than one way (_source, exact, tokenized)
    • Two major types of search (usually combined for speed):
      • Filters (fast, cacheable) with boolean results - ["type": "string", "index": "not_analyzed"]
      • Queries (slow, not cacheable) with fuzzy scoring
    Example of the ‘snowball’ analyzer:
    GET /_analyze?analyzer=snowball&text=gator%20linux%20users%20group%20is%20awesome
    Output: gator, linux, user, group, awesom
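
To make the analysis step concrete, here is a toy analyzer in Python. It is only a sketch: the stopword list and the suffix rules are invented for illustration and are not the real Snowball algorithm, although they happen to produce the same tokens for this one sentence.

```python
import re

# Tiny illustrative stopword list (an assumption, not Elasticsearch's list).
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def toy_stem(token):
    """Crude suffix stripping; NOT the real Snowball stemmer."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    if token.endswith("e") and len(token) > 4:
        token = token[:-1]
    return token


def toy_analyze(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


print(toy_analyze("gator linux users group is awesome"))
```

The point is that the index stores the stemmed tokens, so a search for "user" and a search for "users" both hit the same term.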

  6. Example: Creating an index
    $ curl -XPUT 'http://localhost:9200/twitter/'
    $ curl -XPUT 'http://localhost:9200/twitter/' -d '
    index :
        number_of_shards : 3
        number_of_replicas : 2
    '
    (That’s right, YAML. This is also where mappings might go.)
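
The same request can be prepared from Python using only the standard library. This sketch assumes the JSON form of the settings body and a node at localhost:9200; the helper name is hypothetical, and the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, shards, replicas):
    """Build (but do not send) a PUT request that creates an index."""
    settings = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_create_index_request("http://localhost:9200", "twitter", 3, 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it (needs a running node): urllib.request.urlopen(req)
```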

  7. Example: Indexing a document (could be create or update)
    $ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
      "user" : "kimchy",
      "post_date" : "2009-11-15T14:12:12",
      "message" : "trying out Elasticsearch"
    }'
    Output:
    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "created" : true
    }
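
The "create or update" behaviour can be illustrated with a small in-memory stand-in. This is a toy model, not Elasticsearch code: the ToyIndex class is hypothetical and only mimics how the response fields `_version` and `created` change when the same ID is indexed twice.

```python
class ToyIndex:
    """In-memory stand-in for one index; illustrates create-or-update only."""

    def __init__(self, name):
        self.name = name
        self._docs = {}  # (doc_type, doc_id) -> (version, source)

    def index(self, doc_type, doc_id, source):
        key = (doc_type, doc_id)
        previous = self._docs.get(key)
        # A new ID starts at version 1; re-indexing replaces the document
        # and bumps the version.
        version = 1 if previous is None else previous[0] + 1
        self._docs[key] = (version, dict(source))
        return {
            "_index": self.name,
            "_type": doc_type,
            "_id": doc_id,
            "_version": version,
            "created": previous is None,
        }


twitter = ToyIndex("twitter")
first = twitter.index("tweet", "1", {"user": "kimchy", "message": "trying out Elasticsearch"})
second = twitter.index("tweet", "1", {"user": "kimchy", "message": "edited"})
print(first["_version"], first["created"])
print(second["_version"], second["created"])
```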

  8. Example: Fetch a document, delete is almost identical (-XDELETE)
    $ curl -XGET 'http://localhost:9200/twitter/tweet/1'
    {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_version" : 1,
      "found": true,
      "_source" : {
        "user" : "kimchy",
        "post_date" : "2009-11-15T14:12:12",
        "message" : "trying out Elasticsearch"
      }
    }

  9. Example: A mapping
    {
      "product": {
        "properties": {
          "ProductId": { "type": "string", "index": "not_analyzed" },
          "ProductEnabled": { "type": "boolean" },
          "PiecesIncluded": { "type": "long" },
          "LastModified": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS" },
          "AvailableInventory": { "type": "float" },
          "Price": { "type": "float" },
          "LongDescription": { "type": "string", "include_in_all": true },
          "ProductName": {
            "type": "multi_field",
            "include_in_all": true,
            "fields": {
              "ProductName": { "type": "string", "index": "not_analyzed" },
              "lowercase": { "type": "string", "analyzer": "lowercase_analyzer" },
              "suggest": { "type": "string", "analyzer": "suggest_analyzer" }
            }
          }
        }
      }
    }

  10. Example: Search with Query as query string
    For example, we can search on all documents across all types within the twitter index:
    $ curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy'
    We can also search within specific types:
    $ curl -XGET 'http://localhost:9200/twitter/tweet,user/_search?q=user:kimchy'
    We can also search all tweets with a certain tag across several indices (for example, when each user has
    their own index):
    $ curl -XGET 'http://localhost:9200/kimchy,elasticsearch/tweet/_search?q=tag:wow'
    Or we can search all tweets across all available indices using the _all placeholder:
    $ curl -XGET 'http://localhost:9200/_all/tweet/_search?q=tag:wow'
    Or even search across all indices and all types:
    $ curl -XGET 'http://localhost:9200/_search?q=tag:wow'
    www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html
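
The URL patterns above follow one rule: optional comma-separated indices, optional comma-separated types, then _search. A small Python sketch of that rule; the search_url helper is hypothetical and only builds the URL string, it does not call Elasticsearch.

```python
from urllib.parse import quote, urlencode


def search_url(base_url, query, indices=(), types=()):
    """Build a query-string search URL for the given indices and types."""
    parts = []
    if indices or types:
        # Types without explicit indices need the _all placeholder.
        parts.append(",".join(indices) if indices else "_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    path = "/".join(quote(p, safe=",_-*") for p in parts)
    query_string = urlencode({"q": query}, safe=":")
    return f"{base_url}/{path}?{query_string}"


base = "http://localhost:9200"
print(search_url(base, "user:kimchy", indices=["twitter"]))
print(search_url(base, "user:kimchy", indices=["twitter"], types=["tweet", "user"]))
print(search_url(base, "tag:wow", indices=["kimchy", "elasticsearch"], types=["tweet"]))
print(search_url(base, "tag:wow", types=["tweet"]))
print(search_url(base, "tag:wow"))
```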

  11. What else does it get used for?
    • NoSQL databases: if you can do without a few features
    • Logs and Statistics: projects like Logstash and Kibana; GitHub uses it for
    exception tracking
    • Fast Visualizations: really anything where you want realtime filtering
    • make anything searchable…
    The ‘unique holy triangle,’ in a single product: “data exploration capabilities,
    unstructured search, structured search, and aggregations or analytics.” It can
    even accept a query first, and notify you of new search results once new
    documents are indexed.
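
The "query first, documents later" idea (the percolator in Elasticsearch terms) can be sketched in a few lines of Python. This is a toy model under simple assumptions: queries are single-term matches on one field, and the ToyPercolator class is hypothetical, not an Elasticsearch API.

```python
class ToyPercolator:
    """Stores queries up front, then reports which ones a new document matches."""

    def __init__(self):
        self._queries = {}  # query id -> (field, lowercased term)

    def register(self, query_id, field, term):
        self._queries[query_id] = (field, term.lower())

    def percolate(self, document):
        matches = []
        for query_id, (field, term) in self._queries.items():
            words = str(document.get(field, "")).lower().split()
            if term in words:
                matches.append(query_id)
        return sorted(matches)


percolator = ToyPercolator()
percolator.register("es-mentions", "message", "elasticsearch")
percolator.register("kibana-mentions", "message", "kibana")

tweet = {"user": "kimchy", "message": "trying out Elasticsearch"}
print(percolator.percolate(tweet))
```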

  12. Living in the trenches with ES
    • Project maturity: there has been some criticism of the
    documentation’s ‘findability’ and new tools and libraries are still
    emerging every week
    • Schemas still matter: eventually, customers will want custom
    ‘mappings’, custom routing to drive data to particular shards,
    and aliases for those shards
    • Typical Java tuning: garbage collection, locking/threading, I/O, and
    algorithm issues
    • Make good choices: 1 index with 50 shards should perform the same
    as 50 indices with 1 shard, but the common patterns are index per
    user or index per time unit.
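
Custom routing works because the target shard is derived from a hash of the routing value modulo the number of primary shards. A minimal sketch of that idea: zlib.crc32 is used here as a stand-in hash (an assumption; Elasticsearch uses its own hash function), so the shard numbers will not match a real cluster.

```python
import zlib


def pick_shard(routing_key, number_of_shards):
    """Map a routing value to a shard number: hash(routing) % shard count."""
    # crc32 is only a stand-in for Elasticsearch's internal hash.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


# Routing every tweet by user name keeps one user's documents together.
tweets = [("kimchy", "tweet-1"), ("kimchy", "tweet-2"), ("martin", "tweet-3")]
for user, tweet_id in tweets:
    print(tweet_id, "routed by", user, "-> shard", pick_shard(user, 3))
```

Because the shard count is part of the formula, changing the number of primary shards later would send the same key to a different shard, which is why the shard count is chosen up front.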

  13. What is Logstash?
    • collect, process, and forward events (logs)
    • JSON-like configuration
    • Inputs > Codecs > Filters > Outputs
    • Inputs for files, sockets, syslog, irc, gelf,
    twitter, graphite, heroku, imap, jmx
    • Codecs for json, collectd, graphite, plain
    • Filters for dates, location, urldecode,
    mutate, grok, geoip
    • Outputs for Elasticsearch, MQs, Nagios,
    Databases, and many, many more

  14. Most basic logstash config
    input { stdin { } }
    input {
      tcp {
        host => "localhost"
        port => 1234
      }
    }
    filter {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
      date {
        match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
      }
    }
    output {
      elasticsearch { host => "192.168.10.1" }
      stdout { codec => rubydebug }
    }
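
What the grok and date filters do to one Apache access-log line can be approximated in plain Python. This is a rough sketch, not Logstash itself: the regular expression is a simplified stand-in for the COMBINEDAPACHELOG pattern, the field names are assumed to mirror the ones grok emits, and the sample line is made up.

```python
import re
from datetime import datetime

# Simplified stand-in for grok's COMBINEDAPACHELOG pattern.
COMBINED = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Split a log line into named fields and parse its timestamp."""
    match = COMBINED.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Python equivalent of the date filter pattern dd/MMM/yyyy:HH:mm:ss Z
    event["@timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return event


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
event = parse_line(sample)
print(event["clientip"], event["verb"], event["request"], event["response"])
print(event["@timestamp"].isoformat())
```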

  15. Logstash magic
    - Elasticsearch output, transport protocol
    - Template for how to index; index name defaults to
    ‘logstash-’
    - Combine with Curator to tend time
    series data, clean up by date or size
    - Alternatives:
    logstash-forwarder with lumberjack
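
The index-per-day pattern and date-based cleanup can be sketched as follows. This assumes daily index names of the form logstash-YYYY.MM.dd; both helper functions are hypothetical illustrations of the idea and are not Curator's API.

```python
from datetime import date, timedelta


def daily_index(day, prefix="logstash-"):
    """Name of the index that holds events for the given day."""
    return f"{prefix}{day:%Y.%m.%d}"


def expired_indices(index_names, today, keep_days, prefix="logstash-"):
    """Return the daily indices whose date is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not a time-series index, leave it alone
        try:
            year, month, day = (int(p) for p in name[len(prefix):].split("."))
            index_day = date(year, month, day)
        except ValueError:
            continue  # suffix is not a date, skip it
        if index_day < cutoff:
            expired.append(name)
    return expired


today = date(2014, 11, 19)
names = ["logstash-2014.11.01", "logstash-2014.11.18", daily_index(today), "twitter"]
print(daily_index(today))
print(expired_indices(names, today, keep_days=7))
```

Dropping a whole daily index is much cheaper than deleting individual old documents, which is the main appeal of this naming scheme.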

  16. What is Kibana?
    • browser based analytics and search interface for
    Elasticsearch that was developed primarily to view
    Logstash event data
    • Understands and compares time series
    • Dashboards and charts with drill-down functionality
    • Kibana 3 talks directly to ES, Kibana 4 (Beta) proxies
    Examples here, here, and here

  17. Live demo!!

  18. How can you learn more?
    • Try tutorials: see links below.
    • Plug it in: with projects like Drupal, Magento, WordPress
    • make anything searchable… libraries for Python, Ruby, Java, Scala, Node...
    http://joelabrahamsson.com/elasticsearch-101/
    http://people.mozilla.org/~wkahngreene/elastic/index.html
    http://www.elasticsearch.org/guide/
    https://www.found.no/foundation/elasticsearch-from-the-bottom-up/
    https://www.youtube.com/watch?v=lWKEphKIG8U
    http://www.slideshare.net/aszegedi/everything-i-ever-learned-about-jvm-performance-tuning-twitter
    Questions?
