Elasticsearch: You Know, for s/Search/Operations/

Elasticsearch is a popular search and analytics engine. However, it can also serve as a powerful tool for operations teams, providing easy application monitoring, log collection, and self-serve dashboarding and analysis tools. In this presentation we'll cover some of these use cases and how operations can provide the most reliable and performant service for stakeholders.

Tyler L

May 08, 2015

Transcript

  1. elasticsearch: you know, for
    s/search/operations/
    OpenWest 2015
    Tyler Langlois

  2. www.elastic.co Copyright Elastic 2015 Copying, publishing and/or
    distributing without written permission is strictly prohibited
    Speaker Bio
    ● Infrastructure Engineering @ Elastic
    ○ Previous: Qualtrics, Sandia National Laboratories, Blue Coat Systems, BYU
    ● Background in systems, security, *nix,
    smattering of different coding experience
    (scripting, web dev, devops)
    ● Happy as long as I’m automating things
    in a terminal
    ● Permanent mental bindings for vim and zsh
    leothrix
    tylerjl
    tjll.net
    Introduction

  3. What We’ll Cover (Introduction)
    ● Prelude: Wat is this?
    ● Why is this useful for Ops?
    ● How?
    ○ Architecture (hardware, net, etc.)
    ○ Security (subnets, REST, etc.)
    ○ Data in/Data out
    ● What Could Go Wrong?
    ● Q&A

  4. Elasticsearch in a nutshell (Prelude: What?)
    ● Scalable, fault-tolerant search and analytics engine
    ● Ideal for search, fits other cases excellently as well
    ● Open source, fast-moving, broad ecosystem
    ○ kopf, paramedic, marvel, bigdesk, client libraries, etc.
    ○ Neat JS apps that run in browser and operate locally
    ● Has given rise to the ELK stack:
    ○ elasticsearch for storage
    ○ logstash for log/event processing
    ○ kibana for visualization

  5. What sets ES apart (Prelude: What?)
    What sets ES apart as a search platform?
    ● Free as in beer and in speech
    ● Paired with logstash, nearly infinite inputs and outputs (and dead easy to extend)
    ● Some nice ES-specific features
    ○ geo mapping, percolator, tribe, etc.
    ● Flurry of developer interest, lots of tutorials/use cases circulating
    ● Aside: Different data processing paradigm (pre- vs. post-) & tradeoffs
    (reminder: ask me about this at the end if you’re interested)

  6. Example Use Cases (Why?)
    ● ELK log analytics on the cheap (widely adopted, lots of development)
    ○ Generic logs from OS-level processes (/var/log/)
    ○ Application logs sent through message broker or other protocol
    ○ Hardware logs sent to syslog listeners
    ● Myriad secondary use cases
    ○ Network analytics
    ○ Alerting - percolator
    ○ SIEM - pipe snort events, etc. into elasticsearch
    ○ srsly big data - can scale out to multiple clusters with tribe nodes

  7. Operational Benefits (Why?)
    ● Kibana lets users query directly without bothering ops, who we all know are already angry enough at computers
    ● Unstructured/schema-less documents (paired with type mappings) mean you can be even more hands-off about data ingress
    ● Less friction between dev and ops = happiness
    ● No $/byte charges means the power to log everything, forever
    ● Data lifecycle can be highly customized for graceful retirement & retention
    ● Native clustering and elasticity mean scaling is dead easy
    ● Ops eye candy: a look at kopf, bigdesk, and paramedic
    ○ https://github.com/tylerjl/vagrant-elk-box

  8. Architecture (How?)
    ● ES works equally well on PaaS or on-premise
    ● PaaS/Cloud
    ○ Remember: discovery.zen.ping.multicast.enabled: false
    ○ If on EC2, can use the EC2 plugin for host discovery
    ○ Use application- or OS-level RAID for a speed boost
    ○ Don’t leave it open (CVE-2014-3120)
    ● On-premise
    ○ Good network throughput, fast disks, cores, 30GB RAM
    ○ Be aware of multicast
    ● Both:
    ○ Size appropriately (RAM, disk, cores)
    ○ Secure appropriately
    ○ Design appropriately
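
The discovery advice on the Architecture slide can be sketched as a small settings renderer. This is a minimal sketch, not an official tool: the host addresses are hypothetical, and it assumes the ES 1.x-era zen discovery settings, with `discovery.zen.ping.unicast.hosts` as the unicast counterpart to the multicast flag named on the slide.

```python
import json


def render_settings(settings):
    """Render a flat dict as elasticsearch.yml-style "key: value" lines."""
    # JSON scalars and lists are also valid YAML flow values,
    # so json.dumps gives us `false` and ["a", "b"] for free.
    return "\n".join(
        f"{key}: {json.dumps(value)}" for key, value in settings.items()
    )


# Hypothetical node addresses; replace with the real cluster members.
settings = {
    "discovery.zen.ping.multicast.enabled": False,
    "discovery.zen.ping.unicast.hosts": ["10.0.0.11", "10.0.0.12"],
}

print(render_settings(settings))
```

Listing the nodes explicitly is one way to keep a cloud cluster from depending on multicast, which many cloud networks do not carry.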

  9. Other Operational Considerations (How?)
    ● Shield
    ○ Commercial plugin (i.e. comes with a support plan)
    ○ Pretty thoroughly vetted (pentested, been through a few releases)
    ○ Encryption throughout, RBAC, etc.
    ● Otherwise…
    ○ Isolated subnet (avoid random joins)
    ○ Put it behind a proxy to catch actions (nginx?)
    ○ Be aware of non-encrypted traffic/node chatter
    ○ Get security requirements up-front so you can design indices/types appropriately
    ○ Understand that ES does not provide access controls by default
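
The "behind a proxy" idea can be illustrated with the kind of decision such a proxy would make per request. This is a minimal sketch of a hypothetical policy, not an nginx configuration or a Shield feature: it allows reads plus POSTs to a few search endpoints and rejects everything else. It only looks at the HTTP method and path, so it does nothing about what a search body itself contains.

```python
READ_ONLY_METHODS = {"GET", "HEAD"}
# Endpoints that clients commonly call with POST even though they only read.
SEARCH_ENDPOINTS = ("_search", "_msearch", "_count")


def is_allowed(method, path):
    """Return True if a request should be forwarded to Elasticsearch."""
    method = method.upper()
    # Drop the query string and trailing slash, then take the last segment.
    last_segment = path.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]
    if method in READ_ONLY_METHODS:
        return True
    if method == "POST" and last_segment in SEARCH_ENDPOINTS:
        return True
    # Anything else (PUT, DELETE, other POSTs) could modify the cluster.
    return False


for method, path in [
    ("GET", "/logstash-2015.05.08/_search"),
    ("POST", "/logstash-*/_search?pretty"),
    ("DELETE", "/logstash-2015.05.08"),
]:
    print(method, path, "->", "allow" if is_allowed(method, path) else "deny")
```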

  10. Data in/Data out - Intro (How?)
    ● The big question, spend time designing here:
    ● Sources
    ○ Application? Filesystem? Hardware devices?
    ● Transit
    ○ Open internet? Local network? Cloud?
    ● Storage/Retrieval
    ○ Access controls? Kibana or something else? What kind of latency/data expiry?

  11. Data in/Data out - Sources (How?)
    ● ES will guess at datatypes and will do pretty well (schemaless*...)
    ● How about custom mappings?
    ○ Dynamic mapping - i.e., tell ES to store every int_* field as integer, etc.
    ○ Reindex!
    ● Log buffering/HA
    ○ Fluentd: use file buffers to avoid loss
    ○ Logstash: pull from a queue while FS buffering is in development
    ○ Both: rely on an external source for queuing; you don’t want ruby being a buffer
    ● Data formats
    ○ Use native JSON when possible to simplify life (parsing eats CPU)
    ○ Grok makes parsing easier
    ○ For common formats (syslog, S3 access logs) there’s community stuff available
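
The "store every int_* field as integer" bullet corresponds to a dynamic template. The following is a minimal sketch that only builds the request body; it assumes the ES 1.x-era mapping layout with a `_default_` type (later versions dropped mapping types), and the template name `int_fields` is arbitrary. The `fnmatchcase` check is only a local approximation of how the `match` pattern selects field names.

```python
import json
from fnmatch import fnmatchcase

# One dynamic template: any new field named int_* is mapped as an integer.
INT_TEMPLATE = {
    "int_fields": {  # arbitrary template name
        "match": "int_*",
        "mapping": {"type": "integer"},
    }
}


def mapping_with_templates(doc_type, templates):
    """Build an index-creation body carrying the given dynamic templates."""
    return {"mappings": {doc_type: {"dynamic_templates": templates}}}


def would_match(field_name):
    """Approximate the template's wildcard match for a field name."""
    return fnmatchcase(field_name, INT_TEMPLATE["int_fields"]["match"])


body = mapping_with_templates("_default_", [INT_TEMPLATE])
print(json.dumps(body, indent=2))
print(would_match("int_retries"), would_match("retries"))
```

Fields that were already indexed with a guessed type keep that type, which is why the slide pairs custom mappings with "Reindex!".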

  12. Data in/Data out - Transit (How?)
    ● Open internet? SSL
    ○ Fluentd and logstash have this
    ○ Use some HA designs to avoid loss (e.g. archive all to S3, define multiple log endpoints)
    ● Enrich the data!
    ○ GeoIP, timestamp parsing, tagging, etc.
    ● Log passing
    ○ For most needs, just use native input/output plugins
    ○ Possible to use native fluentd/lumberjack protocols
    ○ For native application calls? Either route stdout to log files or use message broker
    ● Avoid memory buffering, keep data safe!
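
For the "route stdout to log files" path, an application can emit one JSON object per line so the shipper does not have to parse free text. This is a minimal sketch using only the standard library; the field names (`@timestamp`, `level`, `logger`, `message`) and the logger name `checkout` are assumptions chosen for illustration, not a required schema.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON document."""

    def format(self, record):
        event = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(event)


def build_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


logger = build_logger("checkout")  # hypothetical application name
logger.info("order placed for user %s", "u-123")
```

A shipper can then read these lines with a JSON codec instead of a grok pattern, which is the CPU saving the Sources slide alludes to.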

  13. Data in/Data out - Storage/Retrieval (How?)
    ● Kibana
    ○ Either talk to local ES node or remote (local is nice for LB, but isn’t free)
    ○ Basic auth if needed (K4 passthrough)
    ● Beware cluster-killers
    ○ Huge time span facets/aggregations on analyzed fields
    ○ Way too much resident data for cluster size
    ○ Field lists that grow out of control (personal gripe)
    ● Devs will find new and creative ways to break it
    (don’t shoot yourself in the foot)
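
One way to keep an eye on "field lists that grow out of control" is to count the fields in a mapping and compare against a budget. This is a minimal sketch that works on a mapping already loaded as a dict (for example, the parsed response of a get-mapping call); the sample mapping and field names are made up, and multi-fields are not counted.

```python
def count_fields(mapping):
    """Count leaf fields under "properties", descending into nested objects."""
    total = 0
    for definition in mapping.get("properties", {}).values():
        if "properties" in definition:
            # Object field: count its children instead of the object itself.
            total += count_fields(definition)
        else:
            total += 1
    return total


# Hypothetical mapping for one document type.
sample_mapping = {
    "properties": {
        "message": {"type": "string"},
        "user": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            }
        },
    }
}

print(count_fields(sample_mapping))
```

Running a check like this on a schedule makes it easier to notice when an application starts using values (user IDs, timestamps) as field names.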

  14. What Could Go Wrong?
    Or, Preparing For the Worst: An Ops Tale

  15. Unresponsiveness (What Could Go Wrong?)
    ● How to fix
    ○ See following slides on OOM
    ○ Decrease shard number - either change defaults or expire data
    ○ Get some RAID going on, either hardware or application
    ○ ES analytics (bigdesk, hot threads, caches)
    Taking time to tweak usage patterns and data schemas will go a long way. Use doc_values, dynamic mappings.
    Most often OOM, which takes us to...
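
The two levers in "decrease shard number" (change defaults or expire data) are easy to compare with a little arithmetic. This is a minimal sketch assuming one index per day and the ES 1.x defaults of 5 primary shards and 1 replica; the retention figures are hypothetical.

```python
def total_shards(days_retained, indices_per_day=1, primaries=5, replicas=1):
    """Shards the cluster must hold for time-based indices."""
    copies = 1 + replicas  # each primary plus its replicas
    return days_retained * indices_per_day * primaries * copies


# 90 days of daily indices with the defaults.
print(total_shards(90))
# Expire data: keep 30 days instead.
print(total_shards(30))
# Change defaults: 90 days, but one primary per index.
print(total_shards(90, primaries=1))
```

Every shard carries its own overhead on the nodes, so either lever reduces the load that contributes to unresponsive nodes.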

  16. OOM (What Could Go Wrong?)
    ● How to tell
    ○ Unresponsive nodes, slow queries
    ○ Tail the logs and watch it happen
    ● How to fix
    ○ ES_HEAP_SIZE to 50% of RAM, max 30GB
    ○ Make intelligent use of units of scale (shards, indices, etc.)
    ○ Spend a day reading the guide and tune usage patterns (doc_values, analyzed versus non, decrease field count, etc.)
    ○ Best practices will do a lot, scale out if there’s not much else to optimize
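
The heap rule on the OOM slide (50% of RAM, max 30GB) works out as follows. This is a minimal sketch: the function names are made up, and it rounds down to whole gigabytes for simplicity.

```python
def recommended_heap_gb(total_ram_gb, fraction=0.5, cap_gb=30):
    """Half of RAM, rounded down to whole GB, never more than the cap."""
    return min(int(total_ram_gb * fraction), cap_gb)


def es_heap_size(total_ram_gb):
    """Value suitable for the ES_HEAP_SIZE environment variable."""
    return f"{recommended_heap_gb(total_ram_gb)}g"


for ram_gb in (16, 64, 128):
    print(f"{ram_gb}GB RAM -> ES_HEAP_SIZE={es_heap_size(ram_gb)}")
```

Above roughly 60GB of RAM the cap takes over, so bigger machines leave a larger share of memory to the operating system's filesystem cache.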

  17. I/O (What Could Go Wrong?)
    ● How to tell
    ○ CPU iowait times
    ● How to fix
    ○ Keep RAM split 50/50 between ES heap and the FS cache that lucene relies on
    ○ RAID!
    ■ Either hardware or application-level
    ■ Gets you a cheap stripe, though SSDs will be easier
    ○ Scale out for parallelized reads
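
On Linux, the iowait figure mentioned on the I/O slide can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. This is a minimal sketch that works on the line text so it can be run anywhere; the sample values are invented, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(line):
    """Return the first eight counters from a /proc/stat "cpu" line."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user, nice, system, idle, iowait, irq, softirq, steal
    return [int(value) for value in fields[1:9]]


def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0


# Invented samples taken some seconds apart.
first = "cpu  100 0 50 800 50 0 0 0 0 0"
second = "cpu  150 0 70 1000 180 0 0 0 0 0"
print(f"iowait: {iowait_fraction(first, second):.1%}")
```

On a live host the two lines would come from reading `/proc/stat` twice with a pause in between; a persistently high share suggests the disks, not the CPU, are the bottleneck.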

  18. Disk Space (eventually) (What Could Go Wrong?)
    ● How to tell
    ○ Full disks?
    ○ When Elasticsearch stops allocating shards to full nodes
    ● How to fix
    ○ Snapshot indices to S3 and delete
    ○ Good workflow:
    ■ Optimize rotated indices -> close -> snapshot -> delete
    ○ ES is space-aware and will try to keep a cluster balanced space-wise
    ○ Alternatively, just scale out
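
Before any snapshot-and-delete workflow can run, something has to decide which rotated indices are old enough to retire. This is a minimal sketch of that selection step only, assuming the default Logstash daily naming (`logstash-YYYY.MM.dd`); the retention period and index list are hypothetical, and the snapshot and delete calls themselves are left out.

```python
from datetime import date, datetime, timedelta

PREFIX = "logstash-"


def index_date(name):
    """Parse the day out of a logstash-YYYY.MM.dd index name, else None."""
    if not name.startswith(PREFIX):
        return None
    try:
        return datetime.strptime(name[len(PREFIX):], "%Y.%m.%d").date()
    except ValueError:
        return None


def indices_to_retire(names, today, keep_days):
    """Return daily indices older than the retention window, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        day = index_date(name)
        if day is not None and day < cutoff:
            expired.append(name)
    return sorted(expired)


names = [
    "logstash-2015.05.08",
    "logstash-2015.04.01",
    "kibana-int",
    "logstash-2015.03.15",
]
print(indices_to_retire(names, today=date(2015, 5, 8), keep_days=30))
```

Indices that do not follow the daily naming are ignored rather than guessed at, so dashboards and other non-log indices are never selected.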

  19. Q&A
    Questions?

  20. Additional Resources (Information)
    ● Elasticsearch documentation
    ○ www.elastic.co/guide
    ○ Elasticsearch - The Definitive Guide - for in-depth learning
    ○ Official documentation, API docs, etc.
    ○ Client library docs (javascript, ruby, python, java, php)
    ● Get involved in the ES community
    ○ www.elastic.co/community/meetups
    ○ SLC Meetup!
    ● Give feedback at: https://joind.in/talk/view/14000