Slide 1

Slide 1 text

elasticsearch: you know, for s/search/operations/
OpenWest 2015
Tyler Langlois

Slide 2

Slide 2 text

www.elastic.co
Copyright Elastic 2015. Copying, publishing and/or distributing without written permission is strictly prohibited.

Introduction
Speaker Bio
● Infrastructure Engineering @ Elastic
  ○ Previous: Qualtrics, Sandia National Laboratories, Blue Coat Systems, BYU
● Background in systems, security, *nix, and a smattering of coding experience (scripting, web dev, devops)
● Happy as long as I’m automating things in a terminal
● Permanent mental bindings for vim and zsh
leothrix tylerjl tjll.net

Slide 3

Slide 3 text

Introduction
What We’ll Cover
● Prelude: Wat is this?
● Why is this useful for Ops?
● How?
  ○ Architecture (hardware, net, etc.)
  ○ Security (subnets, REST, etc.)
  ○ Data in/Data out
● What Could Go Wrong?
● Q&A

Slide 4

Slide 4 text

Prelude: What?
Elasticsearch in a nutshell
● Scalable, fault-tolerant search and analytics engine
● Ideal for search, fits other cases excellently as well
● Open source, fast-moving, broad ecosystem
  ○ kopf, paramedic, marvel, bigdesk, client libraries, etc.
  ○ Neat JS apps that run in browser and operate locally
● Has given rise to the ELK stack:
  ○ elasticsearch for storage
  ○ logstash for log/event processing
  ○ kibana for visualization

Slide 5

Slide 5 text

Prelude: What?
What sets ES apart
What sets ES apart as a search platform?
● Free as in beer and in speech
● Paired with logstash, nearly infinite inputs and outputs (and dead easy to extend)
● Some nice ES-specific features
  ○ geo mapping, percolator, tribe, etc.
● Flurry of developer interest, lots of tutorials/use cases circulating
● Aside: Different data processing paradigm (pre- vs. post-) & tradeoffs (reminder: ask me about this at the end if you’re interested)

Slide 6

Slide 6 text

Why?
Example Use Cases
● ELK log analytics on the cheap (widely adopted, lots of development)
  ○ Generic logs from OS-level processes (/var/log/)
  ○ Application logs sent through message broker or other protocol
  ○ Hardware logs sent to syslog listeners
● Myriads of secondary use cases
  ○ Network analytics
  ○ Alerting - percolator
  ○ SIEM - pipe snort events, etc. into elasticsearch
  ○ srsly big data - can scale out to multiple clusters with tribe nodes

Slide 7

Slide 7 text

Why?
Operational Benefits
● Kibana gives users the power to query directly without bothering ops, who we all know are already angry enough at computers
● Unstructured/schema-less documents (paired with type mappings) mean you can be fairly hands-off about data ingress
● Less friction between dev and ops = happiness
● No $/byte charges means power to log everything, forever
● Data lifecycle can be highly customized for graceful retirement & retention
● Native clustering and elasticity mean scaling is dead easy
● Ops eye candy: a look at kopf, bigdesk, and paramedic
  ○ https://github.com/tylerjl/vagrant-elk-box

Slide 8

Slide 8 text

How?
Architecture
● ES works equally well on PaaS or on-premise
● PaaS/Cloud
  ○ Remember: discovery.zen.ping.multicast.enabled: false
  ○ If on EC2, can use the EC2 plugin for host discovery
  ○ Use application- or OS-level RAID for a speed boost
  ○ Don’t leave it open (CVE-2014-3120)
● On-premise
  ○ Good network throughput, fast disks, cores, 30GB RAM
  ○ Be aware of multicast
● Both:
  ○ Size appropriately (RAM, disk, cores)
  ○ Secure appropriately
  ○ Design appropriately
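
The multicast reminder above comes down to a couple of lines in elasticsearch.yml. A minimal sketch that renders them, assuming Elasticsearch 1.x zen discovery setting names; the function name `render_discovery_settings` and the host addresses are hypothetical, not from the talk:

```python
def render_discovery_settings(unicast_hosts):
    """Return elasticsearch.yml lines that turn off multicast discovery.

    Assumes Elasticsearch 1.x zen discovery setting names. The hosts
    passed in are placeholders for the nodes a new node should contact.
    """
    quoted = ", ".join('"{}"'.format(host) for host in unicast_hosts)
    return "\n".join([
        # Stop the node from broadcasting for (and joining) random clusters.
        "discovery.zen.ping.multicast.enabled: false",
        # List known peers explicitly instead.
        "discovery.zen.ping.unicast.hosts: [{}]".format(quoted),
    ])
```

Calling `render_discovery_settings(["10.0.0.1:9300", "10.0.0.2:9300"])` yields two lines that could be appended to a node's configuration by whatever provisioning tool is already in use.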

Slide 9

Slide 9 text

How?
Other Operational Considerations
● Shield
  ○ Commercial plugin (i.e. comes with a support plan)
  ○ Pretty thoroughly vetted (pentested, been through a few releases)
  ○ Encryption throughout, RBAC, etc.
● Otherwise…
  ○ Isolated subnet (avoid random joins)
  ○ Sit behind proxy to catch actions (nginx?)
  ○ Be aware of non-encrypted traffic/node chatter
  ○ Get security req’s up-front so you can design indices/types appropriately
  ○ Understand ES does not provide for access controls by default

Slide 10

Slide 10 text

How?
Data in/Data out - Intro
● The big question, spend time designing here:
● Sources
  ○ Application? Filesystem? Hardware devices?
● Transit
  ○ Open internet? Local network? Cloud?
● Storage/Retrieval
  ○ Access controls? Kibana or something else? What kind of latency/data expiry?

Slide 11

Slide 11 text

How?
Data in/Data out - Sources
● ES will guess at datatypes and will do pretty well (schemaless*...)
● How about custom mappings?
  ○ Dynamic mapping - e.g., tell ES to store every int_* field as integer, etc.
  ○ Reindex!
● Log buffering/HA
  ○ Fluentd: use file buffers to avoid loss
  ○ Logstash: pull from queue while FS buffering in dev
  ○ Both: rely on an external source for queuing, don’t want ruby being a buffer
● Data formats
  ○ Use native JSON when possible to simplify life (parsing eats CPU)
  ○ Grok makes this easier
  ○ For common formats (syslog, S3 access logs) there’s community stuff available
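
The int_* example above can be written as a dynamic template. A minimal sketch of the request body, assuming Elasticsearch 1.x-style mappings keyed by type name (with `_default_` applying to all types); the template name `int_fields` and the function name are made up for illustration:

```python
import json


def int_prefix_dynamic_template(type_name="_default_"):
    """Build a mapping body that stores every int_* field as an integer."""
    return {
        "mappings": {
            type_name: {
                "dynamic_templates": [
                    {
                        "int_fields": {  # hypothetical template name
                            # Match on field name, not on detected JSON type.
                            "match": "int_*",
                            "mapping": {"type": "integer"},
                        }
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(int_prefix_dynamic_template(), indent=2))
```

The resulting JSON could be supplied when an index is created or placed in an index template, so fields such as int_status are typed before the first document arrives.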

Slide 12

Slide 12 text

How?
Data in/Data out - Transit
● Open internet? SSL
  ○ Fluentd and logstash have this
  ○ Use some HA designs to avoid loss (e.g. archive all to S3, define multiple log endpoints)
● Enrich the data!
  ○ GeoIP, timestamp parsing, tagging, etc.
● Log passing
  ○ For most needs, just use native input/output plugins
  ○ Possible to use native fluentd/lumberjack protocols
  ○ For native application calls? Either route stdout to log files or use message broker
● Avoid memory buffering, keep data safe!

Slide 13

Slide 13 text

How?
Data in/Data out - Storage/Retrieval
● Kibana
  ○ Either talk to local ES node or remote (local is nice for LB, but isn’t free)
  ○ Basic auth if needed (K4 passthrough)
● Beware cluster-killers
  ○ Huge time span facets/aggregations on analyzed fields
  ○ Way too much resident data for cluster size
  ○ Field lists that grow out of control (personal gripe)
● Devs will find new and creative ways to break it (don’t shoot yourself in the foot)

Slide 14

Slide 14 text

What Could Go Wrong?
Or, Preparing For the Worst: An Ops Tale

Slide 15

Slide 15 text

What Could Go Wrong?
Unresponsiveness
● How to fix
  ○ See following slides on OOM
  ○ Decrease shard number - either change defaults or expire data
  ○ Get some RAID going on, either hardware or application
  ○ ES analytics (bigdesk, hot threads, caches)
Taking time to tweak usage patterns and data schemas will go a long way. Use doc_values, dynamic mappings.
Most often OOM, which takes us to...
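
Two of the fixes above, fewer shards and doc_values, can both live in an index template. A minimal sketch of such a template body, assuming Elasticsearch 1.x template and mapping syntax; the `logstash-*` pattern, the `host` field, and the function name are illustrative assumptions rather than anything prescribed in the talk:

```python
import json


def low_shard_template(pattern="logstash-*", shards=1):
    """Build an index template body with a reduced shard count and an
    exact-match string field backed by doc_values."""
    return {
        "template": pattern,  # index name pattern the template applies to
        "settings": {"number_of_shards": shards},
        "mappings": {
            "_default_": {
                "properties": {
                    "host": {  # hypothetical field name
                        "type": "string",
                        # Skip analysis so aggregations see whole values.
                        "index": "not_analyzed",
                        # Keep field data on disk instead of in the heap.
                        "doc_values": True,
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(low_shard_template(), indent=2))
```

A body like this would be registered under a template name of your choosing and only affects indices created afterwards, which is why it pairs naturally with time-based indices that rotate.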

Slide 16

Slide 16 text

What Could Go Wrong?
OOM
● How to tell
  ○ Unresponsive nodes, slow queries
  ○ Tail the logs and watch it happen
● How to fix
  ○ ES_HEAP_SIZE to 50% of RAM, max 30GB
  ○ Make intelligent use of units of scale (shards, indices, etc.)
  ○ Spend a day reading the guide and tune usage patterns (doc_values, analyzed versus non, decrease field count, etc.)
  ○ Best practices will do a lot, scale out if there’s not much else to optimize
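
The heap sizing rule above (half of RAM, capped at 30GB) is simple arithmetic. A minimal sketch, with hypothetical function names, that works in megabytes so small machines do not round down to zero:

```python
HEAP_CAP_MB = 30 * 1024  # the 30GB ceiling from the slide


def recommended_heap_mb(total_ram_mb):
    """Return half of total RAM in MB, capped at 30GB."""
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    return min(total_ram_mb // 2, HEAP_CAP_MB)


def heap_env_line(total_ram_mb):
    """Format the result as an ES_HEAP_SIZE assignment."""
    return "ES_HEAP_SIZE={}m".format(recommended_heap_mb(total_ram_mb))


if __name__ == "__main__":
    for ram_mb in (16 * 1024, 64 * 1024, 128 * 1024):
        print(ram_mb, heap_env_line(ram_mb))
```

The other half of RAM is left to the operating system, which is what the filesystem cache relies on.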

Slide 17

Slide 17 text

What Could Go Wrong?
I/O
● How to tell
  ○ CPU iowait times
● How to fix
  ○ Keep RAM balance 50/50 for lucene FS caches
  ○ RAID!
    ■ Either hardware or application-level
    ■ Gets you a cheap stripe, though SSDs will be easier
  ○ Scale out for parallelized reads
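
For the "how to tell" step, iowait can be read from /proc/stat on Linux. A minimal sketch, assuming the standard field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical, and tools such as iostat or top report the same figure:

```python
def cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Keep the first eight counters; the guest counters that follow
            # are already included in user/nice time.
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Percent of CPU time spent in iowait between two /proc/stat samples."""
    t0, t1 = cpu_times(before), cpu_times(after)
    deltas = [later - earlier for earlier, later in zip(t0, t1)]
    total = sum(deltas)
    # iowait is the fifth counter, i.e. index 4.
    return 100.0 * deltas[4] / total if total else 0.0
```

In practice the two samples would be the contents of /proc/stat read a second or so apart; a persistently high percentage on data nodes points at disks rather than heap.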

Slide 18

Slide 18 text

What Could Go Wrong?
Disk Space (eventually)
● How to tell
  ○ Full disks?
  ○ When Elasticsearch stops allocating shards to full nodes
● How to fix
  ○ Snapshot indices to S3 and delete
  ○ Good workflow:
    ■ Optimize rotated indices -> close -> snapshot -> delete
  ○ ES is space-aware and will try to keep a cluster balanced space-wise
  ○ Alternatively, just scale out
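
The retirement workflow above maps onto four REST calls. A minimal sketch that only builds the request list, in the slide's order, assuming Elasticsearch 1.x endpoint names (`_optimize` was renamed in later releases); the index, repository, and snapshot names are placeholders, and how snapshots treat closed indices varies by version, so check the snapshot documentation for yours before automating this:

```python
def retirement_steps(index, repository, snapshot):
    """Return (method, path, body) tuples for retiring one rotated index."""
    return [
        # 1. Merge segments so the index is as compact as possible.
        ("POST", "/{}/_optimize?max_num_segments=1".format(index), None),
        # 2. Close the index so it stops consuming cluster resources.
        ("POST", "/{}/_close".format(index), None),
        # 3. Snapshot the index into a pre-registered repository (e.g. S3).
        (
            "PUT",
            "/_snapshot/{}/{}?wait_for_completion=true".format(repository, snapshot),
            {"indices": index},
        ),
        # 4. Delete the index once the snapshot has completed.
        ("DELETE", "/{}".format(index), None),
    ]


if __name__ == "__main__":
    for method, path, body in retirement_steps(
        "logstash-2015.05.01", "s3_backup", "logstash-2015.05.01"
    ):
        print(method, path, body)
```

Keeping the plan as data makes it easy to log or dry-run before handing each tuple to an HTTP client.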

Slide 19

Slide 19 text

Q&A
Questions?

Slide 20

Slide 20 text

Additional Resources
Information
● Elasticsearch documentation
  ○ www.elastic.co/guide
  ○ Elasticsearch - The Definitive Guide - for in-depth learning
  ○ Official documentation, API docs, etc.
  ○ Client library docs (javascript, ruby, python, java, php)
● Get involved in the ES community
  ○ www.elastic.co/community/meetups
  ○ SLC Meetup!
● Give feedback at: https://joind.in/talk/view/14000