Slide 1

February 28, 2018
Bigger, Faster, Stronger: Leveling Up Enterprise Logging
David Sarmanian, General Dynamics IT, Elastic Solutions Lead
Jared McQueen, McQueen Solutions, Principal Systems Engineer

Slide 2

No content

Slide 3

What problems are we really trying to solve? Too much data!!!

Slide 4

Problems:
• Slow data ingest
• Slow query times
• Single points of failure
• Visualizations
• Very costly to scale

Slide 5

What is the next solution?

Slide 6

Customer Requirements/Wants
• 5+ years searchable
• 18 months online (hot)
• Petabyte search
• Simultaneous queries
• High sustained EPS
• Ingest automation
• Fast data queries
• Redundancy
• Visualization and metrics dashboards
• Actionable data

Slide 7

Time to pull our heads out of the sand. What else is out there?
• Magic Quadrant
• Current vendors
• Recommended vendors
• Other

Slide 8

Here is the plan, by the phases:
1. Requirements (usability and security)
2. Order/Buy
3. Build
4. Test: migrate historical data and build a streaming capability; regression and user testing
5. Deploy

Slide 9

No content

Slide 10

Key Takeaways
• Now able to find the data in milliseconds instead of hours
• Orchestration increases uptime and lowers O&S costs
• Hundreds of users use Elastic without any performance issues
• Actionable visualization through the use of Kibana
• Control cloud costs by keeping 6 months hot and 5 years searchable

Slide 11

Let’s Get Technical

Slide 12

Quick Stats
• 2.6 million indexed docs/second
• Delivered a long-term search capability: 5+ years searched in seconds (IOI)
• 10+ PB of logs indexed

Slide 13

Elastic Engineering Core Principles ("works 60% of the time, every time")

Work backwards
• START by defining the end goal. Be specific.
• Engineer in reverse.

Understand constraints
• Chase bottlenecks.
• "Instrument everything."
• Don't be afraid to go deep.

Scale up
• Only after optimization.
• Get parallel: network, compute, storage.

Slide 14

GETTING DATA OUT (using the 3 principles)

Slide 15

Getting the data OUT: working backwards, define the end goal.
Be as specific as possible: compressed, multi-line JSON files saved to an on-prem SAN.

Slide 16

Getting the data OUT: bottlenecks everywhere.
PL/SQL:
• Disk I/O contention
• Lacked valid JSON output
• Slow cursors
• Slow concurrency
Result: ~700 EPS. Let's try Logstash.

Slide 17

Getting the data OUT: more bottlenecks.
Logstash pipeline: logstash-input-jdbc, logstash-codec-json, logstash-filter-metrics, logstash-output-file.
Result: ~1,500 EPS. Can we do better?

Slide 18

Getting the data OUT: more bottlenecks.
Use the tools at your disposal:
• Node Stats
• Hot Threads
• htop, nload, iostat
• VisualVM to profile the heap
Findings: slow JDBC transfers; JSON serialization was expensive. Let's write our own app.

Slide 19

Getting the data OUT: custom Python app.
• cx_Oracle
• Great JSON support
• Complete, granular control
Results were impressive:
• 1 core: ~10k EPS
• 40 cores, multiprocessed and threaded: ~100,000 EPS
Now scale up!
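The core trick was cheap, parallel row-to-JSON serialization. A minimal sketch of that step, with hypothetical rows and column names standing in for the cx_Oracle cursor results (the real schema is not shown in the deck), and threads standing in for the app's per-core worker processes:

```python
import gzip
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows standing in for a cx_Oracle cursor's fetchmany()
# results; the column names are illustrative, not from the real schema.
ROWS = [(1, "jmcqueen", "10.0.0.1"), (2, "dsarmanian", "10.0.0.2")]
COLUMNS = ("event_id", "user_name", "source_ip")

def row_to_json(row):
    # Serialize one row to a JSON line.
    return json.dumps(dict(zip(COLUMNS, row)))

def export(rows, workers=4):
    # Fan serialization out across a worker pool (threads here to keep
    # the sketch self-contained; the real app was also multiprocessed,
    # roughly one worker per core).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        lines = list(pool.map(row_to_json, rows))
    # Compressed, multi-line JSON: the end goal defined up front.
    return gzip.compress("\n".join(lines).encode("utf-8"))

blob = export(ROWS)
```

With granular control over batch size and worker count, each box can be tuned independently before scaling out.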

Slide 20

Getting the data OUT: open the flood gates.
Result across 10 boxes: ~1 million EPS.
A quick word on scaling:

Slide 21

Getting the data OUT: a quick word on scaling.
Be careful! You can:
• impact infrastructure
• degrade networks
• disrupt storage
• cascade compute failures
• stress hardware*
*Trust us: pay attention to high/critical temperature warnings from sensors on bare-metal boxes.

Slide 22

GETTING DATA IN
Early ~1.25M EPS test speed run

Slide 23

Getting the data IN: cloud architecture (JSON files > bulk ingestors > Elasticsearch coordinating nodes).
High-level challenges in the cloud:
• Network bottlenecks
• Reliability not guaranteed (many outages)
• Getting data *into* cloud storage
Absolutely use an orchestration tool!

Slide 24

Getting the data IN: cloud architecture.
Many years' worth of compressed JSON files sitting in S3 buckets.

Slide 25

Getting the data IN: cloud architecture.
Python app to pull down compressed JSON and ship it to Elastic:
• (32) m4.xl instances (4 vCPU / 16 GB / high network)
• boto3, elasticsearch-py
• Built for speed: multiprocessing, threading
• 2:1 ratio of ingest instances to coordinating nodes
Bulk ingestors send to Elasticsearch data nodes through a dedicated coordinating node.
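The pull-and-parse step of that app can be sketched as follows. This is an assumption-laden sketch: in the real pipeline the bytes would come from a boto3 `s3.get_object` call, but here they are passed in so the example runs without AWS credentials, and the index name is made up.

```python
import gzip
import json

def docs_from_s3_object(gz_bytes, index_name):
    # Turn one compressed NDJSON object into bulk-ready actions.
    # In the real pipeline gz_bytes would come from a boto3
    # s3.get_object(...)["Body"].read() call; it is passed in here
    # so the sketch stays self-contained.
    for line in gzip.decompress(gz_bytes).decode("utf-8").splitlines():
        if line.strip():
            yield {"_index": index_name, "_source": json.loads(line)}

# In-memory bytes standing in for an S3 download:
payload = gzip.compress(b'{"userName": "jmcqueen"}\n{"userName": "dsarmanian"}')
actions = list(docs_from_s3_object(payload, "logs-2013.11.13"))
```

Because this yields a generator of actions, it plugs straight into a bulk helper without holding a whole file's documents in memory.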

Slide 26

Getting the data IN: optimizing elasticsearch-py.

es = Elasticsearch(maxsize=4)   # limit connections
results = helpers.parallel_bulk(
    client=coord_node,          # dedicated coordinating node
    actions=getDocs(),          # use a generator
    thread_count=8,             # size of threadpool
    chunk_size=2500,            # batch size
)

See the elasticsearch-py docs for more info.

Slide 27

Getting the data IN: cloud architecture.
Data nodes: hundreds of d2.2xl instances, 12 TB storage per box.
Coordinating nodes: (16) m4.xl instances (4/16/high). They unpack and route bulk requests: moderately CPU-intensive, highly network-bound.
Orchestration: Salt, CentOS 7, ES 5.6, X-Pack.

Slide 28

Getting the data IN: data retention strategies.
• Archive: index historical data > snapshots > IOI > delete (optimized for ingest)
• Recent (within X days): index with replicas; keep online (balanced between ingest, usability, reliability)

Slide 29

Getting the data IN: getting fast ingest speeds.
High indexing rates require optimizing settings and mappings. Index settings for archive data:

"settings": {
  "index": {
    "codec": "best_compression",
    "refresh_interval": "-1",               // disable refresh
    "number_of_shards": "10",               // shoot for 40-60 GB shards
    "number_of_replicas": "0",              // danger! danger!
    "translog.durability": "async",         // fsync commits in the background
    "merge.scheduler.max_thread_count": "1" // spinning disks
  }
}

Caution! These settings are for short-term indexing only. Don't use them for 30-, 60-, or 90-day production data.
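An assumed follow-up step (not shown on the slide): once a historical bulk load finishes, flip the risky archive settings back before the index serves users. The exact values below are illustrative defaults, not the deck's numbers.

```python
# Settings body to restore safer behavior after the bulk load;
# the specific values here are illustrative assumptions.
post_load_settings = {
    "index": {
        "refresh_interval": "30s",         # re-enable refresh
        "number_of_replicas": "1",         # restore redundancy
        "translog.durability": "request",  # fsync per request again
    }
}
# With elasticsearch-py this would be roughly:
#   es.indices.put_settings(index="logs-2013.11.13", body=post_load_settings)
```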

Slide 30

Get the data in: enable case insensitivity.
Always normalize keywords: jmcqueen != jMcQueen. For fast ingest, limit analyzed fields. Use templates to define your index settings/mappings.

index settings:
"normalizer": {
  "to_lower": {
    "type": "custom",
    "filter": [ "lowercase", "asciifolding" ],
    "char_filter": []
  }
}

index mappings:
"properties": {
  "sourceUserName": {
    "type": "keyword",
    "normalizer": "to_lower"
  }
}
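To see what the normalizer buys you, here is a pure-Python approximation (an assumption for illustration, not the Elasticsearch implementation) of lowercase plus asciifolding:

```python
import unicodedata

def to_lower(value):
    # Approximation of the "to_lower" normalizer: strip diacritics
    # after NFKD decomposition (asciifolding), then lowercase.
    folded = unicodedata.normalize("NFKD", value)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return folded.lower()

print(to_lower("jMcQueen"))  # "jmcqueen"
```

Applied at index time, this is why jMcQueen and jmcqueen land on the same keyword term and match the same queries.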

Slide 31

INDEX OF INDEX (AKA metadata AKA rollup, but IOI just sounds cooler)

Slide 32

IOI: enterprise logging invites interesting questions.
When your system contains such a rich and wonderful data set:
Cyber Analysts:
• When was the first time we ever saw communication with an IP address?
• Approximate time and frequency of unusual port activity?
Network Operations:
• When was the last time a network device was updated or patched?
• When was a box last seen on the network?
Secret Squirrels / Information Assurance:
• Requests for user-attributable data
• Validate security controls for systems
The usual response: "We can only go back X days."

Slide 33

IOI: enables historical search.
It's expensive to keep 5+ years of enterprise logging online and searchable ("hot"/"warm"). After X days, daily indices are snapshotted to "cold" storage, yet analysts most often want to search since the beginning of time. IOI allows us to quickly search metadata about documents stored in offline indices, and serves as a starting point for further investigation and rehydration.

Slide 34

IOI: general guidelines.
Engage with your analytical groups. Establish a list of valuable, low-cardinality fields:
• source and dest IP
• source and dest ports
• user names
• host, network (enclave) names
• email address
• event IDs
Note: 'event name' is a poor choice (high cardinality).

Slide 35

Get the data in: high-level overview (daily index > run IOI aggs > snapshot old index > alias new index).
Iterate through your list of "high-value" fields:
• Run a terms aggregation.
• Get the values and counts.
• Store them back into a new index for that day, with a note in the doc that the data is archived.
• Add the newly created summary index to the enterprise logging alias.
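The steps above can be sketched in Python. This is a hedged sketch: the aggregation name `vals`, the sample response, and the field list are illustrative, not from the production app, though the response follows the shape of an Elasticsearch terms-aggregation result.

```python
def ioi_terms_query(field, size=10000):
    # Terms-aggregation body for one high-value, low-cardinality field;
    # size=0 skips hits, we only want the buckets.
    return {
        "size": 0,
        "aggs": {"vals": {"terms": {"field": field, "size": size}}},
    }

def buckets_to_summary_docs(agg_response, field, day):
    # Flatten the terms-agg buckets into tiny metadata docs for the
    # IOI index, each noting that the underlying data is archived.
    for bucket in agg_response["aggregations"]["vals"]["buckets"]:
        yield {
            "@timestamp": day,
            field: bucket["key"],
            "count": bucket["doc_count"],
            "note": "stored in archive",
        }

# A hypothetical response shaped like an Elasticsearch terms-agg result:
resp = {"aggregations": {"vals": {"buckets": [{"key": "jmcqueen", "doc_count": 3}]}}}
docs = list(buckets_to_summary_docs(resp, "userName", "2013-11-13"))
```

Each summary doc is tiny, so years of daily summaries stay cheap to keep hot while the raw indices sit in snapshots.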

Slide 36

Get the data in: high-level overview. An example summary doc:

"_source": {
  "@timestamp": "2013-11-13",
  "userName": "jmcqueen",
  "count": 3,
  "note": "stored in archive"
}

Slide 37

More Questions? Visit us at the AMA.

Slide 38

www.elastic.co

Slide 39

Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nd/4.0/. Creative Commons and the double C in a circle are registered trademarks of Creative Commons in the United States and other countries. Third-party marks and brands are the property of their respective holders. Please attribute Elastic with a link to elastic.co.